---
library_name: hftrainer
pipeline_tag: other
tags:
- motion-generation
- text-to-motion
- humanml3d
- vimogen
- dart276
- smpl
license: other
---

<!-- This model card is synchronized from docs/model_zoo/vimogen.md by tools/sync_model_zoo_cards.py. -->

# ViMoGen

Text-to-motion baseline integrated into the hftrainer Model Zoo. The runtime is
hftrainer-native and does not import the upstream repository at inference time:
the released ViMoGen transformer, scheduler, smoothing step, and DART276 motion
representation bridge live under `hftrainer.models.motion.vimogen` and
`hftrainer.motion.representation.dart276`.

| | |
|---|---|
| **Task** | Text-to-Motion (T2M) |
| **Bundle / Pipeline** | `ViMoGenBundle` / `ViMoGenPipeline` |
| **Processed HF artifact** | [`ZeyuLing/hftrainer-vimogen-1.3b-humanml3d`](https://huggingface.co/ZeyuLing/hftrainer-vimogen-1.3b-humanml3d) |
| **Motion representation** | **DART276** (276-dim, 20 fps), decoded to SMPL `motion_135` for mesh visualization and evaluator bridges |
| **Backbone** | WanVideoTM2M 1.3B flow-matching DiT, 50 inference steps |
| **Text encoder** | Wan2.1 T2V-1.3B UMT5-XXL encoder |
| **Paper** | *ViMoGen: Scaling Full-Body Human Motion Generation through Visual Generative Priors* |
| **Original code** | https://github.com/MotrixLab/ViMoGen |

---

## Weights

Current hftrainer artifact:

| Artifact | Location | Contents | Status |
|---|---|---|---|
| ViMoGen-DiT 1.3B HumanML3D | [`ZeyuLing/hftrainer-vimogen-1.3b-humanml3d`](https://huggingface.co/ZeyuLing/hftrainer-vimogen-1.3b-humanml3d) | `model.pt` + `model_index.json` + `assets/meta/{mean,std}.npy` | public Hub artifact |

Load it through the same `from_pretrained` interface as the other reproduced
baselines:

```python
from hftrainer.pipelines.vimogen import ViMoGenPipeline

pipe = ViMoGenPipeline.from_pretrained(
    "ZeyuLing/hftrainer-vimogen-1.3b-humanml3d",
    device="cuda",
)

motions_276 = pipe.infer_t2m(
    ["Full-body shot, stable camera. A person walks forward at an average pace."],
    [200],
    seed=0,
)
```

`ViMoGenBundle.from_pretrained` reads `model_index.json`. If the Wan2.1 base
assets are not already available locally, the bundle resolves the public
`Wan-AI/Wan2.1-T2V-1.3B` Hub repo declared by `wan_repo_id`.

---

## Motion Representation

ViMoGen emits **DART276**, the global DART-style representation:

```
text -> UMT5-XXL embeddings -> WanVideoTM2M DiT -> denormalized DART276
     -> dart276_to_motion135(...) for SMPL mesh / MotionCLIP / physics
     -> motion135_to_motion272(...) for MotionStreamer-272 evaluator
```

The public conversion API is:

```python
from hftrainer.motion.representation.dart276 import dart276_to_motion135

motion_135 = dart276_to_motion135(motion_276, rotation_convention="row")
```

See `docs/motion/representations.md` for the DART276 channel layout and the
root / coordinate-system convention.
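
The "denormalized DART276" step in the flow above presumably undoes a per-channel normalization using the `mean.npy` / `std.npy` statistics shipped under `assets/meta/`. The following is a minimal NumPy sketch under that assumption (standard z-score statistics, features laid out as `(frames, 276)`). `denormalize_dart276` and `duration_seconds` are hypothetical helpers written for illustration, not part of the hftrainer API.

```python
import numpy as np

DART276_DIM = 276   # feature width stated in this card
DART276_FPS = 20.0  # frame rate stated in this card


def denormalize_dart276(normalized, mean, std):
    """Map z-scored features back to DART276 units.

    Assumption: the released statistics are per-channel mean / std, so the
    inverse transform is x * std + mean. Hypothetical helper.
    """
    normalized = np.asarray(normalized, dtype=np.float32)
    mean = np.asarray(mean, dtype=np.float32)
    std = np.asarray(std, dtype=np.float32)
    if normalized.shape[-1] != DART276_DIM:
        raise ValueError(
            f"expected last dim {DART276_DIM}, got {normalized.shape[-1]}"
        )
    # Broadcasting applies the (276,) statistics to every frame.
    return normalized * std + mean


def duration_seconds(num_frames, fps=DART276_FPS):
    """Clip length in seconds for a given number of frames."""
    return num_frames / fps
```

In practice the statistics would be loaded with `np.load` from the downloaded artifact; the pipeline already performs this step internally, so the helper is only useful for inspecting intermediate tensors.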

---

## Evaluation

The leaderboard row uses the HumanML3D official-test split (`n=4042`) and the
shared corrected caption set. ViMoGen is sensitive to terse HumanML3D captions,
so generation uses a ViMoGen-style prompt rewrite derived from the corrected
caption. The rewrite adds presentation/context details such as camera, floor,
and motion-capture clothing while preserving the original action content. The
semantic evaluators are still computed against the same corrected HumanML3D
caption protocol used by the other methods.
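
To make the idea of the rewrite concrete, here is a deliberately simple, template-based sketch. It only prepends a fixed context sentence (taken from the example prompt in the Weights section) to a cleaned-up caption. The actual rewrite workflow used for the leaderboard may be considerably richer; `rewrite_caption` and its default context string are illustrative assumptions, not the repository's implementation.

```python
def rewrite_caption(caption: str,
                    context: str = "Full-body shot, stable camera.") -> str:
    """Prefix a caption with presentation context, keeping the action text.

    Hypothetical helper: the template below is an assumption for illustration.
    """
    # Collapse stray whitespace so the prompt is a single clean sentence.
    action = " ".join(caption.split())
    if not action:
        raise ValueError("caption is empty")
    # Capitalize the first character and make sure the sentence is terminated.
    action = action[0].upper() + action[1:]
    if action[-1] not in ".!?":
        action += "."
    return f"{context} {action}"
```

The evaluators would still receive the original corrected caption; only the text passed to the generator is rewritten.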

### MotionStreamer-272 and MotionCLIP

| Evaluator | R@1 | R@2 | R@3 | FID | MM-Dist | Diversity |
|---|---:|---:|---:|---:|---:|---:|
| MotionStreamer-272 (HML round-trip GT) | 0.4291 | 0.5687 | 0.6518 | 152.2095 | 21.0737 | 24.1803 |
| MotionCLIP-135 no-L2 (HML round-trip GT) | 0.3572 | 0.4992 | 0.5893 | 457.5443 | 44.4103 | 21.6806 |
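
For readers unfamiliar with the R@k columns, the sketch below shows one common way such retrieval scores are computed: each text embedding is ranked against a pool of motion embeddings, and R@k is the fraction of texts whose matching motion falls in the top k. It assumes Euclidean distance and that row `i` of both arrays belongs to the same sample; the pool size and distance actually used by the evaluators are not stated in this card, so treat `recall_at_k` as an illustrative helper only.

```python
import numpy as np


def recall_at_k(text_emb, motion_emb, ks=(1, 2, 3)):
    """Top-k text-to-motion retrieval accuracy over one pool.

    Assumptions: Euclidean distance; text_emb[i] is paired with motion_emb[i].
    """
    text_emb = np.asarray(text_emb, dtype=np.float64)
    motion_emb = np.asarray(motion_emb, dtype=np.float64)
    if text_emb.shape != motion_emb.shape:
        raise ValueError("text and motion embeddings must have the same shape")
    # Pairwise distances: entry (i, j) compares text i with motion j.
    diff = text_emb[:, None, :] - motion_emb[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)
    # Sort candidate motions per text, then locate the ground-truth index.
    order = np.argsort(dist, axis=1)
    targets = np.arange(text_emb.shape[0])[:, None]
    ranks = np.argmax(order == targets, axis=1)
    return {k: float(np.mean(ranks < k)) for k in ks}
```

The function returns a dictionary keyed by k, so `recall_at_k(t, m)[1]` corresponds to the R@1 column under the stated assumptions.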

### Physical Diagnostics

| Slide | Float | Jitter | Dynamic | Penet |
|---:|---:|---:|---:|---:|
| 6.9485 | 23.7270 | 4.4370 | 16.3838 | 0.0000 |

---

## Implementation Notes

- **hftrainer-native runtime**: `hftrainer.models.motion.vimogen.network` vendors
  the required ViMoGen transformer modules and scheduler.
- **No `ref_repo` dependency**: full-set HumanML3D inference uses
  `scripts/eval/vimogen_t2m_humanml3d.py` with `ViMoGenBundle` /
  `ViMoGenPipeline`.
- **Prompt sensitivity**: for leaderboard-quality generation, use the
  ViMoGen-style prompt rewrite workflow before inference. The plain corrected
  HumanML3D captions produce substantially weaker text following.
- **Evaluator bridge**: DART276 outputs are converted to repository
  `motion_135`, then to MotionStreamer-272 or MotionCLIP-135 for the shared
  cross-model leaderboard protocol.