---
library_name: hftrainer
pipeline_tag: other
tags:
- motion-generation
- text-to-motion
- humanml3d
- vimogen
- dart276
- smpl
license: other
---
<!-- This model card is synchronized from docs/model_zoo/vimogen.md by tools/sync_model_zoo_cards.py. -->
# ViMoGen
Text-to-motion baseline integrated into the hftrainer Model Zoo. The runtime is
hftrainer-native and does not import the upstream repository at inference time:
the released ViMoGen transformer, scheduler, smoothing step, and DART276 motion
representation bridge live under `hftrainer.models.motion.vimogen` and
`hftrainer.motion.representation.dart276`.
| | |
|---|---|
| **Task** | Text-to-Motion (T2M) |
| **Bundle / Pipeline** | `ViMoGenBundle` / `ViMoGenPipeline` |
| **Processed HF artifact** | [`ZeyuLing/hftrainer-vimogen-1.3b-humanml3d`](https://huggingface.co/ZeyuLing/hftrainer-vimogen-1.3b-humanml3d) |
| **Motion representation** | **DART276** (276-dim, 20 fps), decoded to SMPL `motion_135` for mesh visualization and evaluator bridges |
| **Backbone** | WanVideoTM2M 1.3B flow-matching DiT, 50 inference steps |
| **Text encoder** | Wan2.1 T2V-1.3B UMT5-XXL encoder |
| **Paper** | *ViMoGen: Scaling Full-Body Human Motion Generation through Visual Generative Priors* |
| **Original code** | https://github.com/MotrixLab/ViMoGen |
---
## Weights
Current hftrainer artifact:
| Artifact | Location | Contents | Status |
|---|---|---|---|
| ViMoGen-DiT 1.3B HumanML3D | [`ZeyuLing/hftrainer-vimogen-1.3b-humanml3d`](https://huggingface.co/ZeyuLing/hftrainer-vimogen-1.3b-humanml3d) | `model.pt` + `model_index.json` + `assets/meta/{mean,std}.npy` | public Hub artifact |
Load through the same `from_pretrained` surface as the other reproduced
baselines:
```python
from hftrainer.pipelines.vimogen import ViMoGenPipeline

pipe = ViMoGenPipeline.from_pretrained(
    "ZeyuLing/hftrainer-vimogen-1.3b-humanml3d",
    device="cuda",
)
motions_276 = pipe.infer_t2m(
    ["Full-body shot, stable camera. A person walks forward at an average pace."],
    [200],
    seed=0,
)
```
`ViMoGenBundle.from_pretrained` reads `model_index.json`. If the Wan2.1 base
assets are not already available locally, the bundle resolves the public
`Wan-AI/Wan2.1-T2V-1.3B` Hub repo declared by `wan_repo_id`.
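
The fallback described above can be sketched as follows. This is only an illustration under assumptions: `resolve_wan_source` and its `local_dir` argument are hypothetical names, not part of hftrainer, and the real bundle may detect local assets differently. The sketch also assumes `wan_repo_id` is a top-level key of `model_index.json`; only that key name and the `Wan-AI/Wan2.1-T2V-1.3B` repo id come from this card.

```python
import json
import os
import tempfile


def resolve_wan_source(model_index_path, local_dir=None):
    """Return a local directory if it holds Wan2.1 assets, else the Hub repo id.

    Hypothetical helper: mirrors the documented fallback, not hftrainer's code.
    """
    with open(model_index_path, "r", encoding="utf-8") as f:
        model_index = json.load(f)
    # Prefer a non-empty local directory when one is supplied.
    if local_dir and os.path.isdir(local_dir) and os.listdir(local_dir):
        return local_dir
    # Otherwise fall back to the Hub repo declared in model_index.json.
    return model_index["wan_repo_id"]


# Demo with a throwaway model_index.json that contains only the documented key.
with tempfile.TemporaryDirectory() as tmp:
    index_path = os.path.join(tmp, "model_index.json")
    with open(index_path, "w", encoding="utf-8") as f:
        json.dump({"wan_repo_id": "Wan-AI/Wan2.1-T2V-1.3B"}, f)
    resolved_source = resolve_wan_source(index_path)
    print(resolved_source)
```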
---
## Motion Representation
ViMoGen emits **DART276**, the global DART-style representation:
```
text -> UMT5-XXL embeddings -> WanVideoTM2M DiT -> denormalized DART276
     -> dart276_to_motion135(...) for SMPL mesh / MotionCLIP / physics
     -> motion135_to_motion272(...) for MotionStreamer-272 evaluator
```
The public conversion API is:
```python
from hftrainer.motion.representation.dart276 import dart276_to_motion135
motion_135 = dart276_to_motion135(motion_276, rotation_convention="row")
```
See `docs/motion/representations.md` for the DART276 channel layout and the
root / coordinate-system convention.
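
Because DART276 runs at 20 fps, clip lengths convert between seconds and frames with simple arithmetic. The helpers below are hypothetical (they are not part of hftrainer), and reading the `[200]` in the loading example as a frame count is an assumption.

```python
DART276_FPS = 20   # frame rate stated for DART276 in this card
DART276_DIM = 276  # channels per frame


def seconds_to_frames(seconds, fps=DART276_FPS):
    """Number of frames covering `seconds`, rounded to the nearest frame."""
    return int(round(seconds * fps))


def frames_to_seconds(num_frames, fps=DART276_FPS):
    """Duration in seconds of a clip with `num_frames` frames."""
    return num_frames / fps


print(frames_to_seconds(200))  # 10.0
print(seconds_to_frames(4.5))  # 90
```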
---
## Evaluation
The leaderboard row uses the HumanML3D official-test split (`n=4042`) and the
shared corrected caption set. ViMoGen is sensitive to terse HumanML3D captions,
so generation uses a ViMoGen-style prompt rewrite derived from the corrected
caption. The rewrite adds presentation/context details such as camera, floor,
and motion-capture clothing while preserving the original action content. The
semantic evaluators are still computed against the same corrected HumanML3D
caption protocol used by the other methods.
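
A minimal sketch of what such a rewrite might look like is given below. It uses a fixed template, which is an assumption made for illustration: this card does not say how the rewrite is produced, and `rewrite_caption` is a hypothetical name. Only the `Full-body shot, stable camera.` prefix is taken from the example prompt in the Weights section.

```python
def rewrite_caption(caption, context="Full-body shot, stable camera."):
    """Prepend presentation context to a caption without altering the action text."""
    action = caption.strip()
    # Capitalize the first letter and make sure the sentence ends with punctuation.
    if action:
        action = action[0].upper() + action[1:]
    if action and action[-1] not in ".!?":
        action += "."
    return f"{context} {action}".strip()


rewritten = rewrite_caption("a person walks forward at an average pace")
print(rewritten)
```

Under this scheme only the generation prompt changes; the evaluators still receive the original corrected caption.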
### MotionStreamer-272 and MotionCLIP
| Evaluator | R@1 | R@2 | R@3 | FID | MM-Dist | Diversity |
|---|---:|---:|---:|---:|---:|---:|
| MotionStreamer-272 (HML round-trip GT) | 0.4291 | 0.5687 | 0.6518 | 152.2095 | 21.0737 | 24.1803 |
| MotionCLIP-135 no-L2 (HML round-trip GT) | 0.3572 | 0.4992 | 0.5893 | 457.5443 | 44.4103 | 21.6806 |
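
For readers unfamiliar with the R@k columns, the sketch below shows the usual top-k retrieval accuracy: a motion counts as a hit if its own caption is among the k nearest text embeddings. It is a generic illustration in plain Python. `recall_at_k` is a hypothetical name, and the distance function, candidate pool size, and batching used by the MotionStreamer-272 and MotionCLIP-135 evaluators are not specified in this card.

```python
def recall_at_k(distances, k):
    """Fraction of rows whose matching column (same index) ranks in the top k.

    distances[i][j] is the distance between motion i and caption j;
    the ground-truth pair for motion i is caption i.
    """
    hits = 0
    for i, row in enumerate(distances):
        # Indices of captions sorted from nearest to farthest.
        ranked = sorted(range(len(row)), key=lambda j: row[j])
        if i in ranked[:k]:
            hits += 1
    return hits / len(distances)


# Toy 3x3 example: motions 0 and 2 retrieve their caption first, motion 1 second.
toy = [
    [0.1, 0.5, 0.9],
    [0.2, 0.3, 0.8],
    [0.7, 0.6, 0.4],
]
print(recall_at_k(toy, 1))
print(recall_at_k(toy, 2))
```
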
### Physical Diagnostics
| Slide | Float | Jitter | Dynamic | Penet |
|---:|---:|---:|---:|---:|
| 6.9485 | 23.7270 | 4.4370 | 16.3838 | 0.0000 |
---
## Implementation Notes
- **hftrainer-native runtime**: `hftrainer.models.motion.vimogen.network` vendors
the required ViMoGen transformer modules and scheduler.
- **No `ref_repo` dependency**: full-set HumanML3D inference uses
`scripts/eval/vimogen_t2m_humanml3d.py` with `ViMoGenBundle` /
`ViMoGenPipeline`.
- **Prompt sensitivity**: for leaderboard-quality generation, use the
ViMoGen-style prompt rewrite workflow before inference. The plain corrected
HumanML3D captions produce substantially weaker text following.
- **Evaluator bridge**: DART276 outputs are converted to repository
`motion_135`, then to MotionStreamer-272 or MotionCLIP-135 for the shared
cross-model leaderboard protocol.