Upload hftrainer ViMoGen 1.3B HumanML3D artifact
- README.md +129 -0
- assets/meta/mean.npy +3 -0
- assets/meta/std.npy +3 -0
- model.pt +3 -0
- model_index.json +13 -0
README.md
ADDED
@@ -0,0 +1,129 @@
---
library_name: hftrainer
pipeline_tag: other
tags:
- motion-generation
- text-to-motion
- humanml3d
- vimogen
- dart276
- smpl
license: other
---

<!-- This model card is synchronized from docs/model_zoo/vimogen.md by tools/sync_model_zoo_cards.py. -->

# ViMoGen

Text-to-motion baseline integrated into the hftrainer Model Zoo. The runtime is
hftrainer-native and does not import the upstream repository at inference time:
the released ViMoGen transformer, scheduler, smoothing step, and DART276 motion
representation bridge live under `hftrainer.models.motion.vimogen` and
`hftrainer.motion.representation.dart276`.

| | |
|---|---|
| **Task** | Text-to-Motion (T2M) |
| **Bundle / Pipeline** | `ViMoGenBundle` / `ViMoGenPipeline` |
| **Processed HF artifact** | [`ZeyuLing/hftrainer-vimogen-1.3b-humanml3d`](https://huggingface.co/ZeyuLing/hftrainer-vimogen-1.3b-humanml3d) |
| **Motion representation** | **DART276** (276-dim, 20 fps), decoded to SMPL `motion_135` for mesh visualization and evaluator bridges |
| **Backbone** | WanVideoTM2M 1.3B flow-matching DiT, 50 inference steps |
| **Text encoder** | Wan2.1 T2V-1.3B UMT5-XXL encoder |
| **Paper** | *ViMoGen: Scaling Full-Body Human Motion Generation through Visual Generative Priors* |
| **Original code** | https://github.com/MotrixLab/ViMoGen |

---

## Weights

Current hftrainer artifact:

| Artifact | Location | Contents | Status |
|---|---|---|---|
| ViMoGen-DiT 1.3B HumanML3D | [`ZeyuLing/hftrainer-vimogen-1.3b-humanml3d`](https://huggingface.co/ZeyuLing/hftrainer-vimogen-1.3b-humanml3d) | `model.pt` + `model_index.json` + `assets/meta/{mean,std}.npy` | public Hub artifact |

Load through the same `from_pretrained` surface as the other reproduced
baselines:

```python
from hftrainer.pipelines.vimogen import ViMoGenPipeline

pipe = ViMoGenPipeline.from_pretrained(
    "ZeyuLing/hftrainer-vimogen-1.3b-humanml3d",
    device="cuda",
)

motions_276 = pipe.infer_t2m(
    ["Full-body shot, stable camera. A person walks forward at an average pace."],
    [200],
    seed=0,
)
```

`ViMoGenBundle.from_pretrained` reads `model_index.json`. If the Wan2.1 base
assets are not already available locally, the bundle resolves the public
`Wan-AI/Wan2.1-T2V-1.3B` Hub repo declared by `wan_repo_id`.

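The fallback order described above can be pictured with a small sketch. It is not the `ViMoGenBundle` loader: the helper name `resolve_wan_source` is hypothetical, and it assumes "available locally" simply means that a given directory exists.

```python
from pathlib import Path
from typing import Optional

# Only the field relevant here; the value is the one declared in model_index.json.
index = {"wan_repo_id": "Wan-AI/Wan2.1-T2V-1.3B"}


def resolve_wan_source(index: dict, local_dir: Optional[str] = None) -> str:
    """Hypothetical helper: prefer local Wan2.1 assets, else the declared Hub repo id."""
    # Assumption: "available locally" means the given directory exists.
    if local_dir is not None and Path(local_dir).is_dir():
        return str(Path(local_dir))
    return index["wan_repo_id"]


# With no usable local directory, the function falls back to the Hub repo id.
source = resolve_wan_source(index, local_dir="/path/that/does/not/exist")
print(source)
```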
---

## Motion Representation

ViMoGen emits **DART276**, the global DART-style representation:

```
text -> UMT5-XXL embeddings -> WanVideoTM2M DiT -> denormalized DART276
     -> dart276_to_motion135(...) for SMPL mesh / MotionCLIP / physics
     -> motion135_to_motion272(...) for MotionStreamer-272 evaluator
```

The public conversion API is:

```python
from hftrainer.motion.representation.dart276 import dart276_to_motion135

motion_135 = dart276_to_motion135(motion_276, rotation_convention="row")
```

See `docs/motion/representations.md` for the DART276 channel layout and the
root / coordinate-system convention.

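The "denormalized DART276" step presumably undoes a per-channel standardization using `assets/meta/{mean,std}.npy`. The NumPy sketch below illustrates that idea under two assumptions: each file holds one 276-value float32 vector (the 1232-byte file sizes are consistent with 276 × 4 bytes plus a 128-byte `.npy` header), and normalization was `(x - mean) / std`. Random arrays stand in for the real statistics and for the model output.

```python
import numpy as np

NUM_CHANNELS = 276  # DART276 feature dimension
FPS = 20            # frame rate stated in the model card


def denormalize(motion_norm: np.ndarray, mean: np.ndarray, std: np.ndarray) -> np.ndarray:
    """Invert z-score normalization; (276,) statistics broadcast over (T, 276) frames."""
    return motion_norm * std + mean


# Stand-ins for np.load("assets/meta/mean.npy") and np.load("assets/meta/std.npy").
rng = np.random.default_rng(0)
mean = rng.normal(size=NUM_CHANNELS).astype(np.float32)
std = rng.uniform(0.5, 2.0, size=NUM_CHANNELS).astype(np.float32)

num_frames = 200  # 200 frames at 20 fps corresponds to a 10 s clip
motion_norm = rng.normal(size=(num_frames, NUM_CHANNELS)).astype(np.float32)
motion_276 = denormalize(motion_norm, mean, std)

print(motion_276.shape, num_frames / FPS)
```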
---

## Evaluation

The leaderboard row uses the HumanML3D official-test split (`n=4042`) and the
shared corrected caption set. ViMoGen is sensitive to terse HumanML3D captions,
so generation uses a ViMoGen-style prompt rewrite derived from the corrected
caption. The rewrite adds presentation/context details such as camera, floor,
and motion-capture clothing while preserving the original action content. The
semantic evaluators are still computed against the same corrected HumanML3D
caption protocol used by the other methods.

### MotionStreamer-272 and MotionCLIP

| Evaluator | R@1 | R@2 | R@3 | FID | MM-Dist | Diversity |
|---|---:|---:|---:|---:|---:|---:|
| MotionStreamer-272 (HML round-trip GT) | 0.4291 | 0.5687 | 0.6518 | 152.2095 | 21.0737 | 24.1803 |
| MotionCLIP-135 no-L2 (HML round-trip GT) | 0.3572 | 0.4992 | 0.5893 | 457.5443 | 44.4103 | 21.6806 |

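For readers unfamiliar with the retrieval columns, R@k is the fraction of generated motions whose own caption ranks among the k nearest text embeddings. The NumPy sketch below assumes paired, precomputed embeddings and Euclidean distance over the whole pool. The actual evaluators may use a different distance, fixed-size retrieval pools, or extra normalization, so treat this as an illustration of the metric rather than the leaderboard code.

```python
import numpy as np


def recall_at_k(motion_emb: np.ndarray, text_emb: np.ndarray, ks=(1, 2, 3)) -> dict:
    """R@k for paired embeddings: row i of each array describes the same sample."""
    # Pairwise Euclidean distances, shape (N, N): motion i versus text j.
    diff = motion_emb[:, None, :] - text_emb[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)
    # Rank of the matching text = number of texts strictly closer than it.
    true_dist = np.diag(dist)[:, None]
    rank = (dist < true_dist).sum(axis=1)
    return {k: float((rank < k).mean()) for k in ks}


# Toy check: text embeddings are small perturbations of the motion embeddings.
rng = np.random.default_rng(0)
motion = rng.normal(size=(32, 16))
text = motion + 0.01 * rng.normal(size=(32, 16))
scores = recall_at_k(motion, text)
print(scores)
```

By construction R@1 ≤ R@2 ≤ R@3, which matches the ordering of the values in the table.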
### Physical Diagnostics

| Slide | Float | Jitter | Dynamic | Penet |
|---:|---:|---:|---:|---:|
| 6.9485 | 23.7270 | 4.4370 | 16.3838 | 0.0000 |

---

## Implementation Notes

- **hftrainer-native runtime**: `hftrainer.models.motion.vimogen.network` vendors
  the required ViMoGen transformer modules and scheduler.
- **No `ref_repo` dependency**: full-set HumanML3D inference uses
  `scripts/eval/vimogen_t2m_humanml3d.py` with `ViMoGenBundle` /
  `ViMoGenPipeline`.
- **Prompt sensitivity**: for leaderboard-quality generation, use the
  ViMoGen-style prompt rewrite workflow before inference. The plain corrected
  HumanML3D captions produce substantially weaker text following.
- **Evaluator bridge**: DART276 outputs are converted to repository
  `motion_135`, then to MotionStreamer-272 or MotionCLIP-135 for the shared
  cross-model leaderboard protocol.
assets/meta/mean.npy
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:5bc11279ee6b7ee7877f53b0de52c9a579b1ccb1e0a806c7a7406c170035c2ff
size 1232
assets/meta/std.npy
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:d2b1ccd6f72c4ad6ac80fd2c3404f49137256b300cffa4e04b21de0ad80da3a8
size 1232
model.pt
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:aaa158aae86b07e09cb4f9a134e61b088b326617f1a90ac25713cc1c859dfd19
size 4547554888
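The binary files above are stored as Git LFS pointers: `oid` is the SHA-256 of the file content and `size` is its length in bytes, so a downloaded file can be checked against its pointer. The sketch below shows one way to do that in Python. The helper names are illustrative, and the demo hashes a small temporary file rather than the multi-gigabyte checkpoint.

```python
import hashlib
import os
import tempfile


def parse_lfs_pointer(text: str) -> dict:
    """Parse the 'key value' lines of a Git LFS pointer file."""
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    algo, digest = fields["oid"].split(":", 1)
    return {"algo": algo, "digest": digest, "size": int(fields["size"])}


def matches_pointer(path: str, pointer: dict, chunk_size: int = 1 << 20) -> bool:
    """Compare the file size first, then stream the file through the pointer's hash."""
    if os.path.getsize(path) != pointer["size"]:
        return False
    hasher = hashlib.new(pointer["algo"])
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest() == pointer["digest"]


# Demo on a small temporary file, with a pointer built from its real hash.
payload = b"stand-in for model.pt"
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(payload)
pointer_text = (
    "version https://git-lfs.github.com/spec/v1\n"
    f"oid sha256:{hashlib.sha256(payload).hexdigest()}\n"
    f"size {len(payload)}\n"
)
is_valid = matches_pointer(tmp.name, parse_lfs_pointer(pointer_text))
os.remove(tmp.name)
print(is_valid)
```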
model_index.json
ADDED
@@ -0,0 +1,13 @@
{
  "model_type": "vimogen",
  "checkpoint": "model.pt",
  "mean_path": "assets/meta/mean.npy",
  "std_path": "assets/meta/std.npy",
  "motion_representation": "vimogen276",
  "cfg_scale": 5.0,
  "denoising_strength": 0.7,
  "num_inference_steps": 50,
  "text_encoder": "Wan2.1-T2V-1.3B UMT5-XXL",
  "wan_repo_id": "Wan-AI/Wan2.1-T2V-1.3B",
  "source": "ViMoGen released HumanML3D checkpoint"
}