ZeyuLing committed · Commit 03ef622 · verified · 1 Parent(s): de506da

Upload hftrainer ViMoGen 1.3B HumanML3D artifact

Files changed (5)
  1. README.md +129 -0
  2. assets/meta/mean.npy +3 -0
  3. assets/meta/std.npy +3 -0
  4. model.pt +3 -0
  5. model_index.json +13 -0
README.md ADDED
@@ -0,0 +1,129 @@
+ ---
+ library_name: hftrainer
+ pipeline_tag: other
+ tags:
+ - motion-generation
+ - text-to-motion
+ - humanml3d
+ - vimogen
+ - dart276
+ - smpl
+ license: other
+ ---
+
+ <!-- This model card is synchronized from docs/model_zoo/vimogen.md by tools/sync_model_zoo_cards.py. -->
+
+ # ViMoGen
+
+ Text-to-motion baseline integrated into the hftrainer Model Zoo. The runtime is
+ hftrainer-native and does not import the upstream repository at inference time:
+ the released ViMoGen transformer, scheduler, smoothing step, and DART276 motion
+ representation bridge live under `hftrainer.models.motion.vimogen` and
+ `hftrainer.motion.representation.dart276`.
+
+ | | |
+ |---|---|
+ | **Task** | Text-to-Motion (T2M) |
+ | **Bundle / Pipeline** | `ViMoGenBundle` / `ViMoGenPipeline` |
+ | **Processed HF artifact** | [`ZeyuLing/hftrainer-vimogen-1.3b-humanml3d`](https://huggingface.co/ZeyuLing/hftrainer-vimogen-1.3b-humanml3d) |
+ | **Motion representation** | **DART276** (276-dim, 20 fps), decoded to SMPL `motion_135` for mesh visualization and evaluator bridges |
+ | **Backbone** | WanVideoTM2M 1.3B flow-matching DiT, 50 inference steps |
+ | **Text encoder** | Wan2.1 T2V-1.3B UMT5-XXL encoder |
+ | **Paper** | *ViMoGen: Scaling Full-Body Human Motion Generation through Visual Generative Priors* |
+ | **Original code** | https://github.com/MotrixLab/ViMoGen |
+
+ ---
+
+ ## Weights
+
+ Current hftrainer artifact:
+
+ | Artifact | Location | Contents | Status |
+ |---|---|---|---|
+ | ViMoGen-DiT 1.3B HumanML3D | [`ZeyuLing/hftrainer-vimogen-1.3b-humanml3d`](https://huggingface.co/ZeyuLing/hftrainer-vimogen-1.3b-humanml3d) | `model.pt` + `model_index.json` + `assets/meta/{mean,std}.npy` | public Hub artifact |
+
+ Load through the same `from_pretrained` surface as the other reproduced
+ baselines:
+
+ ```python
+ from hftrainer.pipelines.vimogen import ViMoGenPipeline
+
+ pipe = ViMoGenPipeline.from_pretrained(
+     "ZeyuLing/hftrainer-vimogen-1.3b-humanml3d",
+     device="cuda",
+ )
+
+ motions_276 = pipe.infer_t2m(
+     ["Full-body shot, stable camera. A person walks forward at an average pace."],
+     [200],
+     seed=0,
+ )
+ ```
+
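The model card gives the DART276 frame rate as 20 fps. Assuming the second argument to `infer_t2m` is a list of target lengths in frames (the card does not state this explicitly), `[200]` would correspond to a 10-second clip. The helpers below are hypothetical and not part of hftrainer; they only sketch the frame/second arithmetic under that assumption.

```python
MOTION_FPS = 20  # DART276 frame rate stated in the model card


def seconds_to_frames(seconds: float, fps: int = MOTION_FPS) -> int:
    """Convert a clip duration in seconds to a frame count (hypothetical helper)."""
    if seconds <= 0:
        raise ValueError("duration must be positive")
    # Round to the nearest frame and never return fewer than one frame.
    return max(1, round(seconds * fps))


def frames_to_seconds(frames: int, fps: int = MOTION_FPS) -> float:
    """Convert a frame count back to a duration in seconds."""
    return frames / fps


length_frames = seconds_to_frames(10.0)
print(length_frames)                     # 200
print(frames_to_seconds(length_frames))  # 10.0
```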
+ `ViMoGenBundle.from_pretrained` reads `model_index.json`. If the Wan2.1 base
+ assets are not already available locally, the bundle resolves the public
+ `Wan-AI/Wan2.1-T2V-1.3B` Hub repo declared by `wan_repo_id`.
+
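To make the role of `model_index.json` concrete, the following sketch parses the file contents added in this commit and pulls out the fields the paragraph above mentions. It is not the `ViMoGenBundle` implementation, which is not shown in this diff; `summarize_index` and the choice of "required" keys are assumptions made for illustration.

```python
import json

# Contents of model_index.json as added in this commit.
MODEL_INDEX = """
{
  "model_type": "vimogen",
  "checkpoint": "model.pt",
  "mean_path": "assets/meta/mean.npy",
  "std_path": "assets/meta/std.npy",
  "motion_representation": "vimogen276",
  "cfg_scale": 5.0,
  "denoising_strength": 0.7,
  "num_inference_steps": 50,
  "text_encoder": "Wan2.1-T2V-1.3B UMT5-XXL",
  "wan_repo_id": "Wan-AI/Wan2.1-T2V-1.3B",
  "source": "ViMoGen released HumanML3D checkpoint"
}
"""


def summarize_index(raw: str) -> dict:
    """Hypothetical helper: group the index fields by what they are used for."""
    cfg = json.loads(raw)
    required = ("checkpoint", "mean_path", "std_path", "wan_repo_id")
    missing = [key for key in required if key not in cfg]
    if missing:
        raise KeyError(f"model_index.json is missing keys: {missing}")
    return {
        # Files expected to sit next to model_index.json in the artifact.
        "files": [cfg["checkpoint"], cfg["mean_path"], cfg["std_path"]],
        # Hub repo holding the Wan2.1 base assets (text encoder).
        "wan_repo_id": cfg["wan_repo_id"],
        # Default sampling settings recorded with the checkpoint.
        "sampling": {
            key: cfg[key]
            for key in ("cfg_scale", "denoising_strength", "num_inference_steps")
        },
    }


summary = summarize_index(MODEL_INDEX)
print(summary["files"])
print(summary["wan_repo_id"])
print(summary["sampling"])
```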
+ ---
+
+ ## Motion Representation
+
+ ViMoGen emits **DART276**, the global DART-style representation:
+
+ ```
+ text -> UMT5-XXL embeddings -> WanVideoTM2M DiT -> denormalized DART276
+ -> dart276_to_motion135(...) for SMPL mesh / MotionCLIP / physics
+ -> motion135_to_motion272(...) for MotionStreamer-272 evaluator
+ ```
+
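The "denormalized DART276" step in the flow above presumably uses the `mean.npy` and `std.npy` statistics shipped with the artifact. The diff does not show the formula, so the sketch below assumes the common per-channel z-score convention, `x = x_norm * std + mean`. The `denormalize` function and the toy statistics are illustrative only and use plain Python lists in place of the real arrays.

```python
from typing import List, Sequence

DART276_DIM = 276  # feature width stated in the model card


def denormalize(
    frames: Sequence[Sequence[float]],
    mean: Sequence[float],
    std: Sequence[float],
) -> List[List[float]]:
    """Undo per-channel z-score normalization (assumed convention)."""
    if len(mean) != len(std):
        raise ValueError("mean and std must have the same length")
    restored = []
    for frame in frames:
        if len(frame) != len(mean):
            raise ValueError("frame width does not match the statistics")
        restored.append([x * s + m for x, m, s in zip(frame, mean, std)])
    return restored


# Toy statistics standing in for assets/meta/{mean,std}.npy.
mean = [0.5] * DART276_DIM
std = [2.0] * DART276_DIM
normalized = [[1.0] * DART276_DIM, [0.0] * DART276_DIM]

motion = denormalize(normalized, mean, std)
print(len(motion), len(motion[0]))  # 2 276
print(motion[0][0], motion[1][0])   # 2.5 0.5
```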
+ The public conversion API is:
+
+ ```python
+ from hftrainer.motion.representation.dart276 import dart276_to_motion135
+
+ motion_135 = dart276_to_motion135(motion_276, rotation_convention="row")
+ ```
+
+ See `docs/motion/representations.md` for the DART276 channel layout and the
+ root / coordinate-system convention.
+
+ ---
+
+ ## Evaluation
+
+ The leaderboard row uses the HumanML3D official-test split (`n=4042`) and the
+ shared corrected caption set. ViMoGen is sensitive to terse HumanML3D captions,
+ so generation uses a ViMoGen-style prompt rewrite derived from the corrected
+ caption. The rewrite adds presentation/context details such as camera, floor,
+ and motion-capture clothing while preserving the original action content. The
+ semantic evaluators are still computed against the same corrected HumanML3D
+ caption protocol used by the other methods.
+
+ ### MotionStreamer-272 and MotionCLIP
+
+ | Evaluator | R@1 | R@2 | R@3 | FID | MM-Dist | Diversity |
+ |---|---:|---:|---:|---:|---:|---:|
+ | MotionStreamer-272 (HML round-trip GT) | 0.4291 | 0.5687 | 0.6518 | 152.2095 | 21.0737 | 24.1803 |
+ | MotionCLIP-135 no-L2 (HML round-trip GT) | 0.3572 | 0.4992 | 0.5893 | 457.5443 | 44.4103 | 21.6806 |
+
+ ### Physical Diagnostics
+
+ | Slide | Float | Jitter | Dynamic | Penet |
+ |---:|---:|---:|---:|---:|
+ | 6.9485 | 23.7270 | 4.4370 | 16.3838 | 0.0000 |
+
+ ---
+
+ ## Implementation Notes
+
+ - **hftrainer-native runtime**: `hftrainer.models.motion.vimogen.network` vendors
+   the required ViMoGen transformer modules and scheduler.
+ - **No `ref_repo` dependency**: full-set HumanML3D inference uses
+   `scripts/eval/vimogen_t2m_humanml3d.py` with `ViMoGenBundle` /
+   `ViMoGenPipeline`.
+ - **Prompt sensitivity**: for leaderboard-quality generation, use the
+   ViMoGen-style prompt rewrite workflow before inference. The plain corrected
+   HumanML3D captions produce substantially weaker text following.
+ - **Evaluator bridge**: DART276 outputs are converted to repository
+   `motion_135`, then to MotionStreamer-272 or MotionCLIP-135 for the shared
+   cross-model leaderboard protocol.
assets/meta/mean.npy ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:5bc11279ee6b7ee7877f53b0de52c9a579b1ccb1e0a806c7a7406c170035c2ff
+ size 1232
assets/meta/std.npy ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:d2b1ccd6f72c4ad6ac80fd2c3404f49137256b300cffa4e04b21de0ad80da3a8
+ size 1232
model.pt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:aaa158aae86b07e09cb4f9a134e61b088b326617f1a90ac25713cc1c859dfd19
+ size 4547554888
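The three binary files above are stored as Git LFS pointers, each carrying a `version`, an `oid`, and a `size` in bytes. As a rough sanity check, the sketch below parses one pointer and relates the 1232-byte size of `mean.npy` to the 276-dim representation named in the README. It assumes a 128-byte `.npy` header and a float32 payload; neither is confirmed by this diff, so the match is suggestive rather than conclusive. `parse_lfs_pointer` is a hypothetical helper.

```python
def parse_lfs_pointer(text: str) -> dict:
    """Parse the "key value" lines of a Git LFS pointer file."""
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    hash_algo, digest = fields["oid"].split(":", 1)
    return {
        "version": fields["version"],
        "hash_algo": hash_algo,
        "oid": digest,
        "size": int(fields["size"]),
    }


# Pointer for assets/meta/mean.npy, copied from this commit.
MEAN_POINTER = """\
version https://git-lfs.github.com/spec/v1
oid sha256:5bc11279ee6b7ee7877f53b0de52c9a579b1ccb1e0a806c7a7406c170035c2ff
size 1232
"""

mean_ptr = parse_lfs_pointer(MEAN_POINTER)

# Assumptions (not confirmed by the diff): 128-byte .npy header, float32 values.
NPY_HEADER_BYTES = 128
FLOAT32_BYTES = 4
implied_values = (mean_ptr["size"] - NPY_HEADER_BYTES) // FLOAT32_BYTES
print(implied_values)  # 276

# Size of model.pt from its pointer, expressed in decimal gigabytes.
MODEL_PT_BYTES = 4547554888
print(round(MODEL_PT_BYTES / 1e9, 2))  # 4.55
```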
model_index.json ADDED
@@ -0,0 +1,13 @@
+ {
+   "model_type": "vimogen",
+   "checkpoint": "model.pt",
+   "mean_path": "assets/meta/mean.npy",
+   "std_path": "assets/meta/std.npy",
+   "motion_representation": "vimogen276",
+   "cfg_scale": 5.0,
+   "denoising_strength": 0.7,
+   "num_inference_steps": 50,
+   "text_encoder": "Wan2.1-T2V-1.3B UMT5-XXL",
+   "wan_repo_id": "Wan-AI/Wan2.1-T2V-1.3B",
+   "source": "ViMoGen released HumanML3D checkpoint"
+ }