Upload hftrainer ViMoGen 1.3B HumanML3D artifact

Files changed (5)

README.md +129 -0
assets/meta/mean.npy +3 -0
assets/meta/std.npy +3 -0
model.pt +3 -0
model_index.json +13 -0

README.md ADDED

	@@ -0,0 +1,129 @@

+---
+library_name: hftrainer
+pipeline_tag: other
+tags:
+- motion-generation
+- text-to-motion
+- humanml3d
+- vimogen
+- dart276
+- smpl
+license: other
+---
+<!-- This model card is synchronized from docs/model_zoo/vimogen.md by tools/sync_model_zoo_cards.py. -->
+# ViMoGen
+Text-to-motion baseline integrated into the hftrainer Model Zoo. The runtime is
+hftrainer-native and does not import the upstream repository at inference time:
+the released ViMoGen transformer, scheduler, smoothing step, and DART276 motion
+representation bridge live under `hftrainer.models.motion.vimogen` and
+`hftrainer.motion.representation.dart276`.
+| | |
+|---|---|
+| **Task** | Text-to-Motion (T2M) |
+| **Bundle / Pipeline** | `ViMoGenBundle` / `ViMoGenPipeline` |
+| **Processed HF artifact** | [`ZeyuLing/hftrainer-vimogen-1.3b-humanml3d`](https://huggingface.co/ZeyuLing/hftrainer-vimogen-1.3b-humanml3d) |
+| **Motion representation** | **DART276** (276-dim, 20 fps), decoded to SMPL `motion_135` for mesh visualization and evaluator bridges |
+| **Backbone** | WanVideoTM2M 1.3B flow-matching DiT, 50 inference steps |
+| **Text encoder** | Wan2.1 T2V-1.3B UMT5-XXL encoder |
+| **Paper** | *ViMoGen: Scaling Full-Body Human Motion Generation through Visual Generative Priors* |
+| **Original code** | https://github.com/MotrixLab/ViMoGen |
+---
+## Weights
+Current hftrainer artifact:
+| Artifact | Location | Contents | Status |
+|---|---|---|---|
+| ViMoGen-DiT 1.3B HumanML3D | [`ZeyuLing/hftrainer-vimogen-1.3b-humanml3d`](https://huggingface.co/ZeyuLing/hftrainer-vimogen-1.3b-humanml3d) | `model.pt` + `model_index.json` + `assets/meta/{mean,std}.npy` | public Hub artifact |
+Load it through the same `from_pretrained` interface as the other reproduced
+baselines:
+```python
+from hftrainer.pipelines.vimogen import ViMoGenPipeline
+pipe = ViMoGenPipeline.from_pretrained(
+    "ZeyuLing/hftrainer-vimogen-1.3b-humanml3d",
+    device="cuda",
+)
+motions_276 = pipe.infer_t2m(
+    ["Full-body shot, stable camera. A person walks forward at an average pace."],
+    [200],
+    seed=0,
+)
+```
+`ViMoGenBundle.from_pretrained` reads `model_index.json`. If the Wan2.1 base
+assets are not already available locally, the bundle resolves the public
+`Wan-AI/Wan2.1-T2V-1.3B` Hub repo declared by `wan_repo_id`.
+---
+## Motion Representation
+ViMoGen emits **DART276**, the global DART-style representation:
+```
+text -> UMT5-XXL embeddings -> WanVideoTM2M DiT -> denormalized DART276
+     -> dart276_to_motion135(...) for SMPL mesh / MotionCLIP / physics
+     -> motion135_to_motion272(...) for MotionStreamer-272 evaluator
+```
+The public conversion API is:
+```python
+from hftrainer.motion.representation.dart276 import dart276_to_motion135
+motion_135 = dart276_to_motion135(motion_276, rotation_convention="row")
+```
+See `docs/motion/representations.md` for the DART276 channel layout and the
+root / coordinate-system convention.
+---
+## Evaluation
+The leaderboard row uses the HumanML3D official-test split (`n=4042`) and the
+shared corrected caption set. ViMoGen is sensitive to terse HumanML3D captions,
+so generation uses a ViMoGen-style prompt rewrite derived from the corrected
+caption. The rewrite adds presentation/context details such as camera, floor,
+and motion-capture clothing while preserving the original action content. The
+semantic evaluators are still computed against the same corrected HumanML3D
+caption protocol used by the other methods.
+### MotionStreamer-272 and MotionCLIP
+| Evaluator | R@1 | R@2 | R@3 | FID | MM-Dist | Diversity |
+|---|---:|---:|---:|---:|---:|---:|
+| MotionStreamer-272 (HML round-trip GT) | 0.4291 | 0.5687 | 0.6518 | 152.2095 | 21.0737 | 24.1803 |
+| MotionCLIP-135 no-L2 (HML round-trip GT) | 0.3572 | 0.4992 | 0.5893 | 457.5443 | 44.4103 | 21.6806 |
+### Physical Diagnostics
+| Slide | Float | Jitter | Dynamic | Penet |
+|---:|---:|---:|---:|---:|
+| 6.9485 | 23.7270 | 4.4370 | 16.3838 | 0.0000 |
+---
+## Implementation Notes
+- **hftrainer-native runtime**: `hftrainer.models.motion.vimogen.network` vendors
+  the required ViMoGen transformer modules and scheduler.
+- **No `ref_repo` dependency**: full-set HumanML3D inference uses
+  `scripts/eval/vimogen_t2m_humanml3d.py` with `ViMoGenBundle` /
+  `ViMoGenPipeline`.
+- **Prompt sensitivity**: for leaderboard-quality generation, apply the
+  ViMoGen-style prompt rewrite before inference. The plain corrected
+  HumanML3D captions yield substantially weaker text adherence.
+- **Evaluator bridge**: DART276 outputs are converted to the repository's
+  `motion_135` format, then to MotionStreamer-272 or MotionCLIP-135 for the
+  shared cross-model leaderboard protocol.
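
The R@1, R@2, and R@3 columns in the README's evaluation tables are retrieval precisions. As a rough illustration of the metric itself, and not of the hftrainer evaluator code, the sketch below ranks candidate captions by distance to each motion embedding and counts how often the matching caption lands in the top k. The function name `r_precision` and the toy distance matrix are hypothetical; the real evaluators use learned MotionStreamer-272 or MotionCLIP-135 embeddings and may pool candidates differently.

```python
from typing import List, Sequence


def r_precision(dist: Sequence[Sequence[float]], top_k: int = 3) -> List[float]:
    """Return [R@1, ..., R@top_k] for a square motion-by-caption distance matrix.

    dist[i][j] is the distance between motion i and caption j. The matching
    caption for motion i is assumed to sit on the diagonal (j == i).
    """
    n = len(dist)
    hits = [0] * top_k
    for i, row in enumerate(dist):
        # Rank caption indices from closest to farthest for this motion.
        ranked = sorted(range(len(row)), key=lambda j: row[j])
        rank_of_match = ranked.index(i)
        # A match at rank r counts toward every R@k with k > r.
        for k in range(rank_of_match, top_k):
            hits[k] += 1
    return [h / n for h in hits]


# Toy example: 3 motions x 3 captions (values are made up).
toy = [
    [0.1, 0.9, 0.8],  # motion 0: matching caption is ranked 1st
    [0.2, 0.5, 0.9],  # motion 1: matching caption is ranked 2nd
    [0.3, 0.4, 0.7],  # motion 2: matching caption is ranked 3rd
]
scores = r_precision(toy)
```

By construction R@k is non-decreasing in k, which is the pattern visible in both evaluator rows of the README.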

assets/meta/mean.npy ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:5bc11279ee6b7ee7877f53b0de52c9a579b1ccb1e0a806c7a7406c170035c2ff
+size 1232

assets/meta/std.npy ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:d2b1ccd6f72c4ad6ac80fd2c3404f49137256b300cffa4e04b21de0ad80da3a8
+size 1232
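
Both statistics pointers record a size of 1232 bytes. That is consistent with a 276-value float32 vector stored behind a 128-byte NPY header (`128 + 276 * 4 = 1232`), which would match the 276-dim DART276 representation named in the README. The sketch below checks that arithmetic and illustrates what the README's "denormalized DART276" step plausibly does, assuming per-channel z-score normalization. Both the header size and the normalization formula are assumptions not confirmed by this commit, and `expected_npy_size` and `denormalize` are hypothetical helpers. Plain Python lists stand in for arrays to keep the example dependency-free.

```python
from typing import List, Sequence

NPY_HEADER_BYTES = 128  # assumed: typical header size for a small 1-D .npy file
FLOAT32_BYTES = 4
MOTION_DIM = 276        # DART276 channel count stated in the README


def expected_npy_size(num_values: int) -> int:
    """Byte size of a float32 vector saved as .npy, under the assumptions above."""
    return NPY_HEADER_BYTES + num_values * FLOAT32_BYTES


def denormalize(
    frames: Sequence[Sequence[float]],
    mean: Sequence[float],
    std: Sequence[float],
) -> List[List[float]]:
    """Undo per-channel z-score normalization: x = x_norm * std + mean."""
    return [[x * s + m for x, m, s in zip(frame, mean, std)] for frame in frames]


size = expected_npy_size(MOTION_DIM)

# Toy data with 2 channels instead of 276 (values are made up).
toy_frames = [[0.0, 1.0], [-1.0, 2.0]]
restored = denormalize(toy_frames, mean=[10.0, 20.0], std=[2.0, 0.5])
```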

model.pt ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:aaa158aae86b07e09cb4f9a134e61b088b326617f1a90ac25713cc1c859dfd19
+size 4547554888
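
The three binary files in this commit are stored as Git LFS pointers, each recording a SHA-256 `oid` and a byte `size`. A downloaded file can be checked against its pointer with the standard library alone. The sketch below is one way to do that; `parse_lfs_pointer` and `file_matches_pointer` are hypothetical helper names, and the file is hashed in chunks because `model.pt` is about 4.5 GB.

```python
import hashlib
from typing import Dict


def parse_lfs_pointer(text: str) -> Dict[str, str]:
    """Parse the `key value` lines of a Git LFS pointer file into a dict."""
    fields: Dict[str, str] = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition(" ")
        fields[key] = value
    return fields


def file_matches_pointer(
    path: str, pointer: Dict[str, str], chunk_size: int = 1 << 20
) -> bool:
    """Return True if the file at `path` has the pointer's size and SHA-256."""
    algo, _, expected_digest = pointer["oid"].partition(":")
    if algo != "sha256":
        raise ValueError(f"unsupported oid algorithm: {algo!r}")
    digest = hashlib.sha256()
    size = 0
    with open(path, "rb") as fh:
        # Stream the file so large checkpoints never sit fully in memory.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
            size += len(chunk)
    return size == int(pointer["size"]) and digest.hexdigest() == expected_digest


pointer = parse_lfs_pointer(
    "version https://git-lfs.github.com/spec/v1\n"
    "oid sha256:aaa158aae86b07e09cb4f9a134e61b088b326617f1a90ac25713cc1c859dfd19\n"
    "size 4547554888\n"
)
```

Calling `file_matches_pointer("model.pt", pointer)` on a local copy would then report whether the download is intact.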

model_index.json ADDED

	@@ -0,0 +1,13 @@

+{
+  "model_type": "vimogen",
+  "checkpoint": "model.pt",
+  "mean_path": "assets/meta/mean.npy",
+  "std_path": "assets/meta/std.npy",
+  "motion_representation": "vimogen276",
+  "cfg_scale": 5.0,
+  "denoising_strength": 0.7,
+  "num_inference_steps": 50,
+  "text_encoder": "Wan2.1-T2V-1.3B UMT5-XXL",
+  "wan_repo_id": "Wan-AI/Wan2.1-T2V-1.3B",
+  "source": "ViMoGen released HumanML3D checkpoint"
+}
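
The README states that `ViMoGenBundle.from_pretrained` reads `model_index.json`. The sketch below is not hftrainer's loader; it only shows how the file above could be parsed and sanity-checked with the standard library before use. The helper name `load_model_index` and the choice of `REQUIRED_KEYS` are assumptions made for illustration.

```python
import json
from typing import Any, Dict

# Keys this sketch treats as mandatory (an illustrative choice, not a spec).
REQUIRED_KEYS = ("model_type", "checkpoint", "mean_path", "std_path", "wan_repo_id")

MODEL_INDEX_TEXT = """
{
  "model_type": "vimogen",
  "checkpoint": "model.pt",
  "mean_path": "assets/meta/mean.npy",
  "std_path": "assets/meta/std.npy",
  "motion_representation": "vimogen276",
  "cfg_scale": 5.0,
  "denoising_strength": 0.7,
  "num_inference_steps": 50,
  "text_encoder": "Wan2.1-T2V-1.3B UMT5-XXL",
  "wan_repo_id": "Wan-AI/Wan2.1-T2V-1.3B",
  "source": "ViMoGen released HumanML3D checkpoint"
}
"""


def load_model_index(text: str) -> Dict[str, Any]:
    """Parse model_index.json content and verify the fields this sketch relies on."""
    config = json.loads(text)
    missing = [key for key in REQUIRED_KEYS if key not in config]
    if missing:
        raise KeyError(f"model_index.json is missing keys: {missing}")
    if config["model_type"] != "vimogen":
        raise ValueError(f"unexpected model_type: {config['model_type']!r}")
    return config


config = load_model_index(MODEL_INDEX_TEXT)
```

Note that the config names the representation `vimogen276`, while the README calls it DART276; this commit does not say whether the two labels are interchangeable.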