	---
	library_name: hftrainer
	pipeline_tag: other
	tags:
	- motion-generation
	- text-to-motion
	- humanml3d
	- vimogen
	- dart276
	- smpl
	license: other
	---

	<!-- This model card is synchronized from docs/model_zoo/vimogen.md by tools/sync_model_zoo_cards.py. -->

	# ViMoGen

	Text-to-motion baseline integrated into the hftrainer Model Zoo. The runtime is
	hftrainer-native and does not import the upstream repository at inference time:
	the released ViMoGen transformer, scheduler, smoothing step, and DART276 motion
	representation bridge live under `hftrainer.models.motion.vimogen` and
	`hftrainer.motion.representation.dart276`.

	| | |
	|---|---|
	| Task | Text-to-Motion (T2M) |
	| Bundle / Pipeline | `ViMoGenBundle` / `ViMoGenPipeline` |
	| Processed HF artifact | [`ZeyuLing/hftrainer-vimogen-1.3b-humanml3d`](https://huggingface.co/ZeyuLing/hftrainer-vimogen-1.3b-humanml3d) |
	| Motion representation | DART276 (276-dim, 20 fps), decoded to SMPL `motion_135` for mesh visualization and evaluator bridges |
	| Backbone | WanVideoTM2M 1.3B flow-matching DiT, 50 inference steps |
	| Text encoder | Wan2.1 T2V-1.3B UMT5-XXL encoder |
	| Paper | ViMoGen: Scaling Full-Body Human Motion Generation through Visual Generative Priors |
	| Original code | https://github.com/MotrixLab/ViMoGen |

	---

	## Weights

	Current hftrainer artifact:

	| Artifact | Location | Contents | Status |
	|---|---|---|---|
	| ViMoGen-DiT 1.3B HumanML3D | [`ZeyuLing/hftrainer-vimogen-1.3b-humanml3d`](https://huggingface.co/ZeyuLing/hftrainer-vimogen-1.3b-humanml3d) | `model.pt` + `model_index.json` + `assets/meta/{mean,std}.npy` | public Hub artifact |

	Load through the same `from_pretrained` surface as the other reproduced
	baselines:

	```python
	from hftrainer.pipelines.vimogen import ViMoGenPipeline

	pipe = ViMoGenPipeline.from_pretrained(
	    "ZeyuLing/hftrainer-vimogen-1.3b-humanml3d",
	    device="cuda",
	)

	motions_276 = pipe.infer_t2m(
	    ["Full-body shot, stable camera. A person walks forward at an average pace."],
	    [200],
	    seed=0,
	)
	```

	`ViMoGenBundle.from_pretrained` reads `model_index.json`. If the Wan2.1 base
	assets are not already available locally, the bundle resolves the public
	`Wan-AI/Wan2.1-T2V-1.3B` Hub repo declared by `wan_repo_id`.
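
To check which Wan2.1 repo a downloaded artifact points at, the index file can be read directly. The following is a minimal sketch, assuming `model_index.json` is a flat JSON object with `wan_repo_id` as a top-level key. The helper name `read_wan_repo_id` is hypothetical and not part of hftrainer, and the bundle's own resolution logic may differ.

```python
import json
from pathlib import Path


def read_wan_repo_id(model_dir):
    """Return the `wan_repo_id` value from <model_dir>/model_index.json.

    Hypothetical helper for inspection only. Returns None when the key is
    absent; hftrainer may store or resolve this value differently.
    """
    index_path = Path(model_dir) / "model_index.json"
    with index_path.open("r", encoding="utf-8") as f:
        index = json.load(f)
    return index.get("wan_repo_id")
```

Pointing it at a local snapshot directory of the artifact returns the declared repo id, which can then be pre-downloaded on machines without Hub access at inference time.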

	---

	## Motion Representation

	ViMoGen emits DART276, the global DART-style representation:

	```
	text -> UMT5-XXL embeddings -> WanVideoTM2M DiT -> denormalized DART276
	     -> dart276_to_motion135(...) for SMPL mesh / MotionCLIP / physics
	     -> motion135_to_motion272(...) for MotionStreamer-272 evaluator
	```
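
The "denormalized DART276" step in the flow above can be illustrated as follows. This is a sketch under assumptions: it supposes the `assets/meta/{mean,std}.npy` statistics are per-channel vectors of shape `(276,)` and that normalization is a plain z-score, neither of which this card states. The function `denormalize` is illustrative, not the hftrainer implementation, and the arrays below are synthetic stand-ins for the real statistics.

```python
import numpy as np


def denormalize(motion_norm, mean, std):
    """Undo per-channel z-score normalization: x = x_norm * std + mean."""
    motion_norm = np.asarray(motion_norm, dtype=np.float32)
    mean = np.asarray(mean, dtype=np.float32)
    std = np.asarray(std, dtype=np.float32)
    if mean.shape != std.shape or motion_norm.shape[-1] != mean.shape[-1]:
        raise ValueError("channel dimension mismatch between motion, mean, and std")
    return motion_norm * std + mean


# Synthetic stand-ins for the real statistics (assumed shape: (276,)).
mean = np.zeros(276, dtype=np.float32)
std = np.ones(276, dtype=np.float32)
motion_norm = np.zeros((200, 276), dtype=np.float32)  # (frames, channels)

motion_276 = denormalize(motion_norm, mean, std)
print(motion_276.shape)  # (200, 276)
```

With the real artifact, `mean` and `std` would be loaded with `np.load` from the two `.npy` files instead of being constructed by hand.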

	The public conversion API is:

	```python
	from hftrainer.motion.representation.dart276 import dart276_to_motion135

	motion_135 = dart276_to_motion135(motion_276, rotation_convention="row")
	```

	See `docs/motion/representations.md` for the DART276 channel layout and the
	root / coordinate-system convention.

	---

	## Evaluation

	The leaderboard row uses the HumanML3D official-test split (`n=4042`) and the
	shared corrected caption set. ViMoGen is sensitive to terse HumanML3D captions,
	so generation uses a ViMoGen-style prompt rewrite derived from the corrected
	caption. The rewrite adds presentation/context details such as camera, floor,
	and motion-capture clothing while preserving the original action content. The
	semantic evaluators are still computed against the same corrected HumanML3D
	caption protocol used by the other methods.
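
A template-based sketch of such a rewrite is shown below. It is an illustration only: the prefix is modeled on the example prompt used earlier in this card, the function `rewrite_caption` is hypothetical, and the actual rewrite workflow may add other details (floor, clothing) or use a different mechanism altogether.

```python
# Hypothetical prefix modeled on the example prompt in this card.
CONTEXT_PREFIX = "Full-body shot, stable camera."


def rewrite_caption(caption, prefix=CONTEXT_PREFIX):
    """Prepend presentation context to a caption, leaving the action text intact."""
    action = caption.strip()
    if not action:
        raise ValueError("caption must not be empty")
    # Capitalize the first letter and make sure the sentence is terminated.
    action = action[0].upper() + action[1:]
    if action[-1] not in ".!?":
        action += "."
    return f"{prefix} {action}"


print(rewrite_caption("a person walks forward at an average pace"))
# Full-body shot, stable camera. A person walks forward at an average pace.
```

Keeping the corrected caption alongside the rewritten prompt is what allows the rewritten text to drive generation while the original corrected caption is still the one scored by the evaluators.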

	### MotionStreamer-272 and MotionCLIP

	| Evaluator | R@1 | R@2 | R@3 | FID | MM-Dist | Diversity |
	|---|---:|---:|---:|---:|---:|---:|
	| MotionStreamer-272 (HML round-trip GT) | 0.4291 | 0.5687 | 0.6518 | 152.2095 | 21.0737 | 24.1803 |
	| MotionCLIP-135 no-L2 (HML round-trip GT) | 0.3572 | 0.4992 | 0.5893 | 457.5443 | 44.4103 | 21.6806 |
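
For readers unfamiliar with the retrieval columns, the sketch below shows one common way R@k and MM-Dist are computed from paired text and motion embeddings. It assumes Euclidean distance and ranking over the whole set passed in. The leaderboard's actual embedding models, batching, and distance choices are not specified in this card, so this is not a reproduction of the numbers above, and `retrieval_metrics` is an illustrative name.

```python
import numpy as np


def retrieval_metrics(text_emb, motion_emb, ks=(1, 2, 3)):
    """R@k and MM-Dist for row-aligned text/motion embeddings.

    Row i of `text_emb` is assumed to describe row i of `motion_emb`.
    """
    text_emb = np.asarray(text_emb, dtype=np.float64)
    motion_emb = np.asarray(motion_emb, dtype=np.float64)
    if text_emb.shape != motion_emb.shape:
        raise ValueError("text and motion embeddings must have the same shape")
    # Pairwise Euclidean distances, shape (n, n).
    diff = text_emb[:, None, :] - motion_emb[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)
    # For each text, position of its matching motion in the sorted list (0 = nearest).
    order = np.argsort(dist, axis=1, kind="stable")
    ranks = np.argmax(order == np.arange(len(dist))[:, None], axis=1)
    results = {f"R@{k}": float(np.mean(ranks < k)) for k in ks}
    # MM-Dist: mean distance between each text and its own motion.
    results["MM-Dist"] = float(np.mean(np.diag(dist)))
    return results
```

Higher R@k and lower MM-Dist both indicate that generated motions sit closer to their own captions in the evaluator's embedding space.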

	### Physical Diagnostics

	| Slide | Float | Jitter | Dynamic | Penet |
	|---:|---:|---:|---:|---:|
	| 6.9485 | 23.7270 | 4.4370 | 16.3838 | 0.0000 |

	---

	## Implementation Notes

	- hftrainer-native runtime: `hftrainer.models.motion.vimogen.network` vendors
	the required ViMoGen transformer modules and scheduler.
	- No `ref_repo` dependency: full-set HumanML3D inference uses
	`scripts/eval/vimogen_t2m_humanml3d.py` with `ViMoGenBundle` /
	`ViMoGenPipeline`.
	- Prompt sensitivity: for leaderboard-quality generation, use the
	ViMoGen-style prompt rewrite workflow before inference. The plain corrected
	HumanML3D captions produce substantially weaker text following.
	- Evaluator bridge: DART276 outputs are converted to repository
	`motion_135`, then to MotionStreamer-272 or MotionCLIP-135 for the shared
	cross-model leaderboard protocol.