library_name: hftrainer
pipeline_tag: other
tags:
  - motion-generation
  - text-to-motion
  - humanml3d
  - vimogen
  - dart276
  - smpl
license: other

ViMoGen

ViMoGen is a text-to-motion baseline integrated into the hftrainer Model Zoo. The runtime is hftrainer-native and does not import the upstream repository at inference time: the released ViMoGen transformer, scheduler, smoothing step, and DART276 motion-representation bridge live under hftrainer.models.motion.vimogen and hftrainer.motion.representation.dart276.


| Property | Value |
|---|---|
| Task | Text-to-Motion (T2M) |
| Bundle / Pipeline | `ViMoGenBundle` / `ViMoGenPipeline` |
| Processed HF artifact | `ZeyuLing/hftrainer-vimogen-1.3b-humanml3d` |
| Motion representation | DART276 (276-dim, 20 fps), decoded to SMPL `motion_135` for mesh visualization and evaluator bridges |
| Backbone | WanVideoTM2M 1.3B flow-matching DiT, 50 inference steps |
| Text encoder | Wan2.1 T2V-1.3B UMT5-XXL encoder |
| Paper | ViMoGen: Scaling Full-Body Human Motion Generation through Visual Generative Priors |
| Original code | https://github.com/MotrixLab/ViMoGen |

Weights

Current hftrainer artifact:

| Artifact | Location | Contents | Status |
|---|---|---|---|
| ViMoGen-DiT 1.3B HumanML3D | `ZeyuLing/hftrainer-vimogen-1.3b-humanml3d` | `model.pt` + `model_index.json` + `assets/meta/{mean,std}.npy` | public Hub artifact |

Load the artifact through the same from_pretrained interface as the other reproduced baselines:

from hftrainer.pipelines.vimogen import ViMoGenPipeline

pipe = ViMoGenPipeline.from_pretrained(
    "ZeyuLing/hftrainer-vimogen-1.3b-humanml3d",
    device="cuda",
)

motions_276 = pipe.infer_t2m(
    ["Full-body shot, stable camera. A person walks forward at an average pace."],
    [200],
    seed=0,
)

ViMoGenBundle.from_pretrained reads model_index.json. If the Wan2.1 base assets are not already available locally, the bundle resolves the public Wan-AI/Wan2.1-T2V-1.3B Hub repo declared by wan_repo_id.
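
The fallback described above can be sketched as follows. This is a minimal illustration, not the bundle's actual code: the helper `resolve_wan_source` and the `wan_local_dir` key are hypothetical names introduced here, and only `model_index.json` and `wan_repo_id` come from this card.

```python
import json
from pathlib import Path


def resolve_wan_source(artifact_dir):
    """Return ("local", path) or ("hub", repo_id) for the Wan2.1 base assets.

    Hypothetical helper: the real ViMoGenBundle logic may differ.
    """
    index_path = Path(artifact_dir) / "model_index.json"
    index = json.loads(index_path.read_text())

    # "wan_local_dir" is an assumed, illustrative key. "wan_repo_id" is the
    # field named in this card.
    local_dir = index.get("wan_local_dir")
    if local_dir and Path(local_dir).is_dir():
        return "local", str(Path(local_dir))

    # No usable local copy: fall back to the declared Hub repository.
    return "hub", index["wan_repo_id"]
```

With a `model_index.json` that declares only `wan_repo_id`, the helper returns the Hub repo id. Downloading the repo itself is left out of the sketch.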

Motion Representation

ViMoGen emits DART276, the global DART-style representation:

text -> UMT5-XXL embeddings -> WanVideoTM2M DiT -> denormalized DART276
     -> dart276_to_motion135(...) for SMPL mesh / MotionCLIP / physics
     -> motion135_to_motion272(...) for MotionStreamer-272 evaluator

The public conversion API is:

from hftrainer.motion.representation.dart276 import dart276_to_motion135

motion_135 = dart276_to_motion135(motion_276, rotation_convention="row")

See docs/motion/representations.md for the DART276 channel layout and the root / coordinate-system convention.
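
The "denormalized DART276" step in the diagram above can be illustrated with a short NumPy sketch. It assumes the `mean.npy` and `std.npy` files shipped with the artifact are per-channel z-score statistics of shape (276,). That assumption and the helper names are illustrative, not taken from the hftrainer source.

```python
import numpy as np

DART276_DIM = 276  # feature width stated in this card


def denormalize_dart276(motion_norm, mean, std):
    """Map normalized features of shape (T, 276) back to DART276 units."""
    motion_norm = np.asarray(motion_norm, dtype=np.float32)
    if motion_norm.shape[-1] != DART276_DIM:
        raise ValueError(
            f"expected last dim {DART276_DIM}, got {motion_norm.shape[-1]}"
        )
    # Broadcast the per-channel statistics over the time axis.
    return motion_norm * np.asarray(std) + np.asarray(mean)


def duration_seconds(num_frames, fps=20):
    """Clip length in seconds at the 20 fps rate stated in this card."""
    return num_frames / fps
```

In practice the statistics would be loaded with `np.load` from `assets/meta/mean.npy` and `assets/meta/std.npy`. At 20 fps, a 200-frame clip lasts 10 seconds.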

Evaluation

The leaderboard row uses the HumanML3D official-test split (n=4042) and the shared corrected caption set. ViMoGen is sensitive to terse HumanML3D captions, so generation uses a ViMoGen-style prompt rewrite derived from the corrected caption. The rewrite adds presentation/context details such as camera, floor, and motion-capture clothing while preserving the original action content. The semantic evaluators are still computed against the same corrected HumanML3D caption protocol used by the other methods.
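
As a rough illustration of the kind of rewrite described above, the sketch below prepends presentation context to a terse caption while leaving the action text unchanged. It is a deliberately simple, hypothetical template: the default prefix is borrowed from the example prompt earlier in this card, and the actual rewrite workflow is not specified here and may be considerably richer.

```python
def rewrite_prompt(caption, prefix="Full-body shot, stable camera."):
    """Prepend presentation context to a caption without altering the action.

    Hypothetical template for illustration only.
    """
    action = caption.strip()
    if not action:
        raise ValueError("caption is empty")
    # Capitalize the first letter and ensure terminal punctuation.
    action = action[0].upper() + action[1:]
    if action[-1] not in ".!?":
        action += "."
    return f"{prefix} {action}"
```

Because the original caption is kept verbatim inside the rewritten prompt, the semantic evaluators can still be scored against the corrected caption.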

MotionStreamer-272 and MotionCLIP

| Evaluator | R@1 | R@2 | R@3 | FID | MM-Dist | Diversity |
|---|---|---|---|---|---|---|
| MotionStreamer-272 (HML round-trip GT) | 0.4291 | 0.5687 | 0.6518 | 152.2095 | 21.0737 | 24.1803 |
| MotionCLIP-135 no-L2 (HML round-trip GT) | 0.3572 | 0.4992 | 0.5893 | 457.5443 | 44.4103 | 21.6806 |
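
For readers unfamiliar with the retrieval columns, the sketch below shows one common way to compute R@k and MM-Dist from row-aligned text and motion embeddings. It ranks every motion in the pool by Euclidean distance. The candidate pool size, the embedding models, and any normalization used by the leaderboard evaluators are not specified here, so treat this as a generic illustration rather than the evaluators' implementation.

```python
import numpy as np


def retrieval_metrics(text_emb, motion_emb, top_k=3):
    """Return ([R@1, ..., R@top_k], MM-Dist) for row-aligned embeddings."""
    text_emb = np.asarray(text_emb, dtype=np.float64)
    motion_emb = np.asarray(motion_emb, dtype=np.float64)

    # Pairwise Euclidean distances: dist[i, j] = ||text_i - motion_j||.
    diff = text_emb[:, None, :] - motion_emb[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)

    # Position of the matched motion (index i) in each row's ranking; 0 is best.
    order = np.argsort(dist, axis=1)
    n = dist.shape[0]
    rank = np.argmax(order == np.arange(n)[:, None], axis=1)

    r_at_k = [float(np.mean(rank < k)) for k in range(1, top_k + 1)]
    # MM-Dist: mean distance between each text and its own motion.
    mm_dist = float(np.mean(np.diag(dist)))
    return r_at_k, mm_dist
```

Higher R@k and lower MM-Dist indicate that generated motions sit closer to their captions in the evaluator's embedding space.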

Physical Diagnostics

| Slide | Float | Jitter | Dynamic | Penet |
|---|---|---|---|---|
| 6.9485 | 23.7270 | 4.4370 | 16.3838 | 0.0000 |

Implementation Notes

- hftrainer-native runtime: hftrainer.models.motion.vimogen.network vendors the required ViMoGen transformer modules and scheduler.
- No ref_repo dependency: full-set HumanML3D inference uses scripts/eval/vimogen_t2m_humanml3d.py with ViMoGenBundle / ViMoGenPipeline.
- Prompt sensitivity: for leaderboard-quality generation, use the ViMoGen-style prompt rewrite workflow before inference. The plain corrected HumanML3D captions produce substantially weaker text following.
- Evaluator bridge: DART276 outputs are converted to repository motion_135, then to MotionStreamer-272 or MotionCLIP-135 for the shared cross-model leaderboard protocol.