MotionLCM - Latent Consistency Model for Human Motion Generation

A text-to-motion baseline integrated into the hftrainer Model Zoo. The reproduction keeps the MLD motion VAE, latent consistency denoiser, LCM scheduler wiring, and SentenceT5 text wrapper in the native hftrainer.models.motion.motionlcm.network package, so inference no longer imports the upstream repository at runtime.

Task: Text-to-Motion (T2M)
Bundle / Pipeline: MotionLCMBundle / MotionLCMPipeline
Motion representation: HumanML3D-263 (263-dim, 20 fps, 22 joints)
Backbone: MLD VAE + latent consistency denoiser, default 1 LCM step
Text encoder: SentenceT5-Large (sentence-transformers/sentence-t5-large, frozen)
Paper: MotionLCM: Real-time Controllable Motion Generation via Latent Consistency Model, Dai et al., ECCV 2024
Original code: https://github.com/Dai-Wenxun/MotionLCM

Weights

Current hftrainer artifact:

Artifact | Location | Contents | Status
MotionLCM HumanML3D | ZeyuLing/hftrainer-motionlcm-humanml3d / checkpoints/motionlcm/humanml3d | vae.safetensors + denoiser.safetensors + motionlcm_config.json + Mean.npy / Std.npy | uploaded hftrainer artifact; v1 benchmark checkpoint

The artifact, whether the local directory or the Hub copy, loads through the same from_pretrained surface as the published model-zoo checkpoints; this example uses the Hub id:

from hftrainer.pipelines.motionlcm import MotionLCMPipeline

pipe = MotionLCMPipeline.from_pretrained(
    "ZeyuLing/hftrainer-motionlcm-humanml3d",
    device="cuda",
)
motions = pipe.infer_t2m(
    ["a person walks forward then sits down"],
    [120],
    num_inference_steps=1,
)

Package the artifact from upstream checkpoints:

python3 scripts/eval/convert_motionlcm_checkpoint.py \
    --vae_ckpt ref_repo/MotionLCM/experiments_t2m/mld_humanml/mld_humanml_v1.ckpt \
    --denoiser_ckpt ref_repo/MotionLCM/experiments_t2m/motionlcm_humanml/motionlcm_humanml_v1.ckpt \
    --out_dir checkpoints/motionlcm/humanml3d \
    --verify

The frozen SentenceT5-Large encoder is resolved by name rather than duplicated inside the artifact. For fully offline use, snapshot the text encoder into the local Hugging Face cache before calling from_pretrained.
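One way to prepare that cache is sketched below, assuming the huggingface_hub package is installed. snapshot_download and the HF_HUB_OFFLINE / TRANSFORMERS_OFFLINE variables are existing Hugging Face features; the helper names and the injectable download argument are illustrative only, and whether hftrainer resolves the encoder through this same cache should be verified against the installed version.

```python
import os

TEXT_ENCODER_ID = "sentence-transformers/sentence-t5-large"


def prefetch_text_encoder(repo_id=TEXT_ENCODER_ID, download=None):
    """Download the text encoder once and return the local snapshot path.

    `download` is a hypothetical injection point for testing; by default
    it is huggingface_hub.snapshot_download.
    """
    if download is None:
        # Imported lazily so this module loads without huggingface_hub.
        from huggingface_hub import snapshot_download as download
    return download(repo_id=repo_id)


def enable_offline_mode(env=None):
    """Set the flags that keep Hugging Face libraries off the network.

    Call this before importing huggingface_hub or transformers, because
    those libraries read the flags at import time.
    """
    env = os.environ if env is None else env
    env["HF_HUB_OFFLINE"] = "1"
    env["TRANSFORMERS_OFFLINE"] = "1"
    return env
```

Run prefetch_text_encoder() once on a machine with network access, then call enable_offline_mode() at the top of the offline job before MotionLCMPipeline.from_pretrained.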

The published artifact uses the upstream v1 benchmark checkpoints: mld_humanml_v1.ckpt and motionlcm_humanml_v1.ckpt. These are the one-token latent checkpoints (latent_dim=[1, 256]) compatible with the official T2M test config. The non-v1 files in the same upstream folder are a different sixteen-token latent family and should not be treated as the model-card benchmark artifact.
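A small guard can keep the two families from being mixed up before a checkpoint is reported as the benchmark artifact. The sketch below assumes motionlcm_config.json stores the latent shape under a top-level "latent_dim" key; that key name and both helper names are assumptions for illustration, not confirmed hftrainer behavior.

```python
import json


def latent_family(config):
    """Classify a config dict by its latent token count (first entry of latent_dim)."""
    tokens = config["latent_dim"][0]
    if tokens == 1:
        return "one-token"
    if tokens == 16:
        return "sixteen-token"
    return "unknown"


def is_benchmark_artifact(config_path):
    """True only when the packaged config declares the one-token v1 shape [1, 256]."""
    with open(config_path, "r", encoding="utf-8") as f:
        config = json.load(f)
    # "latent_dim" is an assumed key name; adjust to the real config schema.
    return config.get("latent_dim") == [1, 256]
```

For example, is_benchmark_artifact("checkpoints/motionlcm/humanml3d/motionlcm_config.json") would return False for an artifact packaged from the non-v1 files, under the stated assumption about the config layout.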


Motion representation

HumanML3D-263, the standard redundant T2M feature (Guo et al.), 20 fps, 22-joint SMPL skeleton. Per frame (263 dims):

Slice | Dim | Meaning
root_rot_vel | 1 | root angular velocity (about Y)
root_lin_vel | 2 | root linear velocity (XZ plane)
root_y | 1 | root height
ric_data | 63 | local joint positions (21x3)
rot_data | 126 | local joint rotations (21x6, cont. 6D)
local_vel | 66 | local joint velocities (22x3)
foot_contact | 4 | binary foot-contact labels
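
The table above can be turned into index ranges for slicing a frame. This is a minimal sketch that assumes the slices are stored contiguously in the listed order; the names HML263_LAYOUT, hml263_slices, and split_frame are illustrative and not part of hftrainer.

```python
# (name, width) pairs in the order listed in the table above.
HML263_LAYOUT = [
    ("root_rot_vel", 1),
    ("root_lin_vel", 2),
    ("root_y", 1),
    ("ric_data", 21 * 3),
    ("rot_data", 21 * 6),
    ("local_vel", 22 * 3),
    ("foot_contact", 4),
]


def hml263_slices(layout=HML263_LAYOUT):
    """Return {name: slice} covering consecutive feature columns."""
    slices, start = {}, 0
    for name, width in layout:
        slices[name] = slice(start, start + width)
        start += width
    return slices


def split_frame(frame):
    """Split one 263-value frame (any sliceable sequence) into named parts."""
    if len(frame) != 263:
        raise ValueError(f"expected 263 features, got {len(frame)}")
    return {name: frame[sl] for name, sl in hml263_slices().items()}
```

The same slice objects index the last axis of an array-like motion of shape (frames, 263), for example motion[..., hml263_slices()["foot_contact"]] with NumPy.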

MotionLCM samples in the MLD latent space and decodes directly back to HumanML3D-263. Convert to SMPL or MotionStreamer-272 with hftrainer.motion.representation.convert when cross-model comparison requires another evaluator space.


Evaluation

Generation follows the shared HumanML3D official-test protocol used by the leaderboard: 4042 official test ids, corrected selected captions under outputs/evaluation/t2m/humanml3d_official_test/captions/gt_motionclip_selected_20260622/, native 263-dim at 20 fps, and one prediction per test id.

python3 scripts/eval/motionlcm_t2m_h3d263.py \
    --anno_file outputs/evaluation/t2m/humanml3d_official_test/captions/gt_motionclip_selected_20260622/test_hml3d_official272_gtlen_motionclip_selected_caption.json \
    --anno_data_dir . \
    --model_path checkpoints/motionlcm/humanml3d \
    --num_inference_steps 1 \
    --out_dir outputs/evaluation/t2m/humanml3d_official_test/hml263/motionlcm
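
Before scoring, it can help to confirm that the output really contains one prediction per test id. The check below is a sketch, not part of the evaluation scripts; how the ids are collected (for instance from output file names) depends on the script's output layout, which is not specified here.

```python
from collections import Counter


def check_one_prediction_per_id(test_ids, predicted_ids):
    """Return (missing, duplicated, unexpected) ids.

    All three lists are empty when every test id has exactly one prediction.
    """
    expected = set(test_ids)
    counts = Counter(predicted_ids)
    missing = sorted(expected - set(counts))
    duplicated = sorted(i for i, n in counts.items() if n > 1)
    unexpected = sorted(set(counts) - expected)
    return missing, duplicated, unexpected
```

For the official-test protocol, len(test_ids) should be 4042 and all three returned lists should be empty.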

The full reproduction pipeline writes the canonical outputs:

Representation | Canonical path
HML263 | outputs/evaluation/t2m/humanml3d_official_test/hml263/motionlcm
SMPL motion_135 | outputs/evaluation/t2m/humanml3d_official_test/motion135/motionlcm
MotionStreamer-272 | outputs/evaluation/t2m/humanml3d_official_test/ms272/motionlcm

Run the Taiji wrapper for full generation, conversion, and evaluators:

python3 scripts/submit/submit_motionlcm_hml3d_full_taiji.py \
    --gpu V100 \
    --num-gpus 8 \
    --num-inference-steps 1 \
    --elastic

Report the LCM step count (--num_inference_steps in the generation script, --num-inference-steps in the Taiji wrapper) alongside any metrics. Copy model-zoo table values from the generated evaluator JSONs under outputs/evaluation/t2m/humanml3d_official_test/_runs/<run>/metrics/ rather than typing them by hand.

For HumanML3D-263 semantic metrics, the evaluator texts_dir must match the captions used for generation. The selected-caption official-test run is scored with outputs/evaluation/t2m/humanml3d_official_test/captions/gt_motionclip_selected_20260622/texts; scoring these outputs against the older CondMDI text files yields mismatched R-Precision / MM-Dist values.
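
Copying values from the evaluator JSONs can be scripted so that no number is retyped. The sketch below assumes each metrics JSON is a flat object keyed by the column names used in the table; the real evaluator schema may differ, and format_row / row_from_json are hypothetical helpers.

```python
import json

COLUMNS = ["R@1", "R@2", "R@3", "FID", "MM-Dist", "Diversity"]


def format_row(evaluator, metrics, nfe, columns=COLUMNS):
    """Render one table row from a metrics dict, failing loudly on missing keys."""
    missing = [c for c in columns if c not in metrics]
    if missing:
        raise KeyError(f"metrics JSON lacks: {missing}")
    cells = [f"{float(metrics[c]):.4f}" for c in columns]
    # The step count is emitted with every row so it is never dropped.
    return " ".join([evaluator, f"(NFE={nfe})", *cells])


def row_from_json(path, evaluator, nfe):
    """Load one evaluator JSON (assumed flat schema) and format it as a row."""
    with open(path, "r", encoding="utf-8") as f:
        return format_row(evaluator, json.load(f), nfe)
```

Raising on a missing key is deliberate: a silently blank cell is harder to notice than a failed export.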

Current HumanML3D official-test metrics (4042 generated motions, selected caption protocol, NFE=1):

Evaluator | R@1 | R@2 | R@3 | FID | MM-Dist | Diversity
HumanML3D-263 (selected captions) | 0.5093 | 0.7080 | 0.8108 | 0.3396 | 2.9694 | 9.6407
MotionStreamer-272 (HML roundtrip GT) | 0.5657 | 0.7346 | 0.8075 | 44.0549 | 19.4543 | 24.6395
MotionCLIP-135 no-L2 (HML roundtrip GT) | 0.3620 | 0.5157 | 0.6078 | 146.7212 | 42.6430 | 22.9160

Physical diagnostics on SMPL motion_135: Slide 4.2898, Float 19.1150, Jitter 3.2493, Dynamic 19.8250.

As with other native HML263 baselines, the MS272 row includes a representation bridge (HML263 -> SMPL motion_135 via IK refine-80 -> MotionStreamer-272) and should be interpreted as a cross-representation diagnostic, not a native MotionLCM paper number.


Implementation notes

  • hftrainer-native runtime: hftrainer/models/motion/motionlcm/network/ holds the MLD VAE, latent denoiser, text wrapper, scheduler config, and generation helper with package-local imports.
  • Checkpoint architecture is inferred from raw weights: upstream releases include both one-token v1 and sixteen-token checkpoint families; raw loading reads vae.global_motion_token / vae.latent_pre.weight so the artifact is built with the matching latent shape.
  • Sub-modules: vae + denoiser + scheduler; the default generation path uses distilled classifier-free guidance folded into the timestep conditioning.
  • Normalization travels with the checkpoint: Mean.npy / Std.npy are the HumanML3D training stats embedded in the artifact.
  • Text encoder: SentenceT5-Large is frozen and currently resolved by name; keep this explicit in any published Hub card.