MotionGPT - Human Motion as a Foreign Language

Text-to-motion baseline integrated into the hftrainer Model Zoo. The runtime is self-contained under `hftrainer.models.motion.motiongpt.network` and does not import the original repository at inference time.


| Field | Value |
| --- | --- |
| Task | Text-to-Motion (T2M), motion-language generation |
| Bundle / Pipeline | `MotionGPTBundle` / `MotionGPTPipeline` |
| Processed HF artifact | `ZeyuLing/hftrainer-motiongpt-humanml3d` |
| Motion representation | HumanML3D-263 (263-dim, 20 fps, 22 joints) |
| Architecture | Motion tokenizer VQ-VAE + FLAN-T5-base-style language model with motion tokens |
| Paper | MotionGPT: Human Motion as a Foreign Language, Jiang et al., NeurIPS 2023 - arXiv:2306.14795 |
| Original code | https://github.com/OpenMotionLab/MotionGPT |

Weights

Self-contained hftrainer artifact:

| Artifact | Location | Contents | Status |
| --- | --- | --- | --- |
| MotionGPT HumanML3D | `ZeyuLing/hftrainer-motiongpt-humanml3d` | `motiongpt_s3_h3d.tar` + `assets/meta/{mean,std}.npy` + `deps/flan-t5-base/` + `model_index.json` | public Hub artifact |
| local mirror | `checkpoints/baselines/motiongpt` | same layout | optional local cache |

Use directly from the Hub:

from hftrainer.pipelines.motiongpt import MotionGPTPipeline

pipe = MotionGPTPipeline.from_pretrained(
    "ZeyuLing/hftrainer-motiongpt-humanml3d",
    bundle_kwargs={"local_files_only": False},
    device="cuda",
)
motions = pipe.infer_t2m(
    ["a person walks forward then sits down"],
    [120],
)  # list of motions, one per prompt, each of shape (T, 263)
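
The second argument in the call above appears to be a per-prompt target length in frames; the card does not say so explicitly, so treat that reading as an assumption. Under it, and using the 20 fps rate stated for HumanML3D-263, a duration in seconds can be converted to a frame count with a small helper. `seconds_to_frames` is a hypothetical name and is not part of hftrainer.

```python
FPS = 20  # HumanML3D-263 frame rate stated in this card


def seconds_to_frames(seconds: float, fps: int = FPS) -> int:
    """Convert a duration in seconds to a whole number of frames (at least 1)."""
    return max(1, round(seconds * fps))


prompts = [
    "a person walks forward then sits down",
    "a person jumps in place",
]
durations_s = [6.0, 2.5]

# Assumption: one frame count per prompt, in the same order as the prompts.
lengths = [seconds_to_frames(s) for s in durations_s]
print(lengths)  # [120, 50]
```

The resulting `lengths` list could then be passed alongside `prompts` to `pipe.infer_t2m`, provided the frame-count interpretation holds.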

For a local mirror:

pipe = MotionGPTPipeline.from_pretrained(
    "checkpoints/baselines/motiongpt",
    bundle_kwargs={"local_files_only": True},
    device="cuda",
)
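
Before loading with `local_files_only=True`, it can help to confirm that the mirror has the layout listed in the Weights table. The sketch below checks only for the entries named there; the helper name `missing_artifact_entries` is hypothetical, and the real bundle may require further files that the card does not list.

```python
from pathlib import Path

# Entries taken from the "Contents" column of the Weights table.
EXPECTED_FILES = [
    "motiongpt_s3_h3d.tar",
    "assets/meta/mean.npy",
    "assets/meta/std.npy",
    "model_index.json",
]
EXPECTED_DIRS = ["deps/flan-t5-base"]


def missing_artifact_entries(root):
    """Return the expected files and directories that are absent under root."""
    root = Path(root)
    missing = [p for p in EXPECTED_FILES if not (root / p).is_file()]
    missing += [p for p in EXPECTED_DIRS if not (root / p).is_dir()]
    return missing


# An empty list means every entry named in the card was found.
print(missing_artifact_entries("checkpoints/baselines/motiongpt"))
```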

Motion Representation

MotionGPT natively generates HumanML3D-263 at 20 fps. For shared SMPL and MotionStreamer-272 evaluation, use the validated bridge:

HumanML3D-263 -> SMPL motion_135 via IK refine-80 -> MotionStreamer-272
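
For orientation, the 263 dimensions can be split into the feature groups used by the public HumanML3D feature extraction for a 22-joint skeleton. This card does not spell out the layout, so the sketch below is an assumption based on that convention and should be checked against the data pipeline in use. The group names are illustrative.

```python
import numpy as np

NUM_JOINTS = 22

# (name, width) pairs in storage order; the widths sum to 263.
HML263_LAYOUT = [
    ("root_angular_velocity", 1),
    ("root_linear_velocity_xz", 2),
    ("root_height", 1),
    ("joint_positions", (NUM_JOINTS - 1) * 3),     # root-relative, non-root joints
    ("joint_rotations_6d", (NUM_JOINTS - 1) * 6),  # 6D rotations, non-root joints
    ("joint_velocities", NUM_JOINTS * 3),
    ("foot_contacts", 4),
]


def split_hml263(motion):
    """Split an array of shape (..., 263) into named feature groups."""
    if motion.shape[-1] != 263:
        raise ValueError(f"expected 263 features, got {motion.shape[-1]}")
    parts, start = {}, 0
    for name, width in HML263_LAYOUT:
        parts[name] = motion[..., start:start + width]
        start += width
    return parts


# Stand-in for one generated motion of 120 frames.
parts = split_hml263(np.zeros((120, 263)))
print({name: part.shape for name, part in parts.items()})
```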

The artifact packages the released MotionGPT checkpoint, HumanML3D statistics, and the local FLAN-T5-base tokenizer/config files required to instantiate the language model without a separate upstream checkout.
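
The packaged `mean.npy` and `std.npy` are presumably per-feature statistics used for z-score normalization of the 263-dim features, as is usual for HumanML3D; the card does not state whether the pipeline output is already denormalized. The sketch below only illustrates how such statistics are typically applied, with random stand-ins in place of the real files. The function names and the `eps` guard are assumptions.

```python
import numpy as np


def normalize(motion, mean, std, eps=1e-8):
    """Z-score features per dimension; eps guards against a zero std."""
    return (motion - mean) / (std + eps)


def denormalize(motion_norm, mean, std, eps=1e-8):
    """Invert normalize() to recover features in their original scale."""
    return motion_norm * (std + eps) + mean


rng = np.random.default_rng(0)
# Stand-ins; with the real artifact these would be loaded with
# np.load on assets/meta/mean.npy and assets/meta/std.npy.
mean = rng.normal(size=263)
std = rng.uniform(0.5, 2.0, size=263)
motion = rng.normal(size=(120, 263))

restored = denormalize(normalize(motion, mean, std), mean, std)
print(np.allclose(restored, motion))
```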

HumanML3D Leaderboard Metrics

The rows below use the shared HumanML3D official-test caption protocol and the HML263 round-trip GT reference for SMPL-based evaluators. MotionCLIP metrics use raw projection embeddings without L2 normalization.

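| Evaluator | R1 ↑ | R2 ↑ | R3 ↑ | FID ↓ | MM ↓ | Div ↑ |
| --- | --- | --- | --- | --- | --- | --- |
| MotionStreamer-272 | 0.4940 | 0.6352 | 0.6944 | 23.6811 | 19.6781 | 25.5410 |
| MotionCLIP-135 no-L2 | 0.3688 | 0.5049 | 0.5828 | 84.8756 | 42.8579 | 23.2174 |

↑: higher is better; ↓: lower is better.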
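
To make the retrieval columns and the L2-normalization remark concrete, here is a minimal sketch of top-k retrieval precision over paired text and motion embeddings. It assumes that R1, R2 and R3 denote top-1, top-2 and top-3 retrieval precision, and it ranks by Euclidean distance over the whole pool. The evaluators behind the table may use a different distance, candidate pool size, or batching, so this is not a reproduction of the reported numbers.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit Euclidean norm."""
    return x / np.maximum(np.linalg.norm(x, axis=-1, keepdims=True), eps)


def top_k_retrieval_precision(text_emb, motion_emb, ks=(1, 2, 3)):
    """Fraction of texts whose paired motion (same row index) ranks in the top k."""
    # Pairwise Euclidean distances, shape (num_texts, num_motions).
    dists = np.linalg.norm(text_emb[:, None, :] - motion_emb[None, :, :], axis=-1)
    ranks = np.argsort(dists, axis=1)  # nearest motion first
    target = np.arange(len(text_emb))[:, None]
    return {k: float((ranks[:, :k] == target).any(axis=1).mean()) for k in ks}


rng = np.random.default_rng(0)
text = rng.normal(size=(32, 16))
# Synthetic motions: noisy copies of the text embeddings with varying scale.
motion = (text + 0.3 * rng.normal(size=text.shape)) * rng.uniform(0.5, 3.0, size=(32, 1))

print("raw:", top_k_retrieval_precision(text, motion))
print("L2 :", top_k_retrieval_precision(l2_normalize(text), l2_normalize(motion)))
```

Because raw embeddings keep their magnitudes, rankings computed with and without normalization can differ, which is why the table labels the MotionCLIP row as no-L2.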

Physical metrics:

| Slide ↓ | Float ↓ | Jitter ↓ | Dynamic ↓ |
| --- | --- | --- | --- |
| 3.8783 | 10.8835 | 5.1680 | 21.0609 |

Implementation Notes

- Artifact inference imports only `hftrainer.models.motion.motiongpt.network`.
- The released checkpoint has FLAN-T5-base / T5-v1.1 FFN shapes rather than ordinary t5-base FFN shapes.
- The checkpoint stores a distinct LM head while sharing the encoder and decoder input embeddings; the bundle keeps `shared_encoder_decoder_untied_lm_head`.
- The validated HumanML3D setting uses the official no-length prompt mode (`official_nolen`) and the selected-caption official-test protocol.
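
The FFN shape difference noted above can be checked from parameter names and shapes alone. In the Hugging Face T5 implementation, the original t5-base feed-forward block has a single input projection `DenseReluDense.wi` (d_ff = 3072), while T5-v1.1 and FLAN-T5-base use a gated block with `wi_0` and `wi_1` (d_ff = 2048). The sketch below assumes the checkpoint's keys end with those Hugging Face suffixes, which the card does not confirm; `ffn_variant` is a hypothetical helper.

```python
def ffn_variant(shapes):
    """Classify a T5-style FFN from a mapping of parameter name to shape.

    Returns (variant, d_ff), where variant is "gated", "plain" or "unknown".
    Linear weights are assumed stored as (out_features, in_features).
    """
    for name, shape in shapes.items():
        if name.endswith("DenseReluDense.wi_0.weight"):
            return "gated", shape[0]  # T5-v1.1 / FLAN-T5 style
        if name.endswith("DenseReluDense.wi.weight"):
            return "plain", shape[0]  # original t5 style
    return "unknown", None


# Illustrative shapes for the two base-size variants (d_model = 768).
flan_like = {"encoder.block.0.layer.1.DenseReluDense.wi_0.weight": (2048, 768)}
t5_like = {"encoder.block.0.layer.1.DenseReluDense.wi.weight": (3072, 768)}

print(ffn_variant(flan_like))  # ('gated', 2048)
print(ffn_variant(t5_like))    # ('plain', 3072)
```

Checking whether the LM head is untied needs the tensor values, not just the shapes: a tied state dict can still list an LM head key, so the head has to be compared against the shared embedding weights.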

