T2M-GPT: Generating Human Motion from Textual Descriptions with Discrete Representations
Text-to-motion baseline integrated into the hftrainer Model Zoo. Our
reproduction is fully self-contained and independent of ref_repo: the
VQ-VAE motion tokenizer (HumanVQVAE) and the GPT-style autoregressive
generator (Text2Motion_Transformer) are vendored into
hftrainer.models.motion.t2mgpt, preserving numerical parity with the released
HumanML3D checkpoints. New hftrainer artifacts include the frozen CLIP ViT-B/32
text encoder as clip.safetensors; legacy lightweight artifacts still fall
back to loading CLIP by name.
| Item | Value |
|---|---|
| Task | Text-to-Motion (T2M) |
| Bundle / Pipeline | T2MGPTBundle / T2MGPTPipeline |
| Processed HF artifact | ZeyuLing/hftrainer-t2mgpt-humanml3d |
| Motion representation | HumanML3D-263 (263-dim, 20 fps, 22 joints) |
| Tokenizer | HumanVQVAE (VQ-VAE, EMA-reset quantizer, codebook 512×512, down_t=2) |
| Generator | Text2Motion_Transformer (GPT, 9 layers, 16 heads, embed 1024) |
| Text encoder | CLIP ViT-B/32 (frozen) |
| Paper | T2M-GPT: Generating Human Motion from Textual Descriptions with Discrete Representations, Zhang et al., CVPR 2023, arXiv:2301.06052 |
| Original code | https://github.com/Mael-zys/T2M-GPT |
Weights
Self-contained hftrainer artifact (diffusers-style from_pretrained):
| Artifact | Location | Contents | Status |
|---|---|---|---|
| T2M-GPT HumanML3D | ZeyuLing/hftrainer-t2mgpt-humanml3d | vq.safetensors + gpt.safetensors + clip.safetensors + t2mgpt_config.json + Mean.npy / Std.npy | public Hub artifact |
| local mirror | checkpoints/t2mgpt/humanml3d | same layout | optional local cache |
from hftrainer.pipelines.t2mgpt import T2MGPTPipeline
pipe = T2MGPTPipeline.from_pretrained(
"ZeyuLing/hftrainer-t2mgpt-humanml3d",
device="cuda",
)
motions = pipe.infer_t2m(
["a person walks forward then sits down"],
[120],
) # list of (T, 263)
The artifact is produced from the released upstream .pth checkpoints (a VQ-VAE
net_last.pth with a net key and a GPT net_best_fid.pth with a trans key)
with scripts/eval/convert_t2mgpt_checkpoint.py. The --verify flag reloads the
artifact and asserts bit-identical generation after the round-trip
(max-abs-diff = 0):
python3 scripts/eval/convert_t2mgpt_checkpoint.py \
--vq_path ref_repo/T2M-GPT/pretrained/VQVAE/net_last.pth \
--gpt_path ref_repo/T2M-GPT/pretrained/VQTransformer_corruption05/net_best_fid.pth \
--out_dir checkpoints/t2mgpt/humanml3d --verify
# -> [verify] OK: artifact is bit-identical to the raw checkpoint.
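For readers who want to run a similar round-trip check on their own outputs, a minimal sketch of a max-abs-diff comparison follows. It only illustrates the idea and is not the logic of convert_t2mgpt_checkpoint.py; the helper name max_abs_diff and the random stand-in arrays are assumptions.

```python
import numpy as np

def max_abs_diff(a, b):
    """Largest element-wise absolute difference between two arrays (hypothetical helper)."""
    a, b = np.asarray(a), np.asarray(b)
    if a.shape != b.shape:
        raise ValueError(f"shape mismatch: {a.shape} vs {b.shape}")
    return float(np.max(np.abs(a - b))) if a.size else 0.0

# Stand-ins for one (T, 263) motion generated before and after conversion.
rng = np.random.default_rng(0)
before = rng.standard_normal((120, 263)).astype(np.float32)
after = before.copy()

diff = max_abs_diff(before, after)
print(f"max-abs-diff = {diff}")
```

A result of exactly 0 means the two arrays agree bit for bit; any positive value indicates the round-trip changed at least one element.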
Use it:
from hftrainer.pipelines.t2mgpt import T2MGPTPipeline
pipe = T2MGPTPipeline.from_pretrained("checkpoints/t2mgpt/humanml3d", device="cuda")
# let the GPT decide the length via its EOS token:
motions = pipe.infer_t2m(["a person walks forward then sits down"]) # list of (T, 263)
# or clip to a fixed length (frames @ 20 fps):
motions = pipe.infer_t2m(["a person walks forward then sits down"], [120])
You can also drive it directly from the released .pth weights, no conversion
needed:
bundle = T2MGPTBundle(
vq_path="ref_repo/T2M-GPT/pretrained/VQVAE/net_last.pth",
gpt_path="ref_repo/T2M-GPT/pretrained/VQTransformer_corruption05/net_best_fid.pth",
)
Motion representation
HumanML3D-263, the standard redundant T2M feature (Guo et al.), 20 fps, 22-joint SMPL skeleton. Per frame (263 dims):
| Slice | Dim | Meaning |
|---|---|---|
| root_rot_vel | 1 | root angular velocity (about Y) |
| root_lin_vel | 2 | root linear velocity (XZ plane) |
| root_y | 1 | root height |
| ric_data | 63 | local joint positions (21×3) |
| rot_data | 126 | local joint rotations (21×6, cont. 6D) |
| local_vel | 66 | local joint velocities (22×3) |
| foot_contact | 4 | binary foot-contact labels |
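The slice widths in the table can be turned into index ranges. The sketch below is a minimal illustration, assuming the slices are laid out in the order the table lists them; split_hml263 is a hypothetical helper, not part of hftrainer.

```python
import numpy as np

# Slice names and widths copied from the table; the ordering is assumed to follow it.
HML263_LAYOUT = [
    ("root_rot_vel", 1),
    ("root_lin_vel", 2),
    ("root_y", 1),
    ("ric_data", 63),
    ("rot_data", 126),
    ("local_vel", 66),
    ("foot_contact", 4),
]

def split_hml263(motion):
    """Split a (T, 263) array into named slices (hypothetical helper)."""
    total = sum(width for _, width in HML263_LAYOUT)
    if motion.shape[-1] != total:
        raise ValueError(f"expected last dim {total}, got {motion.shape[-1]}")
    parts, start = {}, 0
    for name, width in HML263_LAYOUT:
        parts[name] = motion[..., start:start + width]
        start += width
    return parts

parts = split_hml263(np.zeros((120, 263), dtype=np.float32))
print({name: arr.shape for name, arr in parts.items()})
```

The widths sum to 1 + 2 + 1 + 63 + 126 + 66 + 4 = 263, which is why the helper can validate the last dimension before slicing.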
The VQ-VAE tokenizes this with a temporal downsampling of down_t=2
(stride_t=2), i.e. one token ≈ 4 frames, so a 196-frame motion maps to ≤ 49
discrete tokens drawn from a 512-entry codebook.
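The arithmetic behind the token count can be written out as follows. This is a sketch under the assumption that each of the down_t stages divides the temporal axis by stride_t and that leftover frames are floored away; num_motion_tokens is a hypothetical helper.

```python
def num_motion_tokens(n_frames: int, down_t: int = 2, stride_t: int = 2) -> int:
    """Token count after down_t temporal downsampling stages of stride stride_t."""
    frames_per_token = stride_t ** down_t  # 2 ** 2 = 4 frames per token
    return n_frames // frames_per_token    # assumption: leftover frames are dropped

print(num_motion_tokens(196))  # 196 // 4 = 49
print(num_motion_tokens(120))  # 120 // 4 = 30
```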
Generation
Two vendored stages (parity with the gold-standard t2mgpt_infer_hml3d263.py):
- Text2Motion_Transformer (GPT): autoregressively samples motion token indices conditioned on the CLIP ViT-B/32 text embedding. Unlike a diffusion model, T2M-GPT determines the sequence length itself via a learned EOS token, so no target length is fed by default.
- HumanVQVAE.forward_decoder: de-quantizes the predicted (T,) token sequence and decodes it to the 263-dim feature, then de-normalises with the training Mean/Std.
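The two stages can be pictured with a toy loop. The sketch below is not the vendored implementation: the EOS index (assumed here to be nb_code), the toy_logits stand-in for the transformer, and the helper names are all assumptions made for illustration.

```python
import numpy as np

NB_CODE = 512      # codebook size from the overview table
EOS = NB_CODE      # assumption: the end token sits right after the codebook indices
MAX_TOKENS = 49    # 196 frames at 4 frames per token

def sample_tokens(next_token_logits, rng, max_tokens=MAX_TOKENS):
    """Sample token indices one at a time until EOS or the length cap."""
    tokens = []
    for _ in range(max_tokens):
        logits = next_token_logits(tokens)          # shape (NB_CODE + 1,)
        probs = np.exp(logits - logits.max())       # softmax, shifted for stability
        probs /= probs.sum()
        tok = int(rng.choice(len(probs), p=probs))
        if tok == EOS:                              # the model decides the length
            break
        tokens.append(tok)
    return np.asarray(tokens, dtype=np.int64)

def denormalize(features, mean, std):
    """Undo training normalisation: x * Std + Mean."""
    return features * std + mean

def toy_logits(tokens):
    """Stand-in for the GPT: emits 0, 1, ..., 9 and then EOS."""
    logits = np.full(NB_CODE + 1, -np.inf)
    logits[EOS if len(tokens) >= 10 else len(tokens)] = 0.0
    return logits

tokens = sample_tokens(toy_logits, np.random.default_rng(0))
restored = denormalize(np.zeros((4, 263)), mean=np.ones(263), std=np.full(263, 2.0))
print(tokens, restored.shape)
```

In the real pipeline the sampled indices would go to the VQ-VAE decoder before de-normalisation; the toy version skips decoding and only shows where the EOS check and the Mean/Std step sit.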
Evaluation
Generation follows the official HumanML3D protocol (standard test split,
native 263-dim @ 20 fps, first caption, model-chosen length, not truncated to
GT), and scoring uses the persisted HumanML263Evaluator. Reproduce with:
# 1) generate (whole test split, GPT picks the length)
python3 scripts/eval/t2mgpt_t2m_h3d263.py \
--model_path checkpoints/t2mgpt/humanml3d \
--out_dir outputs/evaluation/t2mgpt_h3d263_official/t2mgpt_263 \
--progress
# 2) score with the HumanML3D-263 evaluator
python3 scripts/eval/verify_evaluators.py --which hml263 \
--hml263-pred outputs/evaluation/t2mgpt_h3d263_official/t2mgpt_263
HumanML3D-263 evaluator (native space)
HumanML263Evaluator now follows the canonical Guo/MoMask
Text2MotionDatasetEval protocol (time-tagged captions spawn sub-clip samples,
random caption + shuffle + coin2 length jitter + drop_last), so its GT/Real
reference row matches the published GT row. Two measurements:
- T2M-GPT predictions (n_samples = 3940 full-clip captions, caption = first to match generation, n_repeats = 20).
- GT/Real reference (full canonical set incl. sub-clips, n_samples = 4402, random caption): validates the evaluator against the paper's GT row.
| Metric | hftrainer T2M-GPT | T2M-GPT paper | hftrainer GT/Real | paper GT/Real |
|---|---|---|---|---|
| FID ↓ | 0.176 | 0.116 | - | 0.002 |
| R-Precision Top-1 ↑ | 0.470 | 0.417 | 0.508 | 0.511 |
| R-Precision Top-2 ↑ | 0.660 | - | 0.698 | 0.703 |
| R-Precision Top-3 ↑ | 0.761 | 0.745 | 0.796 | 0.797 |
| MM-Dist ↓ | 3.238 | 3.118 | 2.970 | 2.974 |
| Diversity → | 9.563 | 9.761 | 9.398 | 9.503 |
The GT/Real row now lands on the paper's GT (R@3 0.796 vs 0.797, MM-Dist 2.970
vs 2.974), confirming the evaluator is faithful to the canonical protocol. The
generation path is verified bit-identical to the released T2M-GPT checkpoint
(convert_t2mgpt_checkpoint.py --verify, max-abs-diff = 0); R-Precision even
edges above the reported values (Top-3 0.761 vs 0.745).
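For orientation, R-Precision is commonly computed by ranking the motions in a pool by their distance to each text embedding and checking whether the paired motion lands in the top k. The sketch below assumes Euclidean distance and a pool of 32 pairs, as the metric is usually described; it is not the HumanML263Evaluator code, and the random embeddings are stand-ins.

```python
import numpy as np

def r_precision(text_emb, motion_emb, top_k=3):
    """Top-1..top_k retrieval accuracy within one pool of paired embeddings.

    Row i of text_emb is assumed to be paired with row i of motion_emb.
    """
    # Pairwise Euclidean distances: row i compares text i against every motion.
    dists = np.linalg.norm(text_emb[:, None, :] - motion_emb[None, :, :], axis=-1)
    ranked = np.argsort(dists, axis=1)[:, :top_k]       # nearest motions per text
    hits = ranked == np.arange(len(dists))[:, None]     # True where the paired motion appears
    return np.cumsum(hits, axis=1).mean(axis=0)         # cumulative: [Top-1, Top-2, Top-3]

rng = np.random.default_rng(0)
motion = rng.standard_normal((32, 16))
text = motion + 0.01 * rng.standard_normal((32, 16))    # nearly perfect pairing
scores = r_precision(text, motion)
print(scores)
```

Because the hits are accumulated across ranks, Top-1 ≤ Top-2 ≤ Top-3 always holds, which matches the ordering of the rows in the table.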
The residual FID gap (0.176 vs 0.116) is a population difference, not a model or
evaluator bug: the canonical eval set contains 4402 samples (incl. sub-clips),
but the T2M-GPT predictions here cover only the 3940 full clips; sub-clips have
no matching prediction and are dropped, so pred-vs-GT runs on the full-clip
subset. A strictly paper-comparable FID requires generating one prediction per
canonical sample (incl. <id>__sub<k>); see the Model Zoo TODO.
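The population mismatch can be illustrated with a small matching step. This is a sketch with made-up sample ids that follow the `<id>__sub<k>` naming; split_population is a hypothetical helper and does not reproduce the evaluator's matching code.

```python
def split_population(canonical_ids, predicted_ids):
    """Separate canonical samples that have a prediction from those that do not."""
    predicted = set(predicted_ids)
    matched = [i for i in canonical_ids if i in predicted]
    dropped = [i for i in canonical_ids if i not in predicted]
    return matched, dropped

# Hypothetical ids: two full clips, one of which also spawns two sub-clips.
canonical = ["000021", "000021__sub0", "000021__sub1", "000143"]
predicted = ["000021", "000143"]  # generation covered full clips only

matched, dropped = split_population(canonical, predicted)
print(f"scored: {len(matched)}, dropped: {len(dropped)}")
```

With one prediction per canonical id, including the sub-clip ids, the dropped list would be empty and the pred-vs-GT statistics would be computed on the same population as the GT/Real row.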
MotionStreamer-272 evaluator (SMPL retarget path)
For cross-model comparison with the MotionStreamer / HYMotion-M2M evaluator, the native HumanML3D-263 predictions are retargeted through the validated MDM-style chain:
# Stage A: HML263 -> SMPL motion_135 (IK refine-80, 20 -> 30 fps)
# Stage B: motion_135 -> MotionStreamer-272
# Stage C: MotionStreamer272Evaluator
N=4 bash scripts/eval/_run_263_to_ms272_taiji.sh
The IK stage uses hml263_to_smpl_ik.py --refine-iters 80 --floor-align --rot6d-convention row, matching the MDM reproduction path; using the
non-refined analytic IK substantially inflates MS-272 FID.
| Metric | hftrainer T2M-GPT | MS-272 GT/Real |
|---|---|---|
| FID ↓ | 113.316 | 0.000 |
| R-Precision Top-1 ↑ | 0.446 | 0.706 |
| R-Precision Top-2 ↑ | 0.600 | 0.857 |
| R-Precision Top-3 ↑ | 0.678 | 0.911 |
| MM-Dist ↓ | 19.787 | 15.007 |
| Diversity → | 25.405 | 27.300 |
Run details: n_repeats = 20, n_samples_used = 7328,
skipped_no_pred = 66, outputs under
outputs/evaluation/ms272_from263/t2mgpt_272, metrics in
outputs/evaluation/ms272_from263/metrics_t2mgpt.json.
Implementation notes
- Vendored, ref_repo-independent: hftrainer/models/motion/t2mgpt/ holds the VQ-VAE and the GPT transformer with package-relative imports; the artifact loads with zero dependency on ref_repo or the original .pth format.
- Sub-modules: vqvae (HumanVQVAE) + gpt (Text2Motion_Transformer), configured by t2mgpt_config.json (vqvae: nb_code=512, code_dim=512, down_t=2, quantizer=ema_reset; gpt: embed_dim_gpt=1024, num_layers=9, n_head_gpt=16, block_size=51).
- CLIP: frozen ViT-B/32 is stored once as clip.safetensors in new artifacts and restored by T2MGPTBundle.from_pretrained. Legacy lightweight artifacts without clip.safetensors still fall back to clip_name.
- Normalization travels with the checkpoint: Mean.npy / Std.npy are the 263-dim training stats, embedded in the artifact (self-contained).
- Length: the GPT emits an EOS token, so the model selects its own sequence length; pass --truncate_to_gt to t2mgpt_t2m_h3d263.py to clip to GT length instead.
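The configuration values listed in the notes can be gathered into one structure for quick sanity checks. The sketch below assumes a two-section nesting (vqvae / gpt) based on how the notes group the keys; the actual layout of t2mgpt_config.json may differ.

```python
import json

# Values taken from the notes above; the exact nesting of the real file is an assumption.
t2mgpt_config = {
    "vqvae": {"nb_code": 512, "code_dim": 512, "down_t": 2, "quantizer": "ema_reset"},
    "gpt": {"embed_dim_gpt": 1024, "num_layers": 9, "n_head_gpt": 16, "block_size": 51},
}

gpt_cfg = t2mgpt_config["gpt"]
# Standard multi-head attention splits the embedding evenly across heads.
head_dim = gpt_cfg["embed_dim_gpt"] // gpt_cfg["n_head_gpt"]

print(json.dumps(t2mgpt_config, indent=2))
print(f"per-head dimension: {head_dim}")
```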