Dreamer 4 reproduction on LIBERO — negative-result archive
This repository archives a faithful reproduction of Hafner et al. 2025 (arXiv:2509.24527) — Training Agents Inside of Scalable World Models — on the LIBERO 130-task benchmark, used as a cross-architecture baseline in the dreamer-vla paper.
The reproduction is built on top of nicklashansen/dreamer4 (unofficial PyTorch impl; the paper's official code is not released). We trained five variants spanning the architecture/initialization/aux-loss/scale axes; all variants produce negative R² on the LIBERO task-OOD action-probe, demonstrating that the Dreamer-V4 shortcut-forcing + tokenizer-reconstruction objective does not surface action-relevant features at LIBERO scale, regardless of pretraining or scale ablations.
This is published as a public reference companion to the paper's main negative finding.
Headline action-probe R² on LIBERO task-OOD test split
| Variant | params | test R² (action probe, H=1, mean of 3 seeds) |
|---|---|---|
| D4 dynamics (scratch, agent token) | 64M | −0.039 |
| D4 dynamics + aux λ=0.05 (agent) | 64M | −0.039 |
| D4 + DMControl pretrain + aux (agent) | 64M | −0.040 |
| D4 param-matched 276M + aux (agent) | 276M | −0.036 |
| V-JEPA 2 ViT-L + aux (paper main) | 304M | +0.845 |
| LAPA + aux (paper main) | 344M | +0.510 |
| LAPA frozen | 344M | +0.410 |
Recon (autoregressive rollout, 8-frame prefix → 24-frame post, PSNR averaged over span):
| Variant | rollout PSNR (LIBERO test_ood) |
|---|---|
| D4 dynamics (scratch) | 13.41 dB |
| D4 + aux | 13.41 dB |
| D4 + DMControl pretrain + aux | 10.28 dB |
| D4 param-matched 276M + aux | 13.05 dB |
Files
| Subdir | Description | Size |
|---|---|---|
tokenizer_v2_seq16/ |
Phase 2 tokenizer (LIBERO scratch, seq_len=16, mae_p_max=0.75, 85k steps). Reconstruction PSNR ≈ 25.18 dB. | 0.26GB |
dynamics_phase3/ |
Phase 3 dynamics (shortcut forcing + actions, 30k steps from scratch on LIBERO). | 0.51GB |
dynamics_aux_phase6/ |
Phase 6: Phase 3 + inverse-dynamics aux λ=0.05 (30k steps). aux_invdyn_mse 0.22 → 0.0004 during training, but probe still negative. | 0.52GB |
dynamics_aux_dmcontrol_pretrained/ |
DMControl pretrained (nicklashansen/dreamer4 HF) → LIBERO aux λ=0.05 finetune (20k steps). Tests transfer learning hypothesis. | 0.52GB |
dynamics_aux_param_matched_276M/ |
Param-matched D4: d_model_dyn=1024, dyn_depth=12 (276M total ≈ V-JEPA ViT-L 304M). + aux λ=0.05 from scratch on LIBERO, 20k steps. Tests scale hypothesis. | 3.06GB |
Each subdir contains:
ckpt_last.pt(orlatest.pt) — the saved state dictconfig.json— training args parsed from the ckpt'sargsfield
How to load (example: param-matched dynamics)
import sys, torch
sys.path.insert(0, "external_models/dreamer4/dreamer4") # clone nicklashansen/dreamer4 first
from model import Encoder, Decoder, Tokenizer, Dynamics, pack_bottleneck_to_spatial
# 1. tokenizer (Phase 2 v2)
tok_state = torch.load("tokenizer_v2_seq16/latest.pt", weights_only=False)
# (build Encoder + Decoder with tok_state["args"] hyperparams, then
# tok.load_state_dict(tok_state["model"]))
# 2. dynamics (param-matched)
dyn_state = torch.load("dynamics_aux_param_matched_276M/latest.pt", weights_only=False)
# (build Dynamics with d_model=1024, dyn_depth=12, dyn_state["args"], load dyn_state["dynamics"])
See src/comparison/extractors/dreamer4_extractor.py in
k1seul/dreamer-vla for a working
extractor that reconstructs both modules and exposes either spatial hidden
or agent tokens for probing.
Reproduction details
Pipeline
- Data:
scripts/preprocess_libero_for_dreamer4.pyconverts LIBEROagentview_rgbHDF5 episodes into nicklashansen's sharded format (104 train tasks, 4160 demos, 651k frames at 128×128). - Phase 2 tokenizer:
train_tokenizer.pyon 3-4 GPUs, 85k steps reached, PSNR ≈ 25.18 dB (within nicklashansen's DMControl 28 dB sanity range for the smaller-data regime). - Phase 3 dynamics:
train_dynamics.py --use_actions, 5 GPUs batch 4, 30k steps. Shortcut forcing loss converged 40× (0.20 → 0.005). - Phase 6 aux variant:
train_dynamics.py + dreamer-vla aux patchadds anInvDynAuxHeadon the dynamics transformer's spatial hidden state, λ=0.05. aux head saved separately to keep the eval ckpt aux-free. - DMControl-pretrained finetune: resume from nicklashansen's HF
tokenizer.pt+dynamics.pt(90k+40k steps on DMControl 30 tasks), then continue 20k steps on LIBERO with the aux patch. - Param-matched scale: re-train Phase 6 with
d_model_dyn=1024,dyn_depth=12→ 276M total (~V-JEPA ViT-L 304M scale).
Caveats / honest limits
- All training runs hit a final-step DDP barrier hang (reproducible on
the cluster; root cause not investigated). Final saved ckpt is one
save_everyinterval (5k or 10k) behind the requestedmax_steps. - nicklashansen's PyTorch impl is unofficial; "shortcut forcing" hyperparameters may differ from the paper. We use nicklashansen's defaults except where noted.
- This is a best-effort reproduction at LIBERO scale. The original paper's 2B-param Minecraft model is not reachable for us.
- All probe results are on a strict episode-disjoint protocol: probe-train = 400 train.json episodes, probe-eval-train = 200 OTHER train.json episodes, probe-eval-test = 200 test_ood.json episodes (27 train tasks / 7 OOD tasks held out at task level).
Reproduction quality bars (set in paper Section §x)
- ✓ Tokenizer PSNR ≥ 25 dB on LIBERO 128×128 (achieved 25.18 dB)
- ✗ Dynamics rollout PSNR ≥ our DIFF baseline 18 dB (achieved 13.4 dB)
- ✓ Loss curves smooth, no divergence
- ✓ aux head training functional (aux_invdyn_mse 0.22 → 0.0004 during training)
The rollout-PSNR miss is acknowledged. We document it instead of hiding it — it is consistent with the broader negative-result narrative.
Why this is published
The dreamer-vla paper argues that V-JEPA-style masked prediction pretraining is the dominant lever for action-relevant feature emergence, not architecture or aux loss design. Dreamer 4 — the natural baseline this paper references — is the strongest single counter-example: its shortcut-forcing objective, even with our auxiliary inverse-dynamics patch and scale match to V-JEPA, fails to surface positive R² on LIBERO task-OOD action probing.
We publish these ckpts so reviewers and follow-up work can independently verify the reproduction is functional (the aux head DOES learn during training, dropping 500×) and that the negative result is robust across ablations (scratch / DMControl pretrain / aux / no aux / 64M / 276M / spatial probe / agent probe). The bottleneck is fundamental, not a training-quality artifact we hid.
Citation
@misc{Hafner2025TrainingAgents,
title={Training Agents Inside of Scalable World Models},
author={Danijar Hafner and Wilson Yan and Timothy Lillicrap},
year={2025}, eprint={2509.24527}, archivePrefix={arXiv}
}
@misc{Hansen2026Dreamer4PyTorch,
title={Dreamer 4 in PyTorch}, author={Nicklas Hansen}, year={2026},
publisher={GitHub},
howpublished={\url{https://github.com/nicklashansen/dreamer4}}
}
(dreamer-vla paper citation will be added when on arXiv.)
Companion repos
- LIBERO Stage B (V-JEPA + aux, LAPA frozen, etc.): Scuttie/dreamer-vla-libero-vjepa2-aux
- LAPA + aux finetune: Scuttie/dreamer-vla-lapa-aux-libero