Dreamer 4 reproduction on LIBERO — negative-result archive

This repository archives a faithful reproduction of Hafner et al. 2025, "Training Agents Inside of Scalable World Models" (arXiv:2509.24527), on the LIBERO 130-task benchmark. It serves as a cross-architecture baseline in the dreamer-vla paper.

The reproduction is built on nicklashansen/dreamer4 (an unofficial PyTorch implementation; the paper's official code has not been released). We trained five variants spanning the architecture, initialization, aux-loss, and scale axes. All variants produce negative R² on the LIBERO task-OOD action probe, indicating that the Dreamer 4 shortcut-forcing + tokenizer-reconstruction objective does not surface action-relevant features at LIBERO scale, regardless of pretraining or scale.

This is published as a public reference companion to the paper's main negative finding.

Headline action-probe R² on LIBERO task-OOD test split

| Variant | Params | Test R² (action probe, H=1, mean of 3 seeds) |
|---|---|---|
| D4 dynamics (scratch, agent token) | 64M | −0.039 |
| D4 dynamics + aux λ=0.05 (agent) | 64M | −0.039 |
| D4 + DMControl pretrain + aux (agent) | 64M | −0.040 |
| D4 param-matched 276M + aux (agent) | 276M | −0.036 |
| V-JEPA 2 ViT-L + aux (paper main) | 304M | +0.845 |
| LAPA + aux (paper main) | 344M | +0.510 |
| LAPA frozen | 344M | +0.410 |
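
For readers wondering how a probe can score below zero: R² compares the probe's squared error with that of always predicting the mean target, so a negative value means the probe does worse than that constant baseline. Below is a minimal sketch of the metric in plain Python, assuming a single scalar action dimension. The function name `r2_score` and the toy numbers are illustrative and are not taken from the paper's evaluation code.

```python
def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Toy targets (hypothetical values, not LIBERO actions).
actions = [0.0, 1.0, 2.0, 3.0]

perfect = r2_score(actions, actions)             # exact predictions
constant = r2_score(actions, [1.5] * 4)          # always predict the mean
anti = r2_score(actions, [3.0, 2.0, 1.0, 0.0])   # anti-correlated predictions
```

Here `perfect` is 1, `constant` is 0, and `anti` is negative. Slightly negative values such as those reported for the D4 variants therefore sit just below the predict-the-mean baseline.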

Reconstruction quality (autoregressive rollout from an 8-frame prefix over the following 24 frames; PSNR averaged over the span):

| Variant | Rollout PSNR (LIBERO test_ood) |
|---|---|
| D4 dynamics (scratch) | 13.41 dB |
| D4 + aux | 13.41 dB |
| D4 + DMControl pretrain + aux | 10.28 dB |
| D4 param-matched 276M + aux | 13.05 dB |
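
As a reference for how a span-averaged PSNR can be computed, here is a minimal sketch in plain Python. It assumes pixel values in [0, 1] and averages per-frame PSNR over the predicted frames. Whether the archived numbers average per-frame PSNR or pool the MSE first is not stated here, so treat that choice, the function names, and the toy frames as assumptions.

```python
import math


def psnr(pred, target, max_val=1.0):
    """PSNR in dB between two equally sized flat pixel lists."""
    mse = sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)
    if mse == 0:
        return float("inf")  # identical frames
    return 10.0 * math.log10(max_val ** 2 / mse)


def rollout_psnr(pred_frames, target_frames, max_val=1.0):
    """Mean per-frame PSNR over a predicted span."""
    scores = [psnr(p, t, max_val) for p, t in zip(pred_frames, target_frames)]
    return sum(scores) / len(scores)


# Two hypothetical 4-pixel frames, each off by a constant 0.1 (MSE of about 0.01).
target = [[0.5, 0.5, 0.5, 0.5], [0.2, 0.4, 0.6, 0.8]]
pred = [[0.6, 0.6, 0.6, 0.6], [0.3, 0.5, 0.7, 0.9]]
span_psnr = rollout_psnr(pred, target)
```

With a per-pixel error of 0.1 on a unit range, `span_psnr` comes out at about 20 dB, which gives a sense of scale for the 10 to 13 dB rollouts in the table.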

Files

| Subdir | Description | Size |
|---|---|---|
| tokenizer_v2_seq16/ | Phase 2 tokenizer (LIBERO scratch, seq_len=16, mae_p_max=0.75, 85k steps). Reconstruction PSNR ≈ 25.18 dB. | 0.26GB |
| dynamics_phase3/ | Phase 3 dynamics (shortcut forcing + actions, 30k steps from scratch on LIBERO). | 0.51GB |
| dynamics_aux_phase6/ | Phase 6: Phase 3 + inverse-dynamics aux λ=0.05 (30k steps). aux_invdyn_mse 0.22 → 0.0004 during training, but probe still negative. | 0.52GB |
| dynamics_aux_dmcontrol_pretrained/ | DMControl pretrained (nicklashansen/dreamer4 HF) → LIBERO aux λ=0.05 finetune (20k steps). Tests the transfer-learning hypothesis. | 0.52GB |
| dynamics_aux_param_matched_276M/ | Param-matched D4: d_model_dyn=1024, dyn_depth=12 (276M total ≈ V-JEPA ViT-L 304M), + aux λ=0.05 from scratch on LIBERO, 20k steps. Tests the scale hypothesis. | 3.06GB |

Each subdir contains:

  • ckpt_last.pt (or latest.pt) — the saved state dict
  • config.json — training args parsed from the ckpt's args field
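
A quick way to inspect a subdir before building any model is to locate its checkpoint file and read config.json. The helper below is a hypothetical convenience (`describe_subdir` is not part of this archive or of nicklashansen/dreamer4). It assumes only the two file names listed above, and it makes no assumption about which keys config.json contains.

```python
import json
from pathlib import Path


def describe_subdir(subdir):
    """Return (checkpoint path or None, config dict) for one archive subdir."""
    subdir = Path(subdir)
    # The checkpoint is stored under one of two names, depending on the run.
    ckpt = next(
        (subdir / name for name in ("ckpt_last.pt", "latest.pt")
         if (subdir / name).exists()),
        None,
    )
    config = json.loads((subdir / "config.json").read_text())
    return ckpt, config
```

For example, `ckpt, cfg = describe_subdir("dynamics_aux_param_matched_276M")` returns the checkpoint path and the training args as a dict, which can then be printed or compared across variants.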

How to load (example: param-matched dynamics)

import sys
import torch

# Clone nicklashansen/dreamer4 first so that its model.py is importable.
sys.path.insert(0, "external_models/dreamer4/dreamer4")
from model import Encoder, Decoder, Tokenizer, Dynamics, pack_bottleneck_to_spatial

# 1. Tokenizer (Phase 2 v2).
# weights_only=False unpickles arbitrary objects, so only load checkpoints you trust.
tok_state = torch.load("tokenizer_v2_seq16/latest.pt", map_location="cpu", weights_only=False)
# Build Encoder + Decoder with the hyperparameters in tok_state["args"], then:
#   tok.load_state_dict(tok_state["model"])

# 2. Dynamics (param-matched).
dyn_state = torch.load("dynamics_aux_param_matched_276M/latest.pt", map_location="cpu", weights_only=False)
# Build Dynamics with d_model=1024, dyn_depth=12 and the remaining dyn_state["args"],
# then load the weights from dyn_state["dynamics"].

See src/comparison/extractors/dreamer4_extractor.py in k1seul/dreamer-vla for a working extractor that reconstructs both modules and exposes either spatial hidden or agent tokens for probing.

Reproduction details

Pipeline

  1. Data: scripts/preprocess_libero_for_dreamer4.py converts LIBERO agentview_rgb HDF5 episodes into nicklashansen's sharded format (104 train tasks, 4160 demos, 651k frames at 128×128).
  2. Phase 2 tokenizer: train_tokenizer.py on 3-4 GPUs; reached 85k steps with reconstruction PSNR ≈ 25.18 dB (below nicklashansen's ≈28 dB DMControl reference, which we treat as within the sanity range for the smaller-data regime).
  3. Phase 3 dynamics: train_dynamics.py --use_actions on 5 GPUs with batch size 4, 30k steps. The shortcut-forcing loss dropped 40× (0.20 → 0.005).
  4. Phase 6 aux variant: train_dynamics.py plus the dreamer-vla aux patch, which adds an InvDynAuxHead on the dynamics transformer's spatial hidden state with λ=0.05. The aux head is saved separately so that the eval ckpt stays aux-free.
  5. DMControl-pretrained finetune: resume from nicklashansen's HF tokenizer.pt + dynamics.pt (90k + 40k steps on 30 DMControl tasks), then continue for 20k steps on LIBERO with the aux patch.
  6. Param-matched scale: re-train Phase 6 with d_model_dyn=1024, dyn_depth=12 → 276M total (~V-JEPA ViT-L 304M scale).
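
Step 4 can be summarized as a weighted sum of two losses. The form below is a sketch inferred from the description above (an inverse-dynamics head trained with an MSE loss at weight λ=0.05). The symbols h_t for the spatial hidden state and g_φ for the aux head are introduced here for illustration, as is the choice to feed the head consecutive hidden states. None of these are confirmed details of the patch.

```latex
\mathcal{L}_{\text{total}}
  = \mathcal{L}_{\text{shortcut}}
  + \lambda \, \mathcal{L}_{\text{invdyn}},
\qquad
\mathcal{L}_{\text{invdyn}}
  = \frac{1}{T-1} \sum_{t=1}^{T-1}
    \bigl\lVert g_\phi(h_t, h_{t+1}) - a_t \bigr\rVert_2^2,
\qquad
\lambda = 0.05
```

Under this reading, aux_invdyn_mse corresponds to the second term, which is why it can fall sharply during training while the separately evaluated action probe stays negative.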

Caveats / honest limits

  • All training runs hit a final-step DDP barrier hang (reproducible on the cluster; root cause not investigated). Final saved ckpt is one save_every interval (5k or 10k) behind the requested max_steps.
  • nicklashansen's PyTorch impl is unofficial; "shortcut forcing" hyperparameters may differ from the paper. We use nicklashansen's defaults except where noted.
  • This is a best-effort reproduction at LIBERO scale. The original paper's 2B-param Minecraft model is not reachable for us.
  • All probe results are on a strict episode-disjoint protocol: probe-train = 400 train.json episodes, probe-eval-train = 200 OTHER train.json episodes, probe-eval-test = 200 test_ood.json episodes (27 train tasks / 7 OOD tasks held out at task level).
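
The episode-disjoint protocol in the last caveat can be sketched as follows. This is an illustration only: the episode IDs are placeholders for the entries of train.json and test_ood.json, and seeded random sampling is an assumption about how episodes might be drawn, not a description of the actual selection procedure.

```python
import random


def split_probe_episodes(train_ids, ood_ids, n_fit=400, n_eval=200, seed=0):
    """Sample disjoint probe-train / probe-eval-train / probe-eval-test ID lists."""
    rng = random.Random(seed)
    # Draw both train-side sets from one sample so they cannot overlap.
    train_pool = rng.sample(sorted(train_ids), n_fit + n_eval)
    probe_train = train_pool[:n_fit]
    probe_eval_train = train_pool[n_fit:]
    probe_eval_test = rng.sample(sorted(ood_ids), n_eval)
    return probe_train, probe_eval_train, probe_eval_test


# Hypothetical episode IDs standing in for train.json / test_ood.json entries.
train_ids = [f"train_{i:04d}" for i in range(1000)]
ood_ids = [f"ood_{i:04d}" for i in range(300)]
fit, eval_train, eval_test = split_probe_episodes(train_ids, ood_ids)
```

Drawing the 400 + 200 train-side episodes from a single sample without replacement guarantees that the two sets are disjoint, and the test set is disjoint from both because it comes from a separate pool.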

Reproduction quality bars (set in paper Section §x)

  • ✓ Tokenizer PSNR ≥ 25 dB on LIBERO 128×128 (achieved 25.18 dB)
  • ✗ Dynamics rollout PSNR ≥ 18 dB, the level of our DIFF baseline (achieved 13.41 dB)
  • ✓ Loss curves smooth, no divergence
  • ✓ aux head training functional (aux_invdyn_mse 0.22 → 0.0004 during training)

We acknowledge the rollout-PSNR miss and document it rather than hide it; it is consistent with the broader negative-result narrative.

Why this is published

The dreamer-vla paper argues that V-JEPA-style masked-prediction pretraining, not architecture or aux-loss design, is the dominant lever for the emergence of action-relevant features. Dreamer 4, the natural baseline the paper references, is the strongest single contrast case: its shortcut-forcing objective, even with our auxiliary inverse-dynamics patch and a parameter count matched to V-JEPA, fails to reach positive R² on LIBERO task-OOD action probing.

We publish these ckpts so that reviewers and follow-up work can independently verify that the reproduction is functional (the aux head does learn during training: aux_invdyn_mse drops roughly 550×, from 0.22 to 0.0004) and that the negative result is robust across ablations (scratch / DMControl pretrain / aux / no aux / 64M / 276M / spatial probe / agent probe). We therefore attribute the bottleneck to the objective itself rather than to a training-quality artifact.

Citation

@misc{Hafner2025TrainingAgents,
  title={Training Agents Inside of Scalable World Models},
  author={Danijar Hafner and Wilson Yan and Timothy Lillicrap},
  year={2025}, eprint={2509.24527}, archivePrefix={arXiv}
}
@misc{Hansen2026Dreamer4PyTorch,
  title={Dreamer 4 in PyTorch}, author={Nicklas Hansen}, year={2026},
  publisher={GitHub},
  howpublished={\url{https://github.com/nicklashansen/dreamer4}}
}

(dreamer-vla paper citation will be added when on arXiv.)

Companion repos
