Camais03
/

camie-crafter

@@ -15,9 +15,9 @@ tags:
   - beta
 ---
-# Crafter World Model (Beta)
-## Update
 **Action sensitivity appears to be fixed in the current beta training setup, and the project has now moved into a beta phase.**
@@ -30,26 +30,25 @@ The main system is now reliably action-conditioned enough to expose publicly as
 ![Action Sensitivity Update](images/action_sensitivity_update.png)
-### Current rollout behaviour
 <img src="images/rollout_large.gif" alt="Current rollout behaviour" width="1024">
 ### Training progress
-![Training progress](images/wm_val_ode.png)
-### Action sensitivity
-![Action sensitivity](images/action_sensitivity_500k_full.png)
 ---
-## What this repository is
 This repository contains my current work on an **action-conditioned world model for Crafter**, forming the first phase of a broader research agenda around:
-- model-based reinforcement learning
-- imagination-based control
-- long-horizon planning
-- sparse-reward environments
-- scalable world models trained on consumer hardware
 This project is currently best understood as a **research prototype in beta**.
@@ -57,324 +56,323 @@ It already includes a usable tokenizer, a latent-space action-conditioned world
 ---
-## Current status
 The current system can:
-- compress Crafter observations into compact latent tokens
-- model future latent dynamics conditioned on actions
-- generate coherent multi-step rollouts
-- decode those rollouts back into plausible video
-- expose the model through an interactive imagined-game demo
 Earlier versions of this project could generate convincing futures without really following the supplied action sequence. That was the central bottleneck. The current setup appears to have resolved that issue well enough for public beta release.
 That said, the model is still not perfect. Remaining weaknesses include:
-- small-object confusion
-- some inventory and HUD detail errors
-- object-location drift
-- occasional mixing of similar sprites or structures
-- degradation over longer autoregressive rollouts
-- some remaining brittleness around rare states and rare transitions
 So this is a **serious, working beta research system**, not a final benchmarked product.
 ---
-## Hardware note
 A major goal of this project is to show that meaningful world-model research can be done on **consumer hardware**.
 This work was trained on **a single RTX 3090 (24 GB)**.
-The setup should also be feasible on a **3060-class GPU** with smaller microbatches and **gradient accumulation**, at the cost of training speed.
 ---
-## Project goal
 The immediate goal is to learn a world model that can:
-1. compress Crafter observations into useful latent tokens
-2. model future latent dynamics conditioned on actions
-3. produce multi-step rollouts that are both visually coherent and action-faithful
 The longer-term goal is to use these learned dynamics for:
-- planning
-- control
-- reinforcement learning in imagination
-- eventually more general agents that can reason over imagined futures
 ---
-## Relation to prior work
 This project is strongly inspired by recent scalable world-model work, especially the combination of:
-- causal or masked video tokenizers
-- latent-space dynamics models
-- action-conditioned rollout generation
-- evaluation through rollout quality and action sensitivity rather than reconstruction alone
 It is **inspired by Dreamer-4 style work**, but it is **not a full reproduction**, and it currently covers **only the world-modeling part of the pipeline**, not the later RL agent-training phase.
 It also draws from work on:
-- Crafter as a benchmark for sparse-reward, compositional environments
-- masked autoencoders as tokenizers for generative models
-- diffusion / shortcut-style training for latent dynamics
 ---
-## What is included
 This repository currently includes code and assets for:
-- **Crafter data collection**
-- **causal MAE tokenizer pretraining**
-- **latent world-model pretraining**
-- **evaluation and diagnostics**
-- **interactive imagination demo / web-app deployment path**
-- **exported checkpoints in PyTorch, safetensors, and ONNX formats**
 It also includes supporting outputs such as:
-- rollout visualisations
-- validation plots
-- action sensitivity plots
-- failure-mode examples
-- exported checkpoints under `checkpoints/`
 ---
-## Included checkpoints and exports
 The current exported files live under `checkpoints/` and include:
-- `mae_model.safetensors`
-- `mae_decode.onnx`
-- `world_model.safetensors`
-- `world_model_ema.safetensors`
-- `world_model.onnx`
 These are intended to support:
-- direct checkpoint download
-- lightweight inference experiments
-- Hugging Face Spaces deployment
-- future ONNX Runtime / browser / API-based demos
 ---
-## Interactive demo / app
 This repo includes a usable interactive demo path for testing the world model as an imagined game.
 The basic idea is:
-- start from a real context window from the Crafter dataset
-- choose an action
-- predict the next latent frame with the world model
-- decode it through the MAE decoder
-- feed the prediction back into context
-- continue rolling forward open-loop
 This is not the real Crafter environment running underneath. It is the **model’s imagined continuation** of the game.
 The Hugging Face Space version is intended to make this easy to test without needing to run the training code locally.
-### Intended controls
-- **Arrow keys / WASD**: movement
-- **Space**: interact / do
-- **Tab**: noop
-- **Shift**: sleep
-- **1–0**: place / craft actions
-- **R**: reset
-- **G**: save gif/json in the local demo version
 ---
-## Training data
 The current beta model was trained primarily on **Crafter human expert data**.
-That choice was deliberate. At this stage, I wanted to maximize the density of meaningful action-conditioned transitions rather than optimize for broad coverage from random-policy play.
-A later stage of the project will revisit broader or more mixed data collection, including more game-agnostic or random-policy style data, but the current release is mainly built around the human expert regime.
 ---
-## Data collection policy
 This repository also includes my current **Crafter data-collection policy code**, which was designed to improve action-conditioned learning by increasing the fraction of transitions where actions produce meaningful state changes.
 Key ideas include:
-- stuck detection through frame-change heuristics
-- forced interaction bursts when the agent appears stuck
-- periodic interleaving of interaction actions during movement
-- adaptive exploration behaviour
-- cleaner action classification logic to avoid action-name matching bugs
-- shard-based storage with episode metadata, gifs, and achievement events
-The motivation is simple: **if actions rarely produce visible consequences in the data, the world model has much less incentive to learn action-faithful dynamics**.
 ---
-## Main components
-## 1. Causal MAE tokenizer
 The tokenizer is a **causal masked autoencoder** trained on Crafter frame sequences.
 Main properties:
-- tube masking across frames
-- spatial self-attention within frames
-- periodic temporal causal attention
-- bottlenecked latent representation
-- MAE-style masked reconstruction objective
-- LPIPS-assisted reconstruction training
-- latent outputs intended for downstream world modeling, not just pretty decoding
 A major lesson from this project so far is that the tokenizer matters a lot more than it may seem at first.
 In particular:
-- **high masking turned out to be important**
-- lower masking can give cleaner-looking reconstructions while producing **worse downstream action sensitivity**
-- decoder quality alone is not a sufficient measure of whether the latent space is good for dynamics
 So although the decoder is mostly used for visualisation, the **latent space quality is still critical**, because the world model operates in that latent space.
 ---
-## 2. Latent world model
 The world model is trained in latent space using an action-conditioned architecture based around a DiT-style token backbone.
 Current ingredients include:
-- action-conditioned latent prediction
-- shortcut-forcing style training
-- bucketed context / prediction-length sampling
-- autoregressive rollout evaluation
-- action-sensitivity diagnostics
-- EMA checkpointing
-- validation across multiple `(context, prediction)` regimes
 The world model now produces:
-- coherent future rollouts
-- much better action sensitivity than earlier versions
-- usable imagined-game behaviour in open loop
 This is the main milestone that moved the project into beta.
 ---
-## 3. Diagnostics and evaluation
 I track progress with multiple diagnostics rather than relying on training loss alone.
 These include:
-- fixed-noise / denoising validation
-- ODE-style reconstruction validation
-- autoregressive rollout evaluation
-- action sensitivity evaluation
-- rollout gifs
-- failure-case inspection
-- multi-regime validation over several context/prediction bucket pairs
 This matters because a model can look good in one metric while still failing in the behaviour I actually care about.
 ---
-## 4. Latent-space analysis
 I am also investigating better ways to reason about what makes a **good latent space** for downstream world modeling.
 The current exploratory tooling includes:
-- **UMAP** visualisation of latent structure
-- **GMM** complexity analysis over latent features
-- checkpoint-to-checkpoint latent comparisons
 At the moment this remains exploratory. I do not yet think I fully understand how to interpret these plots in a way that is directly actionable for world-model training, but I think it is an important direction.
 There is space in this repo for that analysis to become much more systematic over time.
-### Example latent analysis
-![UMAP of latent space](images/umap.png)
-![GMM latent complexity](images/gmm_curve.png)
 ---
-## Current strengths
 The current beta model already shows several encouraging properties:
-- coherent latent rollouts
-- meaningful action conditioning
-- usable open-loop imagination
-- multi-step rollout generation
-- stable training on a single consumer GPU
-- a clear path to demo deployment through Hugging Face Spaces
 ---
-## Current failure modes
 The project is still very much an active research system, and several failure modes remain important.
-### World-model failure modes
 Typical world-model failures include:
-- staying too stationary in some cases
-- gradual object-position drift across rollout steps
-- small rare details disappearing
-- rare entities or tiles becoming blurry or unstable
-- longer-horizon compounding error
-### Tokenizer / decoder failure modes
-Typical tokenizer-related issues include:
-- inventory number mistakes
-- arrows or other small details being missed
-- confusion between furnaces, crafting tables, and similar sprites
-- imperfect preservation of object identity
-- occasional loss of fine HUD detail
-### Example failure cases
-![Failure mode example 1](images/failure_mode_1.png)
-![Failure mode example 2](images/failure_mode_2.png)
-### Action-space comparison
-![Bad vs improved action sensitivity](images/action_space_bad_vs_fixed.png)
 These examples are included deliberately. I do not want the repo to present only the successes. The failure modes are a major part of the research story.
 ---
-## Why the tokenizer matters so much
 One of the clearest takeaways from this work is that a tokenizer can look visually decent while still being a poor substrate for dynamics learning.
 A world model does not need the prettiest decoder output. It needs latents that preserve the distinctions required for:
-- causality
-- controllability
-- object identity
-- local change
-- action consequence
 That is why masking level, bottleneck structure, and latent organisation matter so much here.
@@ -382,15 +380,15 @@ My current view is that **a good world-model tokenizer is not just a compression
 ---
-## Repository state
 A few caveats up front:
-- the training code is still a bit messy
-- some scripts were written for active notebook-based iteration
-- local paths may need editing before reuse
-- there are still older comments, experimental branches, and rough edges
-- names and interfaces may change as the project is cleaned up
 I am still sharing it because the core technical direction is now clear and useful.
@@ -398,59 +396,59 @@ Cleaning up the code for a more polished release is one of the next major tasks.
 ---
-## Intended direction
 My aim is for this repository to become a strong base for other researchers who want to work on:
-- world models
-- latent dynamics
-- imagination-based planning
-- action-conditioned generative models
-- model-based RL on consumer hardware
 Over time I want this project to include:
-- cleaner training scripts
-- clearer explanations of each component
-- more structured ablations
-- better evaluation tools
-- fuller reproduction instructions
-- eventual downstream agent-training in imagination
 ---
-## Scope of this release
 This release should be understood as:
-- a **beta research release**
-- a **working action-conditioned world model**
-- a **portfolio / research artifact**
-- a **foundation for future planning and RL work**
 It should **not** be understood as:
-- a polished library
-- a final benchmark result
-- a full Dreamer-4 reproduction
-- a complete end-to-end agent-training system
 ---
-## If you want to explore the project
 Good places to start are:
-- the exported checkpoints under `checkpoints/`
-- the demo / app
-- the tokenizer training code
-- the world-model training code
-- the validation plots and rollout gifs
-- the failure-mode examples
 ---
-## Acknowledgements
 This project was strongly influenced by several pieces of prior work:
@@ -464,12 +462,12 @@ Any mistakes, implementation choices, and deviations from the referenced work ar
 ---
-## Citation
 If this repository is useful to your work, please cite the repository and the relevant upstream papers.
 ---
-## Status
 **Beta. Active research. Action sensitivity fixed in the current setup, with further training, testing, cleanup, and longer-horizon improvement still in progress.**

   - beta
 ---
+# Crafter World Model (Beta):
+## Update:
 **Action sensitivity appears to be fixed in the current beta training setup, and the project has now moved into a beta phase.**
 ![Action Sensitivity Update](images/action_sensitivity_update.png)
+### Current rollout behaviour:
 <img src="images/rollout_large.gif" alt="Current rollout behaviour" width="1024">
 ### Training progress
+![ODE progress](images/wm_val_ode.png)
+![Rollout progress](images/rollout_curves.png)
 ---
+## What this repository is:
 This repository contains my current work on an **action-conditioned world model for Crafter**, forming the first phase of a broader research agenda around:
+- model-based reinforcement learning.
+- imagination-based control.
+- long-horizon planning.
+- sparse-reward environments.
+- scalable world models trained on consumer hardware.
 This project is currently best understood as a **research prototype in beta**.
 ---
+## Current status:
 The current system can:
+- compress Crafter observations into compact latent tokens.
+- model future latent dynamics conditioned on actions.
+- generate coherent multi-step rollouts.
+- decode those rollouts back into plausible video.
+- expose the model through an interactive imagined-game demo.
 Earlier versions of this project could generate convincing futures without really following the supplied action sequence. That was the central bottleneck. The current setup appears to have resolved that issue well enough for public beta release.
 That said, the model is still not perfect. Remaining weaknesses include:
+- small-object confusion.
+- some inventory and HUD detail errors.
+- considerable object-location drift.
+- mixing of similar sprites or structures.
+- degradation over longer autoregressive rollouts (a rollout can sometimes end up stuck, surrounded by stone).
 So this is a **serious, working beta research system**, not a final benchmarked product.
 ---
+## Hardware note:
 A major goal of this project is to show that meaningful world-model research can be done on **consumer hardware**.
 This work was trained on **a single RTX 3090 (24 GB)**.
+The setup should also be feasible on **3060-class (12 GB) GPUs** with smaller microbatches, at the cost of training speed.
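
One common way to keep the effective batch size while using smaller microbatches is gradient accumulation. The toy sketch below only illustrates the arithmetic: averaged microbatch gradients match the full-batch gradient. The model, data, and function names are hypothetical and are not taken from the training scripts.

```python
# Toy illustration of gradient accumulation for a 1-D linear model y = w * x
# with a mean-squared-error loss. All names here are hypothetical.

def mse_grad(w, batch):
    """Gradient of mean((w*x - y)^2) with respect to w over one batch."""
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)

def accumulated_grad(w, batch, micro_size):
    """Average gradient built from equally sized microbatches."""
    assert len(batch) % micro_size == 0, "sketch assumes equal microbatches"
    n_micro = len(batch) // micro_size
    total = 0.0
    for i in range(n_micro):
        micro = batch[i * micro_size:(i + 1) * micro_size]
        # Scale each microbatch gradient so the sum matches the full-batch mean.
        total += mse_grad(w, micro) / n_micro
    return total

data = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9), (4.0, 8.2)]
full = mse_grad(0.5, data)
accum = accumulated_grad(0.5, data, micro_size=2)
```

In a real training loop the same idea means dividing each microbatch loss by the number of accumulation steps and calling the optimizer only after the last microbatch.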
 ---
+## Project goal:
 The immediate goal is to learn a world model that can:
+1. compress Crafter observations into useful latent tokens.
+2. model future latent dynamics conditioned on actions.
+3. produce multi-step rollouts that are both visually coherent and action-faithful.
 The longer-term goal is to use these learned dynamics for:
+- planning.
+- control.
+- reinforcement learning in imagination.
+- eventually more general agents that can reason over imagined futures.
 ---
+## Relation to prior work:
 This project is strongly inspired by recent scalable world-model work, especially the combination of:
+- causal or masked video tokenizers.
+- latent-space dynamics models.
+- action-conditioned rollout generation.
+- evaluation through rollout quality and action sensitivity rather than reconstruction alone.
 It is **inspired by Dreamer-4 style work**, but it is **not a full reproduction**, and it currently covers **only the world-modeling part of the pipeline**, not the later RL agent-training phase.
 It also draws from work on:
+- Crafter as a benchmark for sparse-reward, compositional environments.
+- masked autoencoders as tokenizers for generative models.
+- diffusion / shortcut-style training for latent dynamics.
 ---
+## What is included:
 This repository currently includes code and assets for:
+- **Crafter data collection**.
+- **causal MAE tokenizer pretraining**.
+- **latent world-model pretraining**.
+- **evaluation and diagnostics**.
+- **interactive imagination demo / web-app deployment path**.
+- **exported checkpoints in PyTorch, safetensors, and ONNX formats**.
 It also includes supporting outputs such as:
+- rollout visualisations.
+- validation plots.
+- action sensitivity plots.
+- failure-mode examples.
+- exported checkpoints under `checkpoints/`.
 ---
+## Included checkpoints and exports:
 The current exported files live under `checkpoints/` and include:
+- `mae_model.safetensors`.
+- `mae_decode.onnx`.
+- `world_model.safetensors`.
+- `world_model_ema.safetensors`.
+- `world_model.onnx`.
 These are intended to support:
+- direct checkpoint download.
+- lightweight inference experiments.
+- Hugging Face Spaces deployment.
+- future ONNX Runtime / browser / API-based demos.
 ---
+## Interactive demo / app:
 This repo includes a usable interactive demo path for testing the world model as an imagined game.
 The basic idea is:
+- start from a real context window from the Crafter dataset.
+- choose an action.
+- predict the next latent frame with the world model.
+- decode it through the MAE decoder.
+- feed the prediction back into context.
+- continue rolling forward open-loop.
 This is not the real Crafter environment running underneath. It is the **model’s imagined continuation** of the game.
 The Hugging Face Space version is intended to make this easy to test without needing to run the training code locally.
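
The loop described above can be sketched as follows. This is a minimal illustration, not the demo's actual code: `predict_next`, `decode`, and `context_len` are placeholder names, and the toy stand-ins at the bottom exist only so the sketch runs.

```python
from collections import deque

def imagine(context_latents, actions, predict_next, decode, context_len=8):
    """Open-loop rollout: each prediction is appended back into the context."""
    context = deque(context_latents, maxlen=context_len)
    frames = []
    for action in actions:
        next_latent = predict_next(list(context), action)  # world-model step
        frames.append(decode(next_latent))                 # decoder step
        context.append(next_latent)                        # feed prediction back
    return frames

# Toy stand-ins: latents are ints and an action simply shifts the last latent.
toy_predict = lambda ctx, action: ctx[-1] + action
toy_decode = lambda latent: f"frame<{latent}>"
frames = imagine([0, 0, 0], actions=[1, 1, -1],
                 predict_next=toy_predict, decode=toy_decode)
```

Because the context is a bounded queue, the oldest real frames drop out as predictions are fed back in, which is one reason errors can compound over long rollouts.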
+### Intended controls:
+- **Arrow keys / WASD**: movement.
+- **Space**: interact / do.
+- **Tab**: noop.
+- **Shift**: sleep.
+- **1–0**: place / craft actions.
+- **R**: reset.
+- **G**: save gif/json in the local demo version.
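
A key-to-action table for these controls might look like the sketch below. The action names follow Crafter's 17-action list, but the key identifiers and the order in which the digit keys map to place / craft actions are assumptions made for illustration. **R** and **G** are demo commands rather than game actions, so they are left out.

```python
# Action names in the order Crafter lists them. The key bindings below are
# an illustrative assumption, not the demo's confirmed mapping.
ACTIONS = [
    "noop", "move_left", "move_right", "move_up", "move_down", "do", "sleep",
    "place_stone", "place_table", "place_furnace", "place_plant",
    "make_wood_pickaxe", "make_stone_pickaxe", "make_iron_pickaxe",
    "make_wood_sword", "make_stone_sword", "make_iron_sword",
]

KEY_TO_ACTION = {
    "ArrowLeft": "move_left", "a": "move_left",
    "ArrowRight": "move_right", "d": "move_right",
    "ArrowUp": "move_up", "w": "move_up",
    "ArrowDown": "move_down", "s": "move_down",
    "Space": "do", "Tab": "noop", "Shift": "sleep",
}
# Digit keys 1..9 then 0 cover the ten place / craft actions in list order.
for key, name in zip("1234567890", ACTIONS[7:]):
    KEY_TO_ACTION[key] = name

def action_index(key):
    """Return the integer action id for a key, defaulting to noop."""
    return ACTIONS.index(KEY_TO_ACTION.get(key, "noop"))
```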
 ---
+## Training data:
 The current beta model was trained primarily on **Crafter human expert data**.
+That choice was deliberate. At this stage, I wanted to maximize the density of meaningful action-conditioned transitions. The full plan is to make this setup fairly game-agnostic: start from random policies, train in imagination, gather data with the improved policy, and retrain, forming a feedback loop.
+A later stage of the project will revisit broader or more mixed data collection, including more game-agnostic or random-policy style data, but the current release is mainly built around the human expert regime. The dataset is only 100 episodes of human expert gameplay, so it is not a huge dataset by any means.
 ---
+## Data collection policy:
 This repository also includes my current **Crafter data-collection policy code**, which was designed to improve action-conditioned learning by increasing the fraction of transitions where actions produce meaningful state changes.
 Key ideas include:
+- stuck detection through frame-change heuristics.
+- forced interaction bursts when the agent appears stuck.
+- periodic interleaving of interaction actions during movement.
+- adaptive exploration behaviour.
+- cleaner action classification logic to avoid action-name matching bugs.
+- shard-based storage with episode metadata, gifs, and achievement events.
+This was all mainly to improve achievement coverage while remaining a game-agnostic loop (explore, try to interact, try to craft). It gets 10/22 achievements on average, with some shards reaching 14.
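
The stuck-detection and forced-interaction ideas from the list above could be sketched roughly as follows. The threshold, patience value, class name, and flat-list frame format are all assumptions for illustration; the repository's collection policy may differ in its details.

```python
class StuckDetector:
    """Flags the agent as stuck when frames barely change for several steps."""

    def __init__(self, threshold=1.0, patience=3):
        self.threshold = threshold  # mean absolute pixel change treated as "no change"
        self.patience = patience    # consecutive quiet steps before declaring stuck
        self.prev = None
        self.quiet_steps = 0

    def update(self, frame):
        """frame: flat sequence of pixel intensities. Returns True if stuck."""
        if self.prev is not None:
            diff = sum(abs(a - b) for a, b in zip(frame, self.prev)) / len(frame)
            self.quiet_steps = self.quiet_steps + 1 if diff < self.threshold else 0
        self.prev = list(frame)
        return self.quiet_steps >= self.patience

def choose_action(stuck, explore_action, burst_action="do"):
    """Force an interaction while stuck, otherwise keep exploring."""
    return burst_action if stuck else explore_action

detector = StuckDetector(threshold=1.0, patience=3)
frames = [[0, 0, 0, 0], [10, 0, 0, 0], [10, 0, 0, 0], [10, 0, 0, 0], [10, 0, 0, 0]]
flags = [detector.update(f) for f in frames]
```

Here the agent is flagged only after three consecutive near-identical frames, at which point `choose_action` swaps the exploration move for an interaction.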
 ---
+## Main components:
+## 1. Causal MAE tokenizer:
 The tokenizer is a **causal masked autoencoder** trained on Crafter frame sequences.
 Main properties:
+- independent masking across frames (tube masking with higher masking ratios is a possible future experiment).
+- spatial self-attention within frames.
+- periodic temporal causal attention.
+- bottlenecked latent representation.
+- MAE-style masked reconstruction objective.
+- LPIPS-assisted reconstruction training.
+- latent outputs intended for downstream world modeling, not just pretty decoding.
 A major lesson from this project so far is that the tokenizer matters a lot more than it may seem at first.
 In particular:
+- **high masking turned out to be important**.
+- lower masking can give cleaner-looking reconstructions while producing **worse downstream action sensitivity**.
+- decoder quality alone is not a sufficient measure of whether the latent space is good for dynamics.
 So although the decoder is mostly used for visualisation, the **latent space quality is still critical**, because the world model operates in that latent space.
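
The two masking schemes mentioned in the properties list differ only in whether the masked patch set is resampled for each frame. A minimal sketch, with hypothetical patch counts and ratio:

```python
import random

def independent_masks(num_frames, num_patches, ratio, rng):
    """Sample a fresh set of masked patch indices for every frame."""
    k = int(num_patches * ratio)
    return [set(rng.sample(range(num_patches), k)) for _ in range(num_frames)]

def tube_masks(num_frames, num_patches, ratio, rng):
    """Sample one set of masked patch indices and reuse it for every frame."""
    k = int(num_patches * ratio)
    tube = set(rng.sample(range(num_patches), k))
    return [set(tube) for _ in range(num_frames)]

rng = random.Random(0)
ind = independent_masks(num_frames=4, num_patches=64, ratio=0.75, rng=rng)
tube = tube_masks(num_frames=4, num_patches=64, ratio=0.75, rng=rng)
```

With tube masking a masked location stays hidden in every frame, so the model cannot recover it by copying from a neighbouring frame. That is one reason it is often paired with higher masking ratios.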
 ---
+## 2. Latent world model:
 The world model is trained in latent space using an action-conditioned architecture based around a DiT-style token backbone.
 Current ingredients include:
+- action-conditioned latent prediction.
+- shortcut-forcing style training.
+- bucketed context / prediction-length sampling.
+- autoregressive rollout evaluation.
+- action-sensitivity diagnostics.
+- EMA checkpointing.
+- validation across multiple `(context, prediction)` regimes.
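
One plausible reading of bucketed context / prediction-length sampling is sketched below. The bucket values and the function name are hypothetical, since the actual buckets are not stated here.

```python
import random

# Hypothetical (context_len, prediction_len) buckets; the real values are not stated.
BUCKETS = [(4, 1), (8, 4), (16, 8)]

def sample_window(episode_len, rng, buckets=BUCKETS):
    """Pick a bucket that fits the episode, then a random start index."""
    fitting = [(c, p) for c, p in buckets if c + p <= episode_len]
    if not fitting:
        raise ValueError("episode shorter than every bucket")
    context_len, pred_len = rng.choice(fitting)
    start = rng.randrange(episode_len - context_len - pred_len + 1)
    return start, context_len, pred_len

rng = random.Random(0)
start, context_len, pred_len = sample_window(episode_len=30, rng=rng)
```

Drawing from a small fixed set of shapes keeps batches uniform in length while still exposing the model to several context and prediction regimes.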
 The world model now produces:
+- coherent future rollouts.
+- much better action sensitivity than earlier versions.
+- usable imagined-game behaviour in open loop.
 This is the main milestone that moved the project into beta.
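
EMA checkpointing keeps a slowly moving average of the weights alongside the live ones. The update rule can be sketched on a plain dictionary of floats. The decay value and names here are illustrative only.

```python
def ema_update(ema_params, params, decay=0.999):
    """In-place exponential moving average of parameters (dict of floats)."""
    for name, value in params.items():
        ema_params[name] = decay * ema_params[name] + (1.0 - decay) * value
    return ema_params

params = {"w": 0.0}
ema = dict(params)
for step in range(3):
    params["w"] += 1.0                  # stand-in for an optimizer step
    ema_update(ema, params, decay=0.5)  # small decay so the effect is visible
```

The averaged copy lags behind the live weights. An export such as `world_model_ema.safetensors` would typically hold this smoothed copy.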
 ---
+## 3. Diagnostics and evaluation:
 I track progress with multiple diagnostics rather than relying on training loss alone.
 These include:
+- fixed-noise / denoising validation.
+- ODE-style reconstruction validation.
+- autoregressive rollout evaluation.
+- action sensitivity evaluation.
+- rollout gifs.
+- failure-case inspection.
+- multi-regime validation over several context/prediction bucket pairs.
 This matters because a model can look good in one metric while still failing in the behaviour I actually care about.
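
As an illustration of what an action-sensitivity check measures, the sketch below rolls a model forward from the same start under two different action sequences and reports the mean gap between the resulting states. The scalar states, toy models, and distance measure are assumptions made for illustration, not the repository's metric.

```python
def rollout(start, actions, step):
    """Apply a one-step model repeatedly and return the visited states."""
    states, state = [], start
    for action in actions:
        state = step(state, action)
        states.append(state)
    return states

def action_sensitivity(start, actions, alt_actions, step):
    """Mean per-step gap between rollouts under two action sequences."""
    true_states = rollout(start, actions, step)
    alt_states = rollout(start, alt_actions, step)
    return sum(abs(a - b) for a, b in zip(true_states, alt_states)) / len(actions)

# Toy models: one uses the action, one ignores it.
faithful = lambda state, action: state + action
ignoring = lambda state, action: state + 1

actions, alt_actions = [1, 1, 1, 1], [0, 0, 0, 0]
score_faithful = action_sensitivity(0, actions, alt_actions, faithful)
score_ignoring = action_sensitivity(0, actions, alt_actions, ignoring)
```

A model that ignores its actions scores zero regardless of how plausible its rollouts look, which is the failure that earlier versions of this project showed.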
 ---
+## 4. Latent-space analysis:
 I am also investigating better ways to reason about what makes a **good latent space** for downstream world modeling.
 The current exploratory tooling includes:
+- **UMAP** visualisation of latent structure.
+- **GMM** complexity analysis over latent features.
+- checkpoint-to-checkpoint latent comparisons.
 At the moment this remains exploratory. I do not yet think I fully understand how to interpret these plots in a way that is directly actionable for world-model training, but I think it is an important direction.
 There is space in this repo for that analysis to become much more systematic over time.
+Below are example UMAPs of latent spaces known to have good and bad action sensitivity. I am not sure these are the best tools for probing such spaces, though:
+### Example latent analysis:
+![UMAP of good latent space](images/umap_good.png)
+![UMAP of bad latent space](images/umap_good.png)
+![UMAP of good latent space with examples](images/umap_good_examples.png)
+I need to properly go over these analyses. I think the GMM is currently set up incorrectly, and the UMAPs aren't really comparable, as they're on different splits.
 ---
+## Current strengths:
 The current beta model already shows several encouraging properties:
+- coherent latent rollouts.
+- meaningful action conditioning.
+- usable open-loop imagination.
+- multi-step rollout generation.
+- stable training on a single consumer GPU.
+- a clear path to demo deployment through Hugging Face Spaces.
 ---
+## Current failure modes:
 The project is still very much an active research system, and several failure modes remain important.
+### Failure modes:
 Typical world-model failures include:
+- Snapping away from chosen direction.
+- Object-position drift across rollout steps.
+- Weak memory of space (can lock you in a wall of stone).
+- Rare details disappearing (a tokenizer issue; not sure whether it's the encoder or the decoder).
+- NPCs and tiles becoming blurry or unstable and/or swapping places (moving sand/stone or coal).
+- Confusion between furnaces, crafting tables, and similar sprites.
+- Imperfect preservation of object identity.
+- Occasional loss of fine HUD detail.
+### Example failure cases:
+![Failure mode arrows and npc swap](images/arrows_npc_swap_issues.png)
+![Failure mode enemy, furnace and tree blurring](images/enemy_furnace_tree_issues.png)
+![Failure mode furnace and crafting issues](images/furnace_crafting_issues.png)
+![Failure mode action following and object position](images/action_following_and_object_position.gif)
 These examples are included deliberately. I do not want the repo to present only the successes. The failure modes are a major part of the research story.
 ---
+## Why the tokenizer matters so much:
 One of the clearest takeaways from this work is that a tokenizer can look visually decent while still being a poor substrate for dynamics learning.
 A world model does not need the prettiest decoder output. It needs latents that preserve the distinctions required for:
+- causality.
+- controllability.
+- object identity.
+- local change.
+- action consequence.
 That is why masking level, bottleneck structure, and latent organisation matter so much here.
 ---
+## Repository state:
 A few caveats up front:
+- the training code is still a bit messy.
+- some scripts were written for active notebook-based iteration.
+- local paths may need editing before reuse.
+- there are still older comments, experimental branches, and rough edges.
+- names and interfaces may change as the project is cleaned up.
 I am still sharing it because the core technical direction is now clear and useful.
 ---
+## Intended direction:
 My aim is for this repository to become a strong base for other researchers who want to work on:
+- world models.
+- latent dynamics.
+- imagination-based planning.
+- action-conditioned generative models.
+- model-based RL on consumer hardware.
 Over time I want this project to include:
+- cleaner training scripts.
+- clearer explanations of each component.
+- more structured ablations.
+- better evaluation tools.
+- fuller reproduction instructions.
+- eventual downstream agent-training in imagination.
 ---
+## Scope of this release:
 This release should be understood as:
+- a **beta research release**.
+- a **working action-conditioned world model**.
+- a **portfolio / research artifact**.
+- a **foundation for future planning and RL work**.
 It should **not** be understood as:
+- a polished library.
+- a final benchmark result.
+- a full Dreamer-4 reproduction.
+- a complete end-to-end agent-training system.
 ---
+## If you want to explore the project:
 Good places to start are:
+- the exported checkpoints under `checkpoints/`.
+- the demo / app.
+- the tokenizer training code.
+- the world-model training code.
+- the validation plots and rollout gifs.
+- the failure-mode examples.
 ---
+## Acknowledgements:
 This project was strongly influenced by several pieces of prior work:
 ---
+## Citation:
 If this repository is useful to your work, please cite the repository and the relevant upstream papers.
 ---
+## Status:
 **Beta. Active research. Action sensitivity fixed in the current setup, with further training, testing, cleanup, and longer-horizon improvement still in progress.**