# Research Pipeline: Model Exploitation in World-Model RL on Hard-Exploration Tasks

> **Working title:** *"When the World Model Lies: Reward Exploitation in DreamerV3 under Sparse Feedback"*
> Last updated: 2026-03-02

---

## 1. Research Question & Hypothesis

**Core question:** Can a world-model agent (DreamerV3) exploit an inaccurate learned world model to achieve high *imagined* rewards while failing at the actual task, and can we measure, characterize, and mitigate this gap?

**Hypothesis:**
- In sparse-reward, hard-exploration environments (AntMaze), the RSSM world model receives near-zero reward signal from real experience.
- The policy therefore optimizes reward *inside* the model's imagination, where the model has large uncertainty.
- This produces a measurable **exploitation gap**: `imag_reward_mean >> env_reward_mean`, while `eval_return ≈ 0`.
- The gap is correlated with low KL divergence (`kl`): the model becomes overconfident and the policy exploits its blind spots.

**Baseline comparison:** THICK (Gumbsch et al., ICLR 2024) uses temporal hierarchies and context kernels on top of DreamerV2. Does the hierarchical structure reduce or amplify model exploitation?

---

## 2. Environments

### 2.1 AntMaze (Primary)

| Variant | Gym ID | Reward | Step budget |
|---|---|---|---|
| Medium Diverse | `AntMaze_Medium_Diverse_GR-v5` | Sparse {0, 1} | 500k (DreamerV3), 1M (THICK) |
| Large Diverse | `AntMaze_Large_Diverse_GR-v5` | Sparse {0, 1} | 2M (THICK) |
| Medium Dense | `AntMaze_MediumDense-v5` | Dense | optional ablation |

**Observation space:** 107-dim proprioceptive vector = 105-dim ant state + 2-dim desired goal
**Action space:** 8-dim continuous (ant joint torques), clipped to [-1, 1]
**Episode length:** 1000 steps
**Success criterion:** ant reaches goal within 0.5m (binary, end of episode)

**Why AntMaze for exploitation research:**
- Sparse reward → world model sees `reward=0` for almost all experience → learns a "flat" reward landscape → imagination can assign arbitrary rewards to unseen states
- Large state space → model uncertainty is high far from visited states
- Clear ground truth (success rate) separates real vs. imagined performance

### 2.2 MiniHack KeyCorridor (THICK baseline only)

- Env: `MiniHack-KeyCorridor-S6-R3-v0`
- Sparse reward: +1 on reaching goal, 0 otherwise
- Partial observability (21×21 egocentric grid, 3-channel)
- Step budget: 1M

### 2.3 Crafter (THICK baseline only)

- Task: `crafter_reward` (technology-tree exploration)
- Dense shaped reward: sum of achievements unlocked
- Step budget: 1M

---

## 3. Methods

### 3.1 DreamerV3-torch (Proposed Analysis)

**Repository:** `D:/reaserch/dreamerv3-torch/` (fork of NM512/dreamerv3-torch)
**Framework:** PyTorch 2.5.1 + CUDA 12.1

#### Architecture

```
Encoder (CNN/MLP)
    │
    ▼
RSSM (Recurrent State Space Model)
    ├── GRU deterministic state h_t        (deter=512 by default; 256 in the antmaze config)
    └── Categorical stochastic state z_t   (stoch=32 × discrete=32)
    │
    ├── Reward head MLP        → predicted reward r̂_t
    ├── Continuation head MLP  → γ̂_t (continuation probability, i.e. 1 − done)
    └── Decoder (CNN/MLP)      → reconstructed obs
    │
    ▼
Actor-Critic (operates in imagination)
    ├── Actor:  tanh-normal policy π(a|h,z)
    └── Critic: symlog-discrete value V(h,z)
```

**Key hyperparameters (antmaze config):**

| Parameter | Value | Note |
|---|---|---|
| `dyn_hidden` | 256 | reduced for 3GB VRAM |
| `dyn_deter` | 256 | GRU hidden size |
| `dyn_stoch` | 32 | categorical dimensions |
| `dyn_discrete` | 32 | categories per dim |
| `batch_size` | 16 | |
| `batch_length` | 50 | sequence length |
| `imagination_horizon` | 15 | actor-critic rollout |
| `kl_free` | 1.0 | KL free bits |
| `rep_scale` | 0.1 | representation loss weight |
| `dyn_scale` | 0.5 | dynamics loss weight |
| `actor.lr` | 3e-5 | |
| `critic.lr` | 3e-5 | |

**Modified files:**

| File | Change |
|---|---|
| `envs/antmaze.py` | New: gymnasium-robotics wrapper → dreamerv3 obs dict |
| `dreamer.py` | Added `antmaze` suite to `make_env()`; MUJOCO_GL fix |
| `models.py` | Added `env_reward_mean/std` metrics to `WorldModel.loss()` |
| `configs.yaml` | Added `antmaze` profile |

---

### 3.2 THICK (Baseline)

**Paper:** Gumbsch et al., *"Learning Hierarchical World Models with Adaptive Temporal Abstractions from Discrete Latent Dynamics"*, ICLR 2024 (THICK stands for Temporal Hierarchies from Invariant Context Kernels)
**Repository:** `D:/reaserch/THICK/` (CognitiveModeling/THICK)
**Framework:** TensorFlow 2.14.1 + CUDA 12

#### Architecture (on top of DreamerV2)

```
Encoder (CNN/MLP)
    │
    ▼
Context Network (GateL0RD RNN)
    └── produces sparse context vector c_t   (context=16)
    │
    ├── RSSM (standard DreamerV2)
    │     └── conditioned on c_t
    │
    └── THICK hierarchical policy
          ├── High-level action a_hl every N steps (learned N)
          └── Low-level actor conditions on a_hl
```

**Key differences from DreamerV3:**
- Based on DreamerV2 (not V3): no symlog predictions, and none of V3's free-bits split into separately weighted dynamics and representation losses (DreamerV2 uses KL balancing instead)
- Context kernel produces segment boundaries (when to update high-level action)
- Designed for tasks requiring temporal abstraction (key-then-door, multi-step navigation)

**Modified files for AntMaze:**

| File | Change |
|---|---|
| `thick/common/envs.py` | Added `AntMaze` class |
| `thick/common/env_register.py` | Added `antmaze` suite routing |
| `thick/configs.yaml` | Added `thick_antmaze_medium`, `thick_antmaze_large`, `thick_crafter` |

---

## 4. Metrics

### 4.1 Model Exploitation Metrics (DreamerV3, main contribution)

These are logged to TensorBoard every `log_every=10k` steps.

| Metric | TensorBoard tag | Description |
|---|---|---|
| **Exploitation gap** | computed | `imag_reward_mean - env_reward_mean` (primary indicator) |
| `imag_reward_mean` | `imag_reward_mean` | Mean reward during actor's imagination rollouts (world model's internal belief) |
| `env_reward_mean` | `env_reward_mean` | Mean actual reward in the replay buffer (ground truth) |
| `env_reward_std` | `env_reward_std` | Std of actual rewards (stays ≈ 0 during early sparse training) |
| `kl` | `kl` | KL divergence posterior‖prior; low KL → overconfident model → exploitation |
| `reward_loss` | `reward_loss` | World model reward head loss (symlog MSE) |
| `eval_return` | `eval_return` | Actual evaluation return (expected ≈ 0 on AntMaze sparse throughout) |

**Expected pattern showing exploitation (hypothesized, illustrative values):**

```
Steps:              0      100k   200k   300k   400k   500k
imag_reward_mean:   0.0    0.1    0.4    0.8    1.2    1.5     ← rises
env_reward_mean:    0.0    0.0    0.01   0.01   0.02   0.03    ← stays near 0
gap:                0.0    0.1    0.39   0.79   1.18   1.47    ← widens
kl:                 1.0    0.8    0.5    0.3    0.2    0.15    ← drops
eval_return:        0.0    0.0    0.0    0.01   0.01   0.02    ← stays near 0
```

### 4.2 Baseline Performance Metrics (THICK)

| Metric | TensorBoard tag | Environments / definition |
|---|---|---|
| Success rate | `log_success` | AntMaze, MiniHack |
| Episode reward | `log_reward` | Crafter |
| **Success@budget** | computed | Mean success over last 10% of steps |
| **Steps→50%** | computed | Steps until rolling-mean success ≥ 0.5 |

**Aggregation:** mean ± std over 5 seeds (seeds 0–4)

### 4.3 Paper Table Format

```
| Method        | AntMaze-Med     | AntMaze-Lg      | MiniHack-KC     | Crafter         |
|               | succ@1M         | succ@2M         | succ@1M         | reward@1M       |
|---------------|-----------------|-----------------|-----------------|-----------------|
| THICK         | 0.xxx ± 0.xxx   | 0.xxx ± 0.xxx   | 0.xxx ± 0.xxx   | xx.x ± xx.x     |
| DreamerV3     | 0.xxx ± 0.xxx   | n/a             | n/a             | n/a             |
| + our method  | 0.xxx ± 0.xxx   | 0.xxx ± 0.xxx   | n/a             | n/a             |
```

---

## 5. Training Pipelines

### 5.1 DreamerV3: AntMaze

```bash
# Activate environment
conda activate dreamer
cd D:/reaserch/dreamerv3-torch

# Single run (500k steps, seed 0)
PYTHONUNBUFFERED=1 python -u dreamer.py \
    --configs antmaze \
    --task antmaze_medium-play \
    --logdir D:/reaserch/logdir/antmaze_medium/seed0 \
    --seed 0

# Multi-seed loop (5 seeds)
for SEED in 0 1 2 3 4; do
    PYTHONUNBUFFERED=1 python -u dreamer.py \
        --configs antmaze \
        --task antmaze_medium-play \
        --logdir D:/reaserch/logdir/antmaze_medium/seed${SEED} \
        --seed ${SEED}
done
```

**Using the convenience script:**

```bash
bash D:/reaserch/run_antmaze.sh
```

### 5.2 THICK: All Baselines

```bash
conda activate dreamer  # same env, TF + torch both installed
cd D:/reaserch/THICK

# Sequential (one GPU, ~5 days total)
bash run_baseline.sh \
    /c/Users/user/anaconda3/envs/dreamer/python.exe \
    D:/reaserch/logdir/thick_baseline \
    "0 1 2 3 4"

# Or parallel (requires multiple GPUs; be careful with VRAM):
bash run_baseline_parallel.sh
```

**Single experiment:**

```bash
python -u thick/train.py \
    --configs thick_antmaze_medium \
    --task antmaze_medium-diverse \
    --seed 0 \
    --steps 1000000 \
    --logdir D:/reaserch/logdir/thick_baseline/thick_antmaze_medium/seed0 \
    --use_wandb False
```

### 5.3 Monitoring: TensorBoard

```bash
conda run -n dreamer tensorboard \
    --logdir D:/reaserch/logdir/ \
    --port 6006
# Open: http://localhost:6006
```

**Key panels to watch:**
- `train/imag_reward_mean` vs `train/env_reward_mean`: exploitation gap
- `train/kl`: model confidence
- `eval/eval_return`: actual task performance
- `train/reward_loss`: world model accuracy

### 5.4 Evaluation: Extract Metrics Table

```bash
cd D:/reaserch/THICK
python eval_baseline.py \
    --logdir D:/reaserch/logdir/thick_baseline \
    --out D:/reaserch/results/thick_baseline_table.md
```

Output: Markdown table with `success@budget (mean±std)` and `steps→50% (mean±std)`.

---

## 6. Repository Structure

```
D:/reaserch/
├── dreamerv3-torch/              # Main method (DreamerV3 + exploitation analysis)
│   ├── dreamer.py                # Entry point: training loop, make_env()
│   ├── models.py                 # WorldModel, ImagBehavior, exploitation metrics
│   ├── networks.py               # RSSM, MultiEncoder, MLP layers
│   ├── configs.yaml              # Hyperparameter profiles (incl. antmaze)
│   ├── envs/
│   │   ├── antmaze.py            # *** AntMaze wrapper (our addition) ***
│   │   ├── dmc.py                # DeepMind Control Suite
│   │   └── wrappers.py           # TimeLimit, NormalizeAction, etc.
│   └── run_antmaze.sh            # Convenience launcher
│
├── THICK/                        # Baseline (THICK, ICLR 2024)
│   ├── thick/
│   │   ├── train.py              # Entry point
│   │   ├── agent.py              # THICK agent (DreamerV2 + context)
│   │   ├── configs.yaml          # Profiles (incl. antmaze, crafter)
│   │   └── common/
│   │       ├── envs.py           # *** AntMaze class (our addition) ***
│   │       └── env_register.py   # *** antmaze suite routing (our addition) ***
│   ├── run_baseline.sh           # Sequential 5-seed runner
│   ├── run_baseline_parallel.sh  # Parallel variant
│   ├── eval_baseline.py          # Metrics aggregation → paper table
│   └── requirements_cuda12.txt   # TF 2.14 + CUDA 12 deps
│
├── logdir/                       # All TensorBoard logs (gitignored)
│   ├── antmaze_medium/seed{0-4}/
│   └── thick_baseline/
│       ├── thick_antmaze_medium/seed{0-4}/
│       ├── thick_antmaze_large/seed{0-4}/
│       ├── thick_minihack/seed{0-4}/
│       └── thick_crafter/seed{0-4}/
│
├── results/                      # Paper-ready outputs
│   └── thick_baseline_table.md
│
└── RESEARCH_PIPELINE.md          # This file
```

---

## 7. Software Stack

### Conda Environment: `dreamer`

| Package | Version | Role |
|---|---|---|
| Python | 3.10 | |
| PyTorch | 2.5.1+cu121 | DreamerV3 training |
| TensorFlow | 2.14.1 (and-cuda) | THICK training |
| CUDA | 12.9 (driver) / 12.1 (torch) | GPU compute |
| MuJoCo | 3.5.0 | Physics simulation |
| gymnasium | 1.2.3 | Env API |
| gymnasium-robotics | 1.4.2 | AntMaze environments |
| gym | 0.22.0 | Legacy API (THICK compat) |
| numpy | 1.23.5 | |
| tensorboard | latest | Logging + monitoring |

### Installation

```bash
# Create env
conda create -n dreamer python=3.10
conda activate dreamer

# PyTorch (CUDA 12.1 wheels)
pip install torch==2.5.1+cu121 torchvision --index-url https://download.pytorch.org/whl/cu121

# MuJoCo + environments
pip install mujoco==3.5.0
pip install gymnasium==1.2.3 gymnasium-robotics==1.4.2
pip install gym==0.22.0

# TensorFlow for THICK (CUDA 12 compatible)
pip install -r D:/reaserch/THICK/requirements_cuda12.txt

# Other
pip install tensorboard ruamel.yaml opencv-python einops
```

### Hardware

| Resource | Spec |
|---|---|
| GPU | NVIDIA GTX 1050, 3GB VRAM |
| OS | Windows 10 Pro (MINGW64 / Git Bash) |
| CUDA driver | 12.9 |

**VRAM budget:**
- DreamerV3 antmaze config: ~2.4GB (reduced `dyn_hidden=256`, `batch_size=16`)
- THICK antmaze: ~2.6GB (default THICK nets, batch=16, length=50)

---

## 8. Experimental Protocol

### 8.1 Seeds & Reproducibility

- All experiments: **5 seeds** (0, 1, 2, 3, 4)
- Seed controls: Python `random`, NumPy, PyTorch/TF global seeds, gymnasium env seed
- Set via `--seed N` in both codebases

### 8.2 Evaluation Protocol

**DreamerV3:**
- Eval every `eval_every=10k` environment steps
- `eval_episode_num=10` episodes per eval
- Metric: mean episode return across 10 eval episodes

**THICK:**
- Eval every `eval_every=1e5` steps
- `eval_eps=1` (single eval episode per checkpoint)
- Success logged as `log_success` in TensorBoard

### 8.3 Data Collection for Paper

```bash
# Step 1: Run all THICK baselines
bash D:/reaserch/THICK/run_baseline.sh \
    /c/Users/user/anaconda3/envs/dreamer/python.exe \
    D:/reaserch/logdir/thick_baseline

# Step 2: Run DreamerV3 on AntMaze (5 seeds)
bash D:/reaserch/run_antmaze.sh

# Step 3: Generate THICK results table
python D:/reaserch/THICK/eval_baseline.py \
    --logdir D:/reaserch/logdir/thick_baseline \
    --out D:/reaserch/results/thick_baseline_table.md

# Step 4: Plot exploitation gap curves
# (TensorBoard → export CSV → matplotlib)
tensorboard --logdir D:/reaserch/logdir/antmaze_medium --port 6006
# Navigate to imag_reward_mean, env_reward_mean → download CSV
```

---

## 9. Key Findings (Running Notes)

> Update this section as experiments complete.

### Expected / Hypothesized

- **AntMaze sparse** → exploitation gap emerges within 100–200k steps
- **KL drop** precedes exploitation gap growth by ~50k steps (leading indicator)
- **THICK** may show smaller gap due to context gating (sparse context updates = less overfitting)
- **Dense AntMaze** (ablation) → no exploitation gap (reward signal rich enough to constrain model)

### Observed

*(Fill in after runs complete)*

---

## 10. Paper Outline (Draft)

1. **Introduction**: world model agents + sparse reward = uncharted exploitation territory
2. **Background**: DreamerV3 RSSM, imagination rollouts, KL regularization
3. **Exploitation Gap Formalization**: define gap, theoretical conditions for it to grow
4. **Experimental Setup**: AntMaze variants, THICK baseline, metrics
5. **Results**
   - 5a. Exploitation gap dynamics over training
   - 5b. KL as leading indicator
   - 5c. THICK vs DreamerV3 gap comparison
   - 5d. Dense vs sparse reward ablation
6. **Analysis**: which model components drive exploitation
7. **Mitigation**: (if applicable) e.g. KL annealing, reward uncertainty bonus
8. **Conclusion**

---

## 11. Gotchas & Known Issues

| Issue | Cause | Fix |
|---|---|---|
| `MUJOCO_GL` crash on Windows | MuJoCo tries EGL/GLX | Set `MUJOCO_GL=osmesa` in `dreamer.py` |
| TF 2.4 CUDA incompatible | Original THICK reqs | Use `requirements_cuda12.txt` (TF 2.14) |
| `conda run -n dreamer python -c "..."` fails | Multiline args in Git Bash | Write commands to `.py` file, run that |
| `eval_return≈0` throughout | AntMaze sparse, hard exploration | Expected; track `imag_reward_mean` instead |
| AntMaze v5 obs dim = 107 | v5 changed from 29-dim (v4) | Verified: 105 ant + 2 goal = 107 |
| Low GPU util during THICK | TF graph compilation on first batch | Normal; wait ~5 min for warmup |
| `log_success` not found in TB | THICK uses `log_keys_sum` | Check `log_every` config; may need `log_keys_sum: 'log_success'` |
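
For reference, the three "computed" metrics from Sections 4.1 and 4.2 (exploitation gap, Success@budget, Steps→50%) could be derived from exported TensorBoard scalars roughly as follows. This is a minimal sketch under stated assumptions, not the logic in `eval_baseline.py`: the function names, the rolling-window size, and the sample values are all illustrative, and the input lists stand in for one CSV column each.

```python
from statistics import mean


def exploitation_gap(imag_reward_mean, env_reward_mean):
    """Per-checkpoint gap: imagined mean reward minus replay-buffer mean reward."""
    return [i - e for i, e in zip(imag_reward_mean, env_reward_mean)]


def success_at_budget(steps, success, budget, tail_frac=0.10):
    """Mean success over the last `tail_frac` of the step budget."""
    cutoff = budget * (1.0 - tail_frac)
    tail = [s for t, s in zip(steps, success) if cutoff <= t <= budget]
    return mean(tail) if tail else float("nan")


def steps_to_threshold(steps, success, threshold=0.5, window=5):
    """First step at which the rolling-mean success reaches `threshold`.

    Only full windows are considered. The window size is an assumption;
    the document does not specify one. Returns None if never reached.
    """
    for k in range(window - 1, len(success)):
        if mean(success[k - window + 1:k + 1]) >= threshold:
            return steps[k]
    return None


if __name__ == "__main__":
    # Hypothesized values from the "expected pattern" in Section 4.1.
    imag = [0.0, 0.1, 0.4, 0.8, 1.2, 1.5]
    env = [0.0, 0.0, 0.01, 0.01, 0.02, 0.03]
    print([round(g, 2) for g in exploitation_gap(imag, env)])

    # Made-up success log: one value per 100k steps.
    steps = [100_000 * k for k in range(1, 11)]
    success = [0, 0, 0, 0, 0, 1, 0, 1, 1, 1]
    print(success_at_budget(steps, success, budget=1_000_000))
    print(steps_to_threshold(steps, success, window=3))
```

Requiring a full window in `steps_to_threshold` avoids reporting an early hit from a single lucky episode; with THICK's `eval_eps=1`, a wider window may be worth considering for the same reason.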