# Research Pipeline: Model Exploitation in World-Model RL on Hard-Exploration Tasks

> **Working title:** *"When the World Model Lies: Reward Exploitation in DreamerV3 under Sparse Feedback"*
> Last updated: 2026-03-02

---

## 1. Research Question & Hypothesis

**Core question:** Can a world-model agent (DreamerV3) exploit an inaccurate learned world model to achieve high *imagined* rewards while failing at the actual task — and can we measure, characterize, and mitigate this gap?

**Hypothesis:**
- In sparse-reward, hard-exploration environments (AntMaze), the RSSM world model receives near-zero reward signal from real experience.
- The policy therefore optimizes reward *inside* the model's imagination, where the model has large uncertainty.
- This produces a measurable **exploitation gap**: `imag_reward_mean >> env_reward_mean`, while `eval_return ≈ 0`.
- The gap is expected to correlate with low KL divergence (logged as `kl`): the model becomes overconfident and the policy exploits its blind spots.

**Baseline comparison:** THICK (Gumbsch et al., ICLR 2024) uses temporal hierarchies and context kernels on top of DreamerV2 — does the hierarchical structure reduce or amplify model exploitation?

---

## 2. Environments

### 2.1 AntMaze (Primary)

| Variant | Gym ID | Reward | Steps budget |
|---|---|---|---|
| Medium Diverse | `AntMaze_Medium_Diverse_GR-v5` | Sparse {0, 1} | 500k (DreamerV3), 1M (THICK) |
| Large Diverse  | `AntMaze_Large_Diverse_GR-v5`  | Sparse {0, 1} | 2M (THICK) |
| Medium Dense   | `AntMaze_MediumDense-v5`       | Dense          | optional ablation |

- **Observation space:** 107-dim proprioceptive vector = 105-dim ant state + 2-dim desired goal
- **Action space:** 8-dim continuous (ant joint torques), clipped to [-1, 1]
- **Episode length:** 1000 steps
- **Success criterion:** ant reaches goal within 0.5 m (binary, end of episode)
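
A minimal sketch of how a wrapper such as `envs/antmaze.py` might flatten the dictionary observation into the vector described above. The dict keys `observation`, `achieved_goal`, and `desired_goal` follow the Gymnasium-Robotics goal-environment convention; the helper names `flatten_obs` and `is_success` are hypothetical, and the 0.5 threshold is simply the value quoted in this section.

```python
import math
from typing import Dict, List, Sequence


def flatten_obs(obs: Dict[str, Sequence[float]]) -> List[float]:
    """Concatenate the ant state and the desired goal into one flat vector."""
    return list(obs["observation"]) + list(obs["desired_goal"])


def is_success(achieved: Sequence[float], desired: Sequence[float],
               threshold: float = 0.5) -> bool:
    """Binary success: Euclidean distance to the goal is within `threshold`."""
    return math.dist(achieved, desired) <= threshold


if __name__ == "__main__":
    # Dummy observation with the dimensions quoted above (105 + 2).
    obs = {
        "observation": [0.0] * 105,
        "achieved_goal": [0.0, 0.0],
        "desired_goal": [0.3, 0.3],
    }
    print(len(flatten_obs(obs)))
    print(is_success(obs["achieved_goal"], obs["desired_goal"]))
```

The achieved goal is left out of the flat vector here because the section lists only the ant state and the desired goal; whether the real wrapper does the same is an assumption.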

**Why AntMaze for exploitation research:**
- Sparse reward → world model sees `reward=0` for almost all experience → learns a "flat" reward landscape → imagination can assign arbitrary rewards to unseen states
- Large state space → model uncertainty is high far from visited states
- Clear ground truth (success rate) separates real vs. imagined performance

### 2.2 MiniHack KeyCorridor (THICK baseline only)

- Env: `MiniHack-KeyCorridor-S6-R3-v0`
- Sparse reward: +1 on reaching goal, 0 otherwise
- Partial observability (21×21 egocentric grid, 3-channel)
- Steps budget: 1M

### 2.3 Crafter (THICK baseline only)

- Task: `crafter_reward` (technology-tree exploration)
- Dense shaped reward: sum of achievements unlocked
- Steps budget: 1M

---

## 3. Methods

### 3.1 DreamerV3-torch (Proposed Analysis)

- **Repository:** `D:/reaserch/dreamerv3-torch/` (fork of NM512/dreamerv3-torch)
- **Framework:** PyTorch 2.5.1 + CUDA 12.1

#### Architecture

```
Encoder (CNN/MLP)
    │
    ▼
RSSM (Recurrent State Space Model)
 ├── GRU deterministic state h_t  (deter=512 by default; 256 in the antmaze config)
 └── Categorical stochastic state z_t  (stoch=32 × discrete=32)
    │
    ├── Reward head MLP  →  predicted reward r̂_t
    ├── Continuation head MLP  →  γ̂_t (continuation probability, i.e. 1 − done)
    └── Decoder (CNN/MLP)  →  reconstructed obs
    │
    ▼
Actor-Critic (operates in imagination)
 ├── Actor: tanh-normal policy π(a|h,z)
 └── Critic: symlog-discrete value V(h,z)
```
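
To make the state sizes concrete, the arithmetic below computes the width of the feature vector that the heads and the actor-critic would consume. It assumes, as in the DreamerV3 design, that the flattened categorical state is concatenated with the deterministic state. The function name `feature_dim` is hypothetical; the two calls use `deter=512` from the diagram and `dyn_deter=256` from the antmaze hyperparameter table in this section.

```python
def feature_dim(stoch: int, discrete: int, deter: int) -> int:
    """Width of concat(flattened one-hot z_t, h_t)."""
    return stoch * discrete + deter


print(feature_dim(32, 32, 512))  # 1536
print(feature_dim(32, 32, 256))  # 1280 with the reduced antmaze setting
```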

**Key hyperparameters (antmaze config):**

| Parameter | Value | Note |
|---|---|---|
| `dyn_hidden` | 256 | reduced for 3GB VRAM |
| `dyn_deter` | 256 | GRU hidden size, reduced from the 512 default |
| `dyn_stoch` | 32 | categorical dimensions |
| `dyn_discrete` | 32 | categories per dim |
| `batch_size` | 16 | |
| `batch_length` | 50 | sequence length |
| `imagination_horizon` | 15 | actor-critic rollout |
| `kl_free` | 1.0 | KL free bits |
| `rep_scale` | 0.1 | representation loss weight |
| `dyn_scale` | 0.5 | dynamics loss weight |
| `actor.lr` | 3e-5 | |
| `critic.lr` | 3e-5 | |
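
The sketch below shows how `kl_free`, `dyn_scale`, and `rep_scale` could combine into the KL part of the world-model loss. It is a plain-Python illustration under stated assumptions, not the code in `models.py`: the KL is summed over the stochastic dimensions, clipped from below at `kl_free`, and the dynamics and representation terms share the same forward value (they differ only in where the stop-gradient sits). Function names are hypothetical.

```python
import math
from typing import Sequence


def categorical_kl(post: Sequence[float], prior: Sequence[float]) -> float:
    """KL(post || prior) for one categorical variable, in nats.

    Assumes the prior has full support (no zero probabilities).
    """
    return sum(p * math.log(p / q) for p, q in zip(post, prior) if p > 0.0)


def kl_objective(kl_per_dim: Sequence[float], kl_free: float = 1.0,
                 dyn_scale: float = 0.5, rep_scale: float = 0.1) -> float:
    """Sum KL over the stochastic dims, clip from below at `kl_free`, then weight."""
    kl = sum(kl_per_dim)
    clipped = max(kl, kl_free)
    # Forward values of the dynamics and representation terms coincide;
    # only the gradient paths differ.
    return dyn_scale * clipped + rep_scale * clipped


if __name__ == "__main__":
    peaked = [0.9] + [0.1 / 31] * 31   # confident posterior over 32 classes
    uniform = [1.0 / 32] * 32          # uninformative prior
    per_dim = [categorical_kl(peaked, uniform)] * 32
    print(kl_objective(per_dim))
```

Under this clipping, a summed KL below `kl_free` contributes a constant to the loss, so this term gives the optimizer no gradient in that regime. That is worth keeping in mind when interpreting a logged `kl` that falls below 1.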

**Modified files:**

| File | Change |
|---|---|
| `envs/antmaze.py` | New: gymnasium-robotics wrapper → dreamerv3 obs dict |
| `dreamer.py` | Added `antmaze` suite to `make_env()`; MUJOCO_GL fix |
| `models.py` | Added `env_reward_mean/std` metrics to `WorldModel.loss()` |
| `configs.yaml` | Added `antmaze` profile |

---

### 3.2 THICK (Baseline)

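- **Paper:** Gumbsch et al., *"Learning Hierarchical World Models with Adaptive Temporal Abstractions from Discrete Latent Dynamics"*, ICLR 2024 (THICK: Temporal Hierarchies from Invariant Context Kernels)
- **Repository:** `D:/reaserch/THICK/` (CognitiveModeling/THICK)
- **Framework:** TensorFlow 2.14.1 + CUDA 12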

#### Architecture (on top of DreamerV2)

```
Encoder (CNN/MLP)
    │
    ▼
Context Network (GateL0RD RNN)
 └── produces sparse context vector c_t  (context=16)
    │
    ├── RSSM (standard DreamerV2)
    │    └── conditioned on c_t
    │
    └── THICK hierarchical policy
         ├── High-level action a_hl every N steps (learned N)
         └── Low-level actor conditions on a_hl
```

**Key differences from DreamerV3:**
- Based on DreamerV2 (not V3): no symlog transforms; KL balancing as in DreamerV2 rather than V3's free-bits `dyn`/`rep` split
- Context kernel produces segment boundaries (when to update high-level action)
- Designed for tasks requiring temporal abstraction (key-then-door, multi-step navigation)

**Modified files for AntMaze:**

| File | Change |
|---|---|
| `thick/common/envs.py` | Added `AntMaze` class |
| `thick/common/env_register.py` | Added `antmaze` suite routing |
| `thick/configs.yaml` | Added `thick_antmaze_medium`, `thick_antmaze_large`, `thick_crafter` |

---

## 4. Metrics

### 4.1 Model Exploitation Metrics (DreamerV3 — main contribution)

These are logged to TensorBoard every `log_every=10k` steps.

| Metric | TensorBoard tag | Description |
|---|---|---|
| **Exploitation gap** | computed | `imag_reward_mean - env_reward_mean` — primary indicator |
| `imag_reward_mean` | `imag_reward_mean` | Mean reward during actor's imagination rollouts (world model's internal belief) |
| `env_reward_mean` | `env_reward_mean` | Mean actual reward in the replay buffer (ground truth) |
| `env_reward_std` | `env_reward_std` | Std of actual rewards (stays ≈ 0 during early sparse training) |
| `kl` | `kl` | KL divergence posterior‖prior; low KL → overconfident model → exploitation |
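| `reward_loss` | `reward_loss` | World model reward head loss (symlog two-hot cross-entropy with the default `symlog_disc` head; symlog MSE only if the profile sets that distribution) |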
| `eval_return` | `eval_return` | Actual evaluation return (expected ≈ 0 on AntMaze sparse throughout) |
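
Since the exploitation gap is a computed quantity rather than a logged tag, here is a minimal sketch of how it could be derived from the two logged series. The function name `exploitation_gap` and the dict-of-steps input format are assumptions made for illustration.

```python
from typing import Dict, List, Tuple


def exploitation_gap(imag: Dict[int, float],
                     env: Dict[int, float]) -> List[Tuple[int, float]]:
    """Return (step, imag - env) for every step logged in both series."""
    shared = sorted(set(imag) & set(env))
    return [(step, imag[step] - env[step]) for step in shared]


if __name__ == "__main__":
    imag_reward_mean = {0: 0.0, 100_000: 0.1, 200_000: 0.4}
    env_reward_mean = {0: 0.0, 100_000: 0.0, 300_000: 0.01}
    # Only steps present in both series are compared.
    print(exploitation_gap(imag_reward_mean, env_reward_mean))
```

Restricting the result to shared steps avoids interpolating between log points, at the cost of dropping steps where only one metric was written.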

**Expected pattern showing exploitation (illustrative values, not measurements):**
```
Steps:              0    100k   200k   300k   400k   500k
imag_reward_mean:  0.0   0.1    0.4    0.8    1.2    1.5   ← rises
env_reward_mean:   0.0   0.0    0.01   0.01   0.02   0.03  ← stays near 0
gap:               0.0   0.1    0.39   0.79   1.18   1.47  ← widens
kl:                1.0   0.8    0.5    0.3    0.2    0.15  ← drops
eval_return:       0.0   0.0    0.0    0.01   0.01   0.02  ← stays near 0
```

### 4.2 Baseline Performance Metrics (THICK)

| Metric | TensorBoard tag | Environments / definition |
|---|---|---|
| Success rate | `log_success` | AntMaze, MiniHack |
| Episode reward | `log_reward` | Crafter |
| **Success@budget** | computed | Mean success over the last 10% of training steps |
| **Steps→50%** | computed | Environment steps until the rolling-mean success first reaches ≥ 0.5 |

**Aggregation:** mean ± std over 5 seeds (seeds 0–4)
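
The two computed metrics and the seed aggregation could be sketched as follows. This is an illustration, not the logic of `eval_baseline.py`: the rolling window size, the exclusive start of the tail interval, and the use of the sample standard deviation are all assumptions, and every function name is hypothetical.

```python
import statistics
from typing import Optional, Sequence, Tuple

Curve = Sequence[Tuple[int, float]]  # (env step, success in [0, 1]), sorted by step


def success_at_budget(curve: Curve, budget: int, tail: float = 0.1) -> float:
    """Mean success over evaluations in the last `tail` fraction of the budget.

    Raises statistics.StatisticsError if no evaluation falls in that interval.
    """
    start = budget - round(budget * tail)
    values = [s for step, s in curve if start < step <= budget]
    return statistics.fmean(values)


def steps_to_threshold(curve: Curve, window: int = 3,
                       threshold: float = 0.5) -> Optional[int]:
    """First step at which the rolling mean over `window` evals reaches `threshold`."""
    for i in range(window - 1, len(curve)):
        recent = [s for _, s in curve[i - window + 1:i + 1]]
        if statistics.fmean(recent) >= threshold:
            return curve[i][0]
    return None  # threshold never reached within the logged curve


def mean_std(values: Sequence[float]) -> Tuple[float, float]:
    """Mean and sample standard deviation across seeds."""
    return statistics.fmean(values), statistics.stdev(values)


if __name__ == "__main__":
    curve = [(s * 100_000, 1.0 if s >= 3 else 0.0) for s in range(1, 11)]
    print(success_at_budget(curve, budget=1_000_000))
    print(steps_to_threshold(curve))
    print(mean_std([0.0, 1.0, 1.0, 0.0, 1.0]))
```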

### 4.3 Paper Table Format

```
| Method        | AntMaze-Med     | AntMaze-Lg      | MiniHack-KC     | Crafter         |
|               | succ@1M         | succ@2M         | succ@1M         | reward@1M       |
|---------------|-----------------|-----------------|-----------------|-----------------|
| THICK         | 0.xxx ± 0.xxx   | 0.xxx ± 0.xxx   | 0.xxx ± 0.xxx   | xx.x ± xx.x     |
| DreamerV3     | 0.xxx ± 0.xxx   | —               | —               | —               |
| + our method  | 0.xxx ± 0.xxx   | 0.xxx ± 0.xxx   | —               | —               |
```

---

## 5. Training Pipelines

### 5.1 DreamerV3 — AntMaze

```bash
# Activate environment
conda activate dreamer
cd D:/reaserch/dreamerv3-torch

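# Single run (500k steps, seed 0)
# The task suffix must be one that envs/antmaze.py maps to the Medium Diverse variant
# of Section 2.1 (the THICK commands use `antmaze_medium-diverse`).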
PYTHONUNBUFFERED=1 python -u dreamer.py \
  --configs antmaze \
  --task antmaze_medium-play \
  --logdir D:/reaserch/logdir/antmaze_medium/seed0 \
  --seed 0

# Multi-seed loop (5 seeds)
for SEED in 0 1 2 3 4; do
  PYTHONUNBUFFERED=1 python -u dreamer.py \
    --configs antmaze \
    --task antmaze_medium-play \
    --logdir D:/reaserch/logdir/antmaze_medium/seed${SEED} \
    --seed ${SEED}
done
```

**Using the convenience script:**
```bash
bash D:/reaserch/run_antmaze.sh
```

### 5.2 THICK — All Baselines

```bash
conda activate dreamer   # same env, TF + torch both installed
cd D:/reaserch/THICK

# Sequential (one GPU, ~5 days total)
bash run_baseline.sh \
  /c/Users/user/anaconda3/envs/dreamer/python.exe \
  D:/reaserch/logdir/thick_baseline \
  "0 1 2 3 4"

# Or parallel (requires multiple GPUs / be careful with VRAM):
bash run_baseline_parallel.sh
```

**Single experiment:**
```bash
python -u thick/train.py \
  --configs thick_antmaze_medium \
  --task antmaze_medium-diverse \
  --seed 0 \
  --steps 1000000 \
  --logdir D:/reaserch/logdir/thick_baseline/thick_antmaze_medium/seed0 \
  --use_wandb False
```

### 5.3 Monitoring — TensorBoard

```bash
conda run -n dreamer tensorboard \
  --logdir D:/reaserch/logdir/ \
  --port 6006

# Open: http://localhost:6006
```

**Key panels to watch:**
- `train/imag_reward_mean` vs `train/env_reward_mean` — exploitation gap
- `train/kl` — model confidence
- `eval/eval_return` — actual task performance
- `train/reward_loss` — world model accuracy

### 5.4 Evaluation — Extract Metrics Table

```bash
cd D:/reaserch/THICK

python eval_baseline.py \
  --logdir D:/reaserch/logdir/thick_baseline \
  --out D:/reaserch/results/thick_baseline_table.md
```

Output: Markdown table with `success@budget (mean±std)` and `steps→50% (mean±std)`.

---

## 6. Repository Structure

```
D:/reaserch/
├── dreamerv3-torch/              # Main method (DreamerV3 + exploitation analysis)
│   ├── dreamer.py                # Entry point: training loop, make_env()
│   ├── models.py                 # WorldModel, ImagBehavior, exploitation metrics
│   ├── networks.py               # RSSM, MultiEncoder, MLP layers
│   ├── configs.yaml              # Hyperparameter profiles (incl. antmaze)
│   ├── envs/
│   │   ├── antmaze.py            # *** AntMaze wrapper (our addition) ***
│   │   ├── dmc.py                # DeepMind Control Suite
│   │   └── wrappers.py           # TimeLimit, NormalizeAction, etc.
│   └── run_antmaze.sh            # Convenience launcher
│
├── THICK/                        # Baseline (THICK, ICLR 2024)
│   ├── thick/
│   │   ├── train.py              # Entry point
│   │   ├── agent.py              # THICK agent (DreamerV2 + context)
│   │   ├── configs.yaml          # Profiles (incl. antmaze, crafter)
│   │   └── common/
│   │       ├── envs.py           # *** AntMaze class (our addition) ***
│   │       └── env_register.py   # *** antmaze suite routing (our addition) ***
│   ├── run_baseline.sh           # Sequential 5-seed runner
│   ├── run_baseline_parallel.sh  # Parallel variant
│   ├── eval_baseline.py          # Metrics aggregation → paper table
│   └── requirements_cuda12.txt   # TF 2.14 + CUDA 12 deps
│
├── logdir/                       # All TensorBoard logs (gitignored)
│   ├── antmaze_medium/seed{0-4}/
│   └── thick_baseline/
│       ├── thick_antmaze_medium/seed{0-4}/
│       ├── thick_antmaze_large/seed{0-4}/
│       ├── thick_minihack/seed{0-4}/
│       └── thick_crafter/seed{0-4}/
│
├── results/                      # Paper-ready outputs
│   └── thick_baseline_table.md
│
└── RESEARCH_PIPELINE.md          # This file
```

---

## 7. Software Stack

### Conda Environment: `dreamer`

| Package | Version | Role |
|---|---|---|
| Python | 3.10 | |
| PyTorch | 2.5.1+cu121 | DreamerV3 training |
| TensorFlow | 2.14.1 (and-cuda) | THICK training |
| CUDA | 12.9 (driver) / 12.1 (torch) | GPU compute |
| MuJoCo | 3.5.0 | Physics simulation |
| gymnasium | 1.2.3 | Env API |
| gymnasium-robotics | 1.4.2 | AntMaze environments |
| gym | 0.22.0 | Legacy API (THICK compat) |
| numpy | 1.23.5 | |
| tensorboard | latest | Logging + monitoring |

### Installation

```bash
# Create env
conda create -n dreamer python=3.10
conda activate dreamer

# PyTorch (CUDA 12.1 wheels)
pip install torch==2.5.1+cu121 torchvision --index-url https://download.pytorch.org/whl/cu121

# MuJoCo + environments
pip install mujoco==3.5.0
pip install gymnasium==1.2.3 gymnasium-robotics==1.4.2
pip install gym==0.22.0

# TensorFlow for THICK (CUDA 12 compatible)
pip install -r D:/reaserch/THICK/requirements_cuda12.txt

# Other
pip install tensorboard ruamel.yaml opencv-python einops
```

### Hardware

| Resource | Spec |
|---|---|
| GPU | NVIDIA GTX 1050, 3GB VRAM |
| OS | Windows 10 Pro (MINGW64 / Git Bash) |
| CUDA driver | 12.9 |

**VRAM budget:**
- DreamerV3 antmaze config: ~2.4GB (reduced `dyn_hidden=256`, `batch_size=16`)
- THICK antmaze: ~2.6GB (default THICK nets, batch=16, length=50)

---

## 8. Experimental Protocol

### 8.1 Seeds & Reproducibility

- All experiments: **5 seeds** (0, 1, 2, 3, 4)
- Seed controls: Python `random`, NumPy, PyTorch/TF global seeds, gymnasium env seed
- Set via `--seed N` in both codebases
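
As a rough illustration of what `--seed N` would need to cover, the sketch below seeds the standard-library RNG and, when they are installed, NumPy and PyTorch. The helper name `seed_everything` is hypothetical and is not taken from either codebase. TensorFlow and the environment would be seeded separately (`tf.random.set_seed(seed)` and `env.reset(seed=seed)`).

```python
import os
import random


def seed_everything(seed: int) -> None:
    """Seed the stdlib RNG and, if installed, NumPy and PyTorch."""
    # PYTHONHASHSEED only takes effect in interpreters started after this point.
    os.environ["PYTHONHASHSEED"] = str(seed)
    random.seed(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass  # NumPy not installed; nothing to seed
    try:
        import torch
        torch.manual_seed(seed)
    except ImportError:
        pass  # PyTorch not installed; nothing to seed


if __name__ == "__main__":
    seed_everything(0)
    first = random.random()
    seed_everything(0)
    print(first == random.random())
```

Identical seeds do not guarantee bitwise-identical GPU runs, so the 5-seed mean ± std remains the reported quantity.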

### 8.2 Evaluation Protocol

**DreamerV3:**
- Eval every `eval_every=10k` environment steps
- `eval_episode_num=10` episodes per eval
- Metric: mean episode return across 10 eval episodes

**THICK:**
- Eval every `eval_every=1e5` steps
- `eval_eps=1` (single eval episode per checkpoint)
- Success logged as `log_success` in TensorBoard
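
The arithmetic below works out how many evaluation episodes each protocol yields, assuming one evaluation per full `eval_every` interval and the step budgets from Section 2.1. Both function names are hypothetical.

```python
def eval_episode_count(total_steps: int, eval_every: int, episodes_per_eval: int) -> int:
    """Total number of evaluation episodes over a run."""
    return (total_steps // eval_every) * episodes_per_eval


def evals_in_tail(total_steps: int, eval_every: int, tail: float = 0.1) -> int:
    """Number of evaluation checkpoints that fall in the last `tail` of training."""
    tail_steps = round(total_steps * tail)
    return tail_steps // eval_every


print(eval_episode_count(500_000, 10_000, 10))    # 500 (DreamerV3, AntMaze Medium)
print(eval_episode_count(1_000_000, 100_000, 1))  # 10 (THICK, 1M budget)
print(evals_in_tail(1_000_000, 100_000))          # 1
```

Under these assumptions, and if Success@budget is computed from evaluation episodes only, a 1M-step THICK run contributes a single episode per seed to that metric, so each per-seed value is 0 or 1 before averaging over seeds.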

### 8.3 Data Collection for Paper

```bash
# Step 1: Run all THICK baselines
bash D:/reaserch/THICK/run_baseline.sh \
  /c/Users/user/anaconda3/envs/dreamer/python.exe \
  D:/reaserch/logdir/thick_baseline

# Step 2: Run DreamerV3 on AntMaze (5 seeds)
bash D:/reaserch/run_antmaze.sh

# Step 3: Generate THICK results table
python D:/reaserch/THICK/eval_baseline.py \
  --logdir D:/reaserch/logdir/thick_baseline \
  --out D:/reaserch/results/thick_baseline_table.md

# Step 4: Plot exploitation gap curves
# (TensorBoard → export CSV → matplotlib)
tensorboard --logdir D:/reaserch/logdir/antmaze_medium --port 6006
# Navigate to imag_reward_mean, env_reward_mean → download CSV
```
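
For Step 4, a small sketch of reading the exported CSV files back in and computing the gap per step. It assumes the standard TensorBoard scalar export layout with a `Wall time,Step,Value` header. The function names are hypothetical and plotting is left out.

```python
import csv
import io
from typing import Dict, TextIO


def read_scalar_csv(handle: TextIO) -> Dict[int, float]:
    """Parse a TensorBoard scalar export into {step: value}."""
    return {int(float(row["Step"])): float(row["Value"])
            for row in csv.DictReader(handle)}


def gap_series(imag_csv: TextIO, env_csv: TextIO) -> Dict[int, float]:
    """imag_reward_mean minus env_reward_mean on the steps both files share."""
    imag, env = read_scalar_csv(imag_csv), read_scalar_csv(env_csv)
    return {s: imag[s] - env[s] for s in sorted(imag.keys() & env.keys())}


if __name__ == "__main__":
    # In-memory stand-ins for the two downloaded files.
    imag_file = io.StringIO("Wall time,Step,Value\n1.0,0,0.0\n2.0,100000,0.5\n")
    env_file = io.StringIO("Wall time,Step,Value\n1.0,0,0.0\n2.0,100000,0.25\n")
    print(gap_series(imag_file, env_file))
```

With real exports, replace the `io.StringIO` objects with `open(path, newline="")` handles.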

---

## 9. Key Findings (Running Notes)

> Update this section as experiments complete.

### Expected / Hypothesized

- **AntMaze sparse** → exploitation gap emerges within 100k–200k steps
- **KL drop** precedes exploitation gap growth by ~50k steps (leading indicator)
- **THICK** may show smaller gap due to context gating (sparse context updates = less overfitting)
- **Dense AntMaze** (ablation) → no exploitation gap (reward signal rich enough to constrain model)
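
One simple way to test the leading-indicator hypothesis is to compare when each curve first crosses a threshold. The sketch below does that; the thresholds are arbitrary illustrative choices, the function names are hypothetical, and a proper analysis would likely need smoothing and a cross-correlation over seeds.

```python
from typing import Optional, Sequence, Tuple

Series = Sequence[Tuple[int, float]]  # (step, value), sorted by step


def first_crossing(series: Series, threshold: float, below: bool) -> Optional[int]:
    """First step where the value goes below (or above) `threshold`."""
    for step, value in series:
        if (value < threshold) if below else (value > threshold):
            return step
    return None


def kl_lead_steps(kl: Series, gap: Series,
                  kl_thresh: float, gap_thresh: float) -> Optional[int]:
    """Steps by which the KL drop precedes gap growth (positive means KL leads)."""
    kl_step = first_crossing(kl, kl_thresh, below=True)
    gap_step = first_crossing(gap, gap_thresh, below=False)
    if kl_step is None or gap_step is None:
        return None  # one of the events never happened in the logged range
    return gap_step - kl_step


if __name__ == "__main__":
    # Made-up curves for demonstration only.
    kl = [(0, 1.0), (100_000, 0.8), (200_000, 0.5)]
    gap = [(0, 0.0), (100_000, 0.1), (200_000, 0.39)]
    print(kl_lead_steps(kl, gap, kl_thresh=0.9, gap_thresh=0.3))
```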

### Observed

*(Fill in after runs complete)*

---

## 10. Paper Outline (Draft)

1. **Introduction** — world model agents + sparse reward = uncharted exploitation territory
2. **Background** — DreamerV3 RSSM, imagination rollouts, KL regularization
3. **Exploitation Gap Formalization** — define gap, theoretical conditions for it to grow
4. **Experimental Setup** — AntMaze variants, THICK baseline, metrics
5. **Results**
   - 5a. Exploitation gap dynamics over training
   - 5b. KL as leading indicator
   - 5c. THICK vs DreamerV3 gap comparison
   - 5d. Dense vs sparse reward ablation
6. **Analysis** — which model components drive exploitation
7. **Mitigation** — (if applicable) e.g. KL annealing, reward uncertainty bonus
8. **Conclusion**

---

## 11. Gotchas & Known Issues

| Issue | Cause | Fix |
|---|---|---|
| `MUJOCO_GL` crash on Windows | MuJoCo tries EGL/GLX | Set `MUJOCO_GL=osmesa` in `dreamer.py` |
| TF 2.4 CUDA incompatible | Original THICK reqs | Use `requirements_cuda12.txt` (TF 2.14) |
| `conda run -n dreamer python -c "..."` fails | Multiline args in Git Bash | Write commands to `.py` file, run that |
| `eval_return≈0` throughout | AntMaze sparse, hard exploration | Expected — track `imag_reward_mean` instead |
| AntMaze v5 obs dim = 107 | v5 changed from 29-dim (v4) | Verified: 105 ant + 2 goal = 107 |
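| Low GPU util during THICK | TF graph compilation on first batch; note also that TensorFlow ≥ 2.11 has no GPU support on native Windows (WSL2 is required) | Wait ~5 min for warmup; if utilization stays low, check that `tf.config.list_physical_devices('GPU')` is non-empty |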
| `log_success` not found in TB | THICK uses `log_keys_sum` | Check `log_every` config; may need `log_keys_sum: 'log_success'` |