+# Instance Runbook — read this FIRST when starting on the A100 box
+> **For Claude Code:** This is your briefing when the user logs you into a fresh
+> Vast.ai A100 instance. Read this file before touching anything.
+## What we are doing
+Fine-tuning **Qwen3-4B-Thinking-2507** with LoRA on **26,627 terminal-agent
+trajectories**. Single A100-40GB. Target: beat 13% on Terminal-Bench 2.0.
+The data, scripts, and docs are already on HuggingFace under user `prometheus04`.
+This box is the GPU rental for the actual training run.
+Full context: read `docs/PROJECT_OVERVIEW.md` and `docs/HPC_PRINCIPLES.md` if
+you need it, but you usually won't — the runbook below is self-contained.
+## The user is watching
+The user wants **constant visibility** during training:
+- Live progress bar (added in `train_v2.py`)
+- Step/total, ETA, tok/s, GPU mem%, loss EMA, estimated cost
+- A regression alert if throughput drops below 5k tok/s
+You don't need to babysit beyond that. The progress callback handles it.
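The fields in that status line are simple to derive. The sketch below shows one way to compute them; it is not the actual callback in `train_v2.py`. The function names and the EMA smoothing factor `alpha=0.1` are assumptions, and the $0.80/hr rate is the estimate quoted elsewhere in this runbook.

```python
def ema_update(prev, value, alpha=0.1):
    """Exponential moving average; the first sample seeds the average.

    alpha=0.1 is an assumed smoothing factor, not a value from train_v2.py.
    """
    return value if prev is None else alpha * value + (1 - alpha) * prev


def eta_seconds(step, total_steps, elapsed_s):
    """Remaining time, extrapolated from the mean step time so far."""
    if step <= 0:
        return float("inf")
    return (total_steps - step) * (elapsed_s / step)


def cost_usd(elapsed_s, rate_per_hr=0.80):
    """Rental cost so far at a flat hourly rate."""
    return elapsed_s / 3600 * rate_per_hr


def throughput_alert(tok_per_s, floor=5_000):
    """True when throughput is below the 5k tok/s regression floor."""
    return tok_per_s < floor
```

Smoothing the loss with an EMA is what makes the trend readable: single-step losses on packed batches bounce around, while the average moves slowly.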
+## The plan, in order
+```
+1. Verify hardware              (1 min)
+2. Clone the project repo       (30 sec)
+3. Pull the dataset             (3-5 min, ~1 GB)
+4. Install training stack       (3 min)
+5. Smoke test 50 steps          (10 min)  <-- CHECKPOINT: must pass before step 6
+6. Full training (1 epoch)      (4-5 hr)
+7. Merge LoRA into base         (2 min)
+8. Upload artifacts to HF       (5 min)
+```
+## Step 1 — Verify hardware
+```bash
+nvidia-smi --query-gpu=name,memory.total,driver_version,compute_cap --format=csv
+```
+Expected:
+- name = `NVIDIA A100-SXM4-40GB` or `NVIDIA A100-PCIE-40GB`
+- memory.total ≥ 40000 MiB
+- driver_version ≥ 535
+- compute_cap = `8.0`
+If anything's wrong:
+- Wrong GPU model → tell the user to destroy and re-rent. Do not proceed.
+- driver < 535 → the cu124 wheels usually still work on driver 525 or newer (CUDA 12.x minor-version compatibility), but flag it.
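The expected values above can be checked mechanically. This is a minimal sketch, assuming the row is one data line of the CSV output (the header line skipped, for example via `--format=csv,noheader`). `check_gpu` and `ACCEPTED_NAMES` are hypothetical names, and the sample rows in the usage note are illustrative, not captured from a real box.

```python
ACCEPTED_NAMES = {"NVIDIA A100-SXM4-40GB", "NVIDIA A100-PCIE-40GB"}


def check_gpu(csv_row):
    """Return a list of problems for one data row of the nvidia-smi CSV output."""
    name, mem, driver, cap = [field.strip() for field in csv_row.split(",")]
    problems = []
    if name not in ACCEPTED_NAMES:
        problems.append(f"wrong GPU model: {name}")
    if int(mem.split()[0]) < 40000:  # field looks like "40960 MiB"
        problems.append(f"too little memory: {mem}")
    if int(driver.split(".")[0]) < 535:  # compare the major version only
        problems.append(f"driver older than 535: {driver}")
    if cap != "8.0":
        problems.append(f"unexpected compute capability: {cap}")
    return problems
```

For example, `check_gpu("NVIDIA A100-SXM4-40GB, 40960 MiB, 535.129.03, 8.0")` returns an empty list, while a row for a different GPU model returns one entry per failed expectation. A wrong model is the only hard stop; an old driver is something to flag.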
+Also check disk:
+```bash
+df -h /workspace
+```
+Need ≥40 GB free for: model + dataset + cache + checkpoints.
+## Step 2 — Clone the project repo
+```bash
+cd /workspace
+git clone https://huggingface.co/prometheus04/qwen3-4b-thinking-microagent project
+cd project
+ls scripts/ docs/
+```
+The HF model repo holds all scripts and docs. If `git clone` is slow, the box
+has a bad network path — flag to user, but proceed.
+## Step 3 — Pull the dataset
+```bash
+pip install -q huggingface_hub
+huggingface-cli download prometheus04/microagent-train-v2 \
+  microagent_train_v2.jsonl \
+  --repo-type dataset \
+  --local-dir data
+```
+After download, verify:
+```bash
+ls -la data/microagent_train_v2.jsonl
+wc -l data/microagent_train_v2.jsonl   # should print 26627
+```
+If line count is wrong, the file is corrupted — re-download.
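`wc -l` only counts newlines, so a file that is truncated mid-record can still look plausible. As a stricter check, the sketch below also parses every line as JSON. `validate_jsonl` is a hypothetical helper, not part of the repo's scripts.

```python
import json


def validate_jsonl(path, expected_records=26627):
    """Count non-empty lines and make sure each one parses as JSON."""
    count = 0
    with open(path, encoding="utf-8") as fh:
        for lineno, line in enumerate(fh, start=1):
            if not line.strip():
                continue  # tolerate blank lines, e.g. after a trailing newline
            try:
                json.loads(line)
            except json.JSONDecodeError as exc:
                raise ValueError(f"line {lineno} is not valid JSON: {exc}") from exc
            count += 1
    if count != expected_records:
        raise ValueError(f"expected {expected_records} records, found {count}")
    return count
```

Calling `validate_jsonl("data/microagent_train_v2.jsonl")` either returns 26627 or raises with the first problem it finds.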
+## Step 4 — Install training stack
+```bash
+bash scripts/setup_a100.sh
+```
+Watch for these in the output:
+- `torch: 2.5.1+cu124` ✓
+- `cuda available: True` ✓
+- `flash_attn: 2.7.4.post1` ✓
+- `unsloth: imported OK` ✓
+- `bf16 supported: True` ✓
+Common failure: the `flash-attn` install fails because the matching torch
+version isn't installed yet (an install-order race under `uv`).
+- Fix: `pip install flash-attn==2.7.4.post1 --no-build-isolation` after torch is settled.
+Another failure: the image ships its own preinstalled torch, and Unsloth may complain about the version mismatch.
+- Fix: `pip install --upgrade --force-reinstall torch==2.5.1+cu124 --index-url https://download.pytorch.org/whl/cu124`
+## Step 5 — Smoke test (MANDATORY)
+```bash
+mkdir -p runs   # tee cannot create runs/smoke.log if runs/ does not exist yet
+python scripts/train_v2.py \
+  --output-dir runs/smoke \
+  --max-steps 50 \
+  --eval-frac 0.005 \
+  2>&1 | tee runs/smoke.log
+```
+This takes ~10 minutes. The first run tokenizes the corpus (~5 min); the result is cached for later runs.
+**MUST-PASS checks** before proceeding to the real run:
+| Check | What to look for |
+|---|---|
+| Loss decreases | `loss=2.5` ish at step 10 → `loss=1.5` ish at step 50 |
+| Throughput | Live status line shows `~12-15k tok/s` after step 20 |
+| GPU memory | `mem 22-26 GB / 40 GB` (~60% utilization) |
+| No regression alert | The `!! WARNING: throughput ...` line did NOT print |
+| Final mem | Peak GPU mem reported at end is under 30 GB |
+| No NaN/Inf | No `loss=nan` or `grad_norm=inf` in any log |
+If ANY of these fail, STOP. Debug before the real run.
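The must-pass table can be written as one go/no-go function. This is a sketch only: the values are read off the smoke log by hand (no log format is assumed here), `smoke_failures` is a hypothetical name, and the 12k tok/s floor is taken from the low end of the throughput range in the table. It does not cover `grad_norm=inf`, which still needs a manual look at the log.

```python
import math


def smoke_failures(loss_step10, loss_step50, tok_per_s, peak_mem_gb, alert_printed):
    """Return the names of failed must-pass checks (empty list means proceed)."""
    failures = []
    if not (math.isfinite(loss_step10) and math.isfinite(loss_step50)):
        failures.append("NaN/Inf loss")
    elif loss_step50 >= loss_step10:
        failures.append("loss did not decrease")
    if tok_per_s < 12_000:
        failures.append("throughput below 12k tok/s")
    if peak_mem_gb >= 30:
        failures.append("peak memory at or above 30 GB")
    if alert_printed:
        failures.append("regression alert printed")
    return failures
```

Reporting the returned list verbatim to the user is usually enough; an empty list is the signal to move on to Step 6.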
+Common failures and fixes:
+- `Triton kernel compilation failed` → CUDA mismatch. Re-run `setup_a100.sh`.
+- `flash_attn import error` → wrong wheel. Reinstall flash-attn for torch 2.5.1+cu124.
+- Throughput under 8k tok/s → packing or FlashAttention is probably disabled. Check for `packing=True` in the run log and `attn_implementation="flash_attention_2"` in the model load.
+- OOM at step 1 → lower the sequence length: pass `--max-seq-len 12288`.
+- Tokenization takes >10 min → bad disk. Tell user; consider a different instance.
+If smoke test passes: delete `runs/smoke/` to save disk before the real run:
+```bash
+rm -rf runs/smoke
+```
+## Step 6 — Full training run
+Use `tmux` so the run survives SSH disconnect:
+```bash
+tmux new -s train
+```
+Inside tmux:
+```bash
+python scripts/train_v2.py \
+  --model Qwen/Qwen3-4B-Thinking-2507 \
+  --data data/microagent_train_v2.jsonl \
+  --output-dir runs/v1 \
+  --epochs 1.0 \
+  2>&1 | tee runs/train.log
+```
+Detach with `Ctrl-B`, then `D`. Reattach later with `tmux attach -t train`.
+Expected progress output every 10 steps (this is the live status the user wants):
+```
+step   100/1664 [###....................................]   6.0% | 13.2k tok/s | mem 24.3/40GB (60%) | loss=1.842 | ETA 04:12 | $0.30
+step   110/1664 [###....................................]   6.6% | 13.1k tok/s | mem 24.3/40GB (60%) | loss=1.821 | ETA 04:10 | $0.33
+step   120/1664 [####...................................]   7.2% | 13.4k tok/s | mem 24.4/40GB (60%) | loss=1.798 | ETA 04:07 | $0.36
+```
+**Total step count is approximately 1,664** (26,627 trajectories ÷ 16 effective
+batch, with packing fitting ~1 trajectory per sequence on average).
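The arithmetic behind that estimate, as a short sketch. `estimate_steps` is a hypothetical helper; the packing ratio of 1 trajectory per sequence is the runbook's own estimate, and the real count depends on how `train_v2.py` packs and whether it drops the last partial batch.

```python
def estimate_steps(num_trajectories, effective_batch, trajectories_per_sequence=1.0):
    """Full optimizer steps in one epoch, given an average packing ratio."""
    packed_sequences = num_trajectories / trajectories_per_sequence
    return int(packed_sequences // effective_batch)


print(estimate_steps(26_627, 16))       # 1664
print(estimate_steps(26_627, 16, 2.0))  # 832 if two trajectories fit per sequence
```

If the total shown in the live status line differs a lot from 1,664, the packing ratio is not what this runbook assumes, and the ETA and cost estimates shift with it.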
+What to monitor:
+- Throughput stays steady around 12-15k tok/s
+- Loss trends downward (judge the smoothed EMA; individual steps are noisy)
+- GPU memory stays around 24-28 GB
+- ETA decreases by roughly 1 hour every hour
+- Cost estimate grows linearly with elapsed time
+**Bail-out conditions** (tell the user and stop):
+- Throughput drops below 5k tok/s and stays there for 3 consecutive logs
+- Loss diverges (rising for 5+ consecutive logs)
+- GPU memory hits >95% repeatedly
+- The regression-alert warning prints
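The two trend-based conditions depend on consecutive log lines, which is easy to misjudge by eye. A minimal sketch, assuming the throughput and loss values have been collected from the status lines in order, and reading "rising for 5+ consecutive logs" as five increases in a row. Both function names are hypothetical.

```python
def trailing_run(values, predicate):
    """Length of the run of most recent values that satisfy predicate."""
    run = 0
    for value in reversed(values):
        if not predicate(value):
            break
        run += 1
    return run


def bail_reasons(tok_per_s_log, loss_log):
    """Check the two trend-based bail-out conditions against recent log values."""
    reasons = []
    if trailing_run(tok_per_s_log, lambda t: t < 5_000) >= 3:
        reasons.append("throughput below 5k tok/s for 3 consecutive logs")
    # Pair each logged loss with its predecessor: True where the loss rose.
    rises = [later > earlier for earlier, later in zip(loss_log, loss_log[1:])]
    if trailing_run(rises, bool) >= 5:
        reasons.append("loss rising for 5+ consecutive logs")
    return reasons
```

Only the most recent run counts, so a single slow log followed by recovery does not trigger a stop.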
+The training script saves a checkpoint every 200 steps to `runs/v1/checkpoint-XXX`.
+If the run dies, re-running the same command resumes from the latest checkpoint
+automatically.
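To see which checkpoint a resume would most likely pick up, compare the step numbers numerically, not as strings (`checkpoint-1000` sorts before `checkpoint-200` lexicographically). This sketch assumes the `checkpoint-<step>` naming described above; `latest_checkpoint` is a hypothetical helper, and how `train_v2.py` itself selects the checkpoint is not shown in this runbook.

```python
import re
from pathlib import Path


def latest_checkpoint(run_dir):
    """Return the checkpoint-<step> directory with the highest step, or None."""
    best_step, best_path = -1, None
    for path in Path(run_dir).glob("checkpoint-*"):
        match = re.fullmatch(r"checkpoint-(\d+)", path.name)
        if path.is_dir() and match and int(match.group(1)) > best_step:
            best_step, best_path = int(match.group(1)), path
    return best_path
```

`latest_checkpoint("runs/v1")` is also the directory to remove if a resume fails on a corrupt checkpoint.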
+## Step 7 — Merge LoRA into base
+After training completes:
+```bash
+python scripts/merge_lora.py \
+  --base Qwen/Qwen3-4B-Thinking-2507 \
+  --adapter runs/v1/final \
+  --out runs/v1/merged
+```
+Output: ~8 GB merged model in `runs/v1/merged/` ready for vLLM.
+## Step 8 — Upload artifacts to HF
+**Before destroying the instance**, get the artifacts off the box:
+```bash
+# Upload LoRA adapter (small, fast)
+huggingface-cli upload prometheus04/qwen3-4b-thinking-microagent \
+  runs/v1/final \
+  adapter-v1 \
+  --token $HF_TOKEN
+# Upload training log
+huggingface-cli upload prometheus04/qwen3-4b-thinking-microagent \
+  runs/train.log \
+  runs/train.log \
+  --token $HF_TOKEN
+# Optionally upload merged model (8 GB — takes 5-10 min)
+huggingface-cli upload prometheus04/qwen3-4b-thinking-microagent \
+  runs/v1/merged \
+  merged-v1 \
+  --token $HF_TOKEN
+```
+Verify in browser before telling the user it's safe to destroy the instance:
+- https://huggingface.co/prometheus04/qwen3-4b-thinking-microagent/tree/main
+## Reference card
+| Need | Command |
+|---|---|
+| Current GPU usage | `nvidia-smi` |
+| Disk free | `df -h /workspace` |
+| Reattach training | `tmux attach -t train` |
+| Tail training log | `tail -f runs/train.log` |
+| Kill the run cleanly | `tmux send-keys -t train C-c` |
+| Resume after crash | re-run the same `train_v2.py` command (auto-resumes from `runs/v1/checkpoint-*`) |
+## Decision tree if things go sideways
+```
+training not progressing?
+├── tok/s < 5k → packing/FA2 issue → check imports, fall back to --no-packing
+├── tok/s > 12k but loss not decreasing → LR too high, drop to 1e-4
+├── tok/s normal but mem > 35GB → drop --max-seq-len to 12288
+├── tokenization stalls > 10 min → disk too slow, switch instance
+├── flash_attn not importable → reinstall matching wheel
+├── unsloth import fails → reinstall: pip install "unsloth[cu124-torch250] @ git+..."
+└── checkpoint corrupt on resume → delete latest checkpoint dir, restart
+```
+## Cost guardrails
+- $0.80/hr × 5.5 hr = ~$4.40 total expected
+- If we hit $8 and still <50% through training, something is wrong — pause and investigate
+- Always destroy the instance after upload; don't leave it running
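The guardrail numbers above reduce to a few lines of arithmetic. A sketch with hypothetical function names; the rate, duration, and thresholds are the ones stated in this section.

```python
def expected_cost(rate_per_hr=0.80, hours=5.5):
    """Planned spend for the whole session."""
    return rate_per_hr * hours


def projected_total(spent_usd, fraction_done):
    """Linear extrapolation of the final cost from progress so far."""
    if fraction_done <= 0:
        return float("inf")
    return spent_usd / fraction_done


def over_budget(spent_usd, fraction_done, limit_usd=8.0, min_fraction=0.5):
    """True when the pause-and-investigate rule applies."""
    return spent_usd >= limit_usd and fraction_done < min_fraction
```

`expected_cost()` gives about $4.40, matching the estimate above. A projected total well beyond that is an early warning before the $8 rule is reached.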
+## Key files in this repo
+| File | Purpose |
+|---|---|
+| `scripts/train_v2.py` | THE script — HPC training |
+| `scripts/setup_a100.sh` | One-shot installer |
+| `scripts/merge_lora.py` | Adapter → merged model |
+| `data/microagent_train_v2.jsonl` | 26,627 training trajectories |
+| `docs/HPC_PRINCIPLES.md` | Every optimization explained |
+| `docs/VAST_AI_SETUP.md` | Generic Vast.ai workflow |
+| `docs/INSTANCE_RUNBOOK.md` | This file (you are here) |
+## What the user wants from you on the instance
+1. **Confirm the box is good** (step 1)
+2. **Run the smoke test and report the must-pass checks** (step 5)
+3. **Start the real training run in tmux** (step 6) — user wants to see the live progress
+4. **Watch for the regression alert** during training
+5. **Merge + upload after training completes** (steps 7-8)
+6. **Confirm uploads are visible on HF before letting user destroy the instance**
+The user is paying ~$0.80/hr. Don't waste cycles. Don't re-derive things in
+this runbook from first principles — just execute.