prometheus04 committed · Commit f331a2f · verified · 1 Parent(s): 58737d9

Add INSTANCE_RUNBOOK.md for Claude-on-instance briefing

  1. docs/INSTANCE_RUNBOOK.md +287 -0
docs/INSTANCE_RUNBOOK.md ADDED
@@ -0,0 +1,287 @@
+ # Instance Runbook: read this FIRST when starting on the A100 box
+
+ > **For Claude Code:** This is your briefing when the user logs you into a fresh
+ > Vast.ai A100 instance. Read this file before touching anything.
+
+ ## What we are doing
+
+ Fine-tuning **Qwen3-4B-Thinking-2507** with LoRA on **26,627 terminal-agent
+ trajectories**. Single A100-40GB. Target: beat 13% on Terminal-Bench 2.0.
+
+ The data, scripts, and docs are already on HuggingFace under user `prometheus04`.
+ This box is the GPU rental for the actual training run.
+
+ Full context: read `docs/PROJECT_OVERVIEW.md` and `docs/HPC_PRINCIPLES.md` if
+ you need it, but you usually won't; the runbook below is self-contained.
+
+ ## The user is watching
+
+ The user wants **constant visibility** during training:
+ - Live progress bar (added in `train_v2.py`)
+ - Step/total, ETA, tok/s, GPU mem%, loss EMA, estimated cost
+ - A regression alert if throughput drops below 5k tok/s
+
+ You don't need to babysit beyond that. The progress callback handles it.
+
+ ## The plan, in order
+
+ ```
+ 1. Verify hardware             (1 min)
+ 2. Clone the project repo      (30 sec)
+ 3. Pull the dataset            (3-5 min, ~1 GB)
+ 4. Install training stack      (3 min)
+ 5. Smoke test 50 steps         (10 min)   <-- CHECKPOINT: must pass before step 6
+ 6. Full training (1 epoch)     (4-5 hr)
+ 7. Merge LoRA into base        (2 min)
+ 8. Upload artifacts to HF      (5 min)
+ ```
+
+ ## Step 1: Verify hardware
+
+ ```bash
+ nvidia-smi --query-gpu=name,memory.total,driver_version,compute_cap --format=csv
+ ```
+
+ Expected:
+ - name = `NVIDIA A100-SXM4-40GB` or `NVIDIA A100-PCIE-40GB`
+ - memory.total ≥ 40000 MiB
+ - driver_version ≥ 535
+ - compute_cap = `8.0`
+
+ If anything's wrong:
+ - Wrong GPU model → tell the user to destroy and re-rent. Do not proceed.
+ - driver < 535 → still works with CUDA 12.4 toolkit, but flag it.
+
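The expectations above can be checked mechanically. This is a minimal Python sketch, assuming it is handed one data row of the CSV that the `nvidia-smi` query prints (header line excluded). The helper name `check_gpu_row` is hypothetical and not part of the repo.

```python
def check_gpu_row(row: str) -> list[str]:
    """Return a list of problems found in one nvidia-smi CSV data row.

    Expects the four fields queried in this step, in order:
    name, memory.total, driver_version, compute_cap.
    """
    name, mem, driver, cap = [field.strip() for field in row.split(",")]
    problems = []
    if name not in ("NVIDIA A100-SXM4-40GB", "NVIDIA A100-PCIE-40GB"):
        problems.append(f"wrong GPU model: {name}")
    mem_mib = int(mem.split()[0])  # "40960 MiB" -> 40960
    if mem_mib < 40000:
        problems.append(f"memory.total too small: {mem_mib} MiB")
    driver_major = int(driver.split(".")[0])  # "535.129.03" -> 535
    if driver_major < 535:
        problems.append(f"driver {driver} is older than 535 (flag it)")
    if cap != "8.0":
        problems.append(f"unexpected compute capability: {cap}")
    return problems
```

An empty list means the box matches the expectations. A wrong-model entry is the only one the runbook treats as a hard stop.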
+ Also check disk:
+ ```bash
+ df -h /workspace
+ ```
+ Need ≥40 GB free for model + dataset + cache + checkpoints.
+
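The same disk check can be done from Python, which is handy if it needs to gate a script. This is a small sketch using only the standard library. The function names are hypothetical, and it counts decimal GB, so the number can differ slightly from what `df -h` shows.

```python
import shutil


def free_gb(path: str = "/workspace") -> float:
    """Free space on the filesystem holding `path`, in GB (10**9 bytes)."""
    return shutil.disk_usage(path).free / 1e9


def has_enough_disk(path: str = "/workspace", needed_gb: float = 40.0) -> bool:
    """True if at least `needed_gb` GB are free, the threshold used in this step."""
    return free_gb(path) >= needed_gb
```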
+ ## Step 2: Clone the project repo
+
+ ```bash
+ cd /workspace
+ git clone https://huggingface.co/prometheus04/qwen3-4b-thinking-microagent project
+ cd project
+ ls scripts/ docs/
+ ```
+
+ The HF model repo holds all scripts and docs. If `git clone` is slow, the box
+ has a bad network path; flag it to the user, but proceed.
+
+ ## Step 3: Pull the dataset
+
+ ```bash
+ pip install -q huggingface_hub
+ huggingface-cli download prometheus04/microagent-train-v2 \
+   microagent_train_v2.jsonl \
+   --repo-type dataset \
+   --local-dir data
+ ```
+
+ After download, verify:
+ ```bash
+ ls -la data/microagent_train_v2.jsonl
+ wc -l data/microagent_train_v2.jsonl   # should print 26627
+ ```
+
+ If the line count is wrong, the file is corrupted or truncated; re-download.
+
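A line count alone does not catch a record that was cut off mid-line. The sketch below counts rows and also checks that each one parses as JSON. It assumes one JSON object per line; `validate_jsonl` is a hypothetical helper, not something shipped in the repo.

```python
import json


def validate_jsonl(path: str, expected_rows: int = 26627) -> int:
    """Count non-empty lines in `path`, checking that each parses as JSON.

    Raises ValueError on a malformed line or an unexpected row count.
    """
    rows = 0
    with open(path, encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            if not line.strip():
                continue  # tolerate blank lines, e.g. after a trailing newline
            try:
                json.loads(line)
            except json.JSONDecodeError as exc:
                raise ValueError(f"line {lineno} is not valid JSON: {exc}") from exc
            rows += 1
    if rows != expected_rows:
        raise ValueError(f"expected {expected_rows} rows, found {rows}")
    return rows
```

Usage on the instance would be `validate_jsonl("data/microagent_train_v2.jsonl")`, which returns 26627 or raises.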
+ ## Step 4: Install training stack
+
+ ```bash
+ bash scripts/setup_a100.sh
+ ```
+
+ Watch for these in the output:
+ - `torch: 2.5.1+cu124` ✓
+ - `cuda available: True` ✓
+ - `flash_attn: 2.7.4.post1` ✓
+ - `unsloth: imported OK` ✓
+ - `bf16 supported: True` ✓
+
+ Common failure: `flash-attn` install fails because the torch version isn't
+ settled yet (an install-ordering race under uv).
+ - Fix: `pip install flash-attn==2.7.4.post1 --no-build-isolation` after torch is settled.
+
+ Alternative failure: the image already has its own torch version, and Unsloth may complain.
+ - Fix: `pip install --upgrade --force-reinstall torch==2.5.1+cu124 --index-url https://download.pytorch.org/whl/cu124`
+
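If the installer output has scrolled away, the pinned versions can be re-checked without importing the heavy packages. This sketch reads installed package metadata via the standard library. The distribution names in `EXPECTED` are assumptions about how the packages are registered with pip, and `version_mismatches` is a hypothetical helper.

```python
from importlib import metadata
from typing import Optional

# Versions this runbook expects. The distribution names are assumptions;
# adjust them if `pip list` shows different names on the box.
EXPECTED = {"torch": "2.5.1+cu124", "flash-attn": "2.7.4.post1"}


def version_mismatches(expected: dict[str, str]) -> dict[str, Optional[str]]:
    """Map each mismatched package to its installed version (None if missing)."""
    mismatches = {}
    for name, want in expected.items():
        try:
            have = metadata.version(name)
        except metadata.PackageNotFoundError:
            have = None
        if have != want:
            mismatches[name] = have
    return mismatches
```

`version_mismatches(EXPECTED)` returning an empty dict means both pins match. It does not replace the `cuda available` and `bf16 supported` checks, which need torch itself.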
+ ## Step 5: Smoke test (MANDATORY)
+
+ ```bash
+ mkdir -p runs   # tee needs the directory to exist before the pipeline starts
+ python scripts/train_v2.py \
+   --output-dir runs/smoke \
+   --max-steps 50 \
+   --eval-frac 0.005 \
+   2>&1 | tee runs/smoke.log
+ ```
+
+ This takes ~10 minutes. The first run also tokenizes the corpus (~5 min; the result is cached for later runs).
+
+ **MUST-PASS checks** before proceeding to the real run:
+
+ | Check | What to look for |
+ |---|---|
+ | Loss decreases | `loss=2.5` ish at step 10 → `loss=1.5` ish at step 50 |
+ | Throughput | Live status line shows `~12-15k tok/s` after step 20 |
+ | GPU memory | `mem 22-26 GB / 40 GB` (~60% utilization) |
+ | No regression alert | The `!! WARNING: throughput ...` line did NOT print |
+ | Final mem | Peak GPU mem reported at end is under 30 GB |
+ | No NaN/Inf | No `loss=nan` or `grad_norm=inf` in any log |
+
+ If ANY of these fail, STOP. Debug before the real run.
+
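Two of the must-pass checks (loss decreases, no NaN/Inf loss) can be read straight from `runs/smoke.log`. The sketch below assumes the log contains `loss=<value>` tokens as in the table above. The exact log format of `train_v2.py` is not shown in this runbook, so treat the regex as an assumption; `smoke_loss_check` is a hypothetical helper.

```python
import math
import re

# Matches "loss=1.842", "loss=nan", "loss=inf" (format assumed from this runbook).
LOSS_RE = re.compile(r"loss=(nan|inf|-?\d+(?:\.\d+)?)", re.IGNORECASE)


def smoke_loss_check(log_text: str) -> list[str]:
    """Return failed checks for the loss values found in a smoke-test log."""
    losses = [float(m.group(1)) for m in LOSS_RE.finditer(log_text)]
    if not losses:
        return ["no loss values found in log"]
    failures = []
    if any(math.isnan(x) or math.isinf(x) for x in losses):
        failures.append("NaN/Inf loss present")
    elif losses[-1] >= losses[0]:
        # Compare first and last logged loss; individual steps may be noisy.
        failures.append(f"loss did not decrease: {losses[0]} -> {losses[-1]}")
    return failures
```

The throughput, memory, and regression-alert rows still need to be read off the live status line.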
+ Common failures and fixes:
+ - `Triton kernel compilation failed` → CUDA mismatch. Re-run `setup_a100.sh`.
+ - `flash_attn import error` → wrong wheel. Reinstall flash-attn for torch 2.5.1+cu124.
+ - Throughput under 8k tok/s → packing or FlashAttention 2 is probably disabled. Check for `packing=True` in the run log and `attn_implementation="flash_attention_2"` in the model load.
+ - OOM at step 1 → lower the sequence length: pass `--max-seq-len 12288`.
+ - Tokenization takes >10 min → bad disk. Tell the user; consider a different instance.
+
+ If the smoke test passes, delete `runs/smoke/` to save disk before the real run:
+ ```bash
+ rm -rf runs/smoke
+ ```
+
+ ## Step 6: Full training run
+
+ Use `tmux` so the run survives an SSH disconnect:
+
+ ```bash
+ tmux new -s train
+ ```
+
+ Inside tmux:
+ ```bash
+ python scripts/train_v2.py \
+   --model Qwen/Qwen3-4B-Thinking-2507 \
+   --data data/microagent_train_v2.jsonl \
+   --output-dir runs/v1 \
+   --epochs 1.0 \
+   2>&1 | tee -a runs/train.log
+ ```
+
+ `tee -a` appends, so re-running this command after a crash keeps the earlier log.
+
+ Detach with `Ctrl-B`, then `D`. Reattach later with `tmux attach -t train`.
+
+ Expected progress output every 10 steps (this is the live status the user wants):
+ ```
+ step 100/1664 [###....................................] 6.0% | 13.2k tok/s | mem 24.3/40GB (60%) | loss=1.842 | ETA 04:12 | $0.30
+ step 110/1664 [###....................................] 6.6% | 13.1k tok/s | mem 24.3/40GB (60%) | loss=1.821 | ETA 04:10 | $0.33
+ step 120/1664 [####...................................] 7.2% | 13.4k tok/s | mem 24.4/40GB (60%) | loss=1.798 | ETA 04:07 | $0.36
+ ```
+
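To report these numbers to the user or feed them into checks, a status line can be parsed into fields. The sketch below assumes the line format is exactly the one in the sample output above; if `train_v2.py` prints something different, the regex needs adjusting. `parse_status` is a hypothetical helper.

```python
import re

# Field layout assumed from the sample status lines in this runbook.
STATUS_RE = re.compile(
    r"step\s+(?P<step>\d+)/(?P<total>\d+).*?"
    r"(?P<ktok>\d+(?:\.\d+)?)k tok/s.*?"
    r"mem\s+(?P<used>\d+(?:\.\d+)?)/(?P<cap>\d+(?:\.\d+)?)GB.*?"
    r"loss=(?P<loss>\d+(?:\.\d+)?).*?"
    r"\$(?P<cost>\d+(?:\.\d+)?)"
)


def parse_status(line: str):
    """Parse one live status line into a dict, or return None if it does not match."""
    m = STATUS_RE.search(line)
    if m is None:
        return None
    return {
        "step": int(m["step"]),
        "total": int(m["total"]),
        "tok_per_s": round(float(m["ktok"]) * 1000),  # "13.2k" -> 13200
        "mem_used_gb": float(m["used"]),
        "mem_total_gb": float(m["cap"]),
        "loss": float(m["loss"]),
        "cost_usd": float(m["cost"]),
    }
```

Lines that are not status lines (warnings, library output) return `None`, so the function can be mapped over `runs/train.log` and filtered.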
+ **Total step count is approximately 1,664** (26,627 trajectories ÷ 16 effective
+ batch, with packing fitting ~1 trajectory per sequence on average).
+
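The arithmetic behind that estimate, as a worked example. Whether the trainer drops or keeps the final partial batch is not stated here, so both roundings are shown.

```python
import math

trajectories = 26_627
effective_batch = 16      # sequences per optimizer step
sequences = trajectories  # assumes ~1 trajectory per packed sequence

steps_floor = sequences // effective_batch           # last partial batch dropped
steps_ceil = math.ceil(sequences / effective_batch)  # last partial batch kept
print(steps_floor, steps_ceil)  # 1664 1665
```

If packing ends up fitting more than one trajectory per sequence, the real total will be lower than this.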
+ What to monitor:
+ - Throughput stays steady around 12-15k tok/s
+ - Loss trends downward (judge the smoothed trend, not individual steps)
+ - GPU memory stays around 24-28 GB
+ - ETA decreases by roughly 1 hour every hour
+ - Cost estimate grows linearly with elapsed time
+
+ **Bail-out conditions** (tell the user and stop):
+ - Throughput drops below 5k tok/s and stays there for 3 consecutive logs
+ - Loss diverges (rising for 5+ consecutive logs)
+ - GPU memory hits >95% repeatedly
+ - The regression-alert warning prints
+
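The first three bail-out conditions are mechanical enough to sketch in code. The function below takes per-log histories, oldest first. Treating "repeatedly" as 3 of the last 5 logs is an assumption of this sketch, since the runbook does not define it, and `should_bail` is a hypothetical helper.

```python
def should_bail(tok_per_s, losses, mem_frac):
    """Return the reasons to stop, given per-log histories (oldest first)."""
    reasons = []
    # Throughput below 5k tok/s for the 3 most recent logs.
    if len(tok_per_s) >= 3 and all(t < 5000 for t in tok_per_s[-3:]):
        reasons.append("throughput < 5k tok/s for 3 consecutive logs")
    # Five consecutive rises need six consecutive loss values.
    recent = losses[-6:]
    if len(recent) == 6 and all(b > a for a, b in zip(recent, recent[1:])):
        reasons.append("loss rising for 5 consecutive logs")
    # "Repeatedly" is not defined in the runbook; 3 of the last 5 logs is an
    # assumption made for this sketch.
    if sum(m > 0.95 for m in mem_frac[-5:]) >= 3:
        reasons.append("GPU memory above 95% repeatedly")
    return reasons
```

The fourth condition, the regression-alert warning, is printed by the training script itself and only needs a text search of the log.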
+ The training script saves a checkpoint every 200 steps to `runs/v1/checkpoint-XXX`.
+ If the run dies, re-running the same command resumes from the latest checkpoint
+ automatically.
+
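Before resuming, it can help to confirm which checkpoint is the latest. This sketch picks the `checkpoint-<step>` directory with the highest step number. It compares steps numerically, because a plain string sort would rank `checkpoint-800` above `checkpoint-1000`. How `train_v2.py` itself selects the checkpoint is not shown in this runbook; `latest_checkpoint` is a hypothetical helper.

```python
import re
from pathlib import Path
from typing import Optional


def latest_checkpoint(run_dir: str) -> Optional[Path]:
    """Return the `checkpoint-<step>` directory with the highest step, if any."""
    best_step, best_path = -1, None
    for path in Path(run_dir).glob("checkpoint-*"):
        m = re.fullmatch(r"checkpoint-(\d+)", path.name)
        if m and path.is_dir() and int(m.group(1)) > best_step:
            best_step, best_path = int(m.group(1)), path
    return best_path
```

For example, `latest_checkpoint("runs/v1")` returns the directory to inspect, or `None` if no checkpoint has been written yet.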
+ ## Step 7: Merge LoRA into base
+
+ After training completes:
+
+ ```bash
+ python scripts/merge_lora.py \
+   --base Qwen/Qwen3-4B-Thinking-2507 \
+   --adapter runs/v1/final \
+   --out runs/v1/merged
+ ```
+
+ Output: ~8 GB merged model in `runs/v1/merged/`, ready for vLLM.
+
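For orientation, here is roughly what a LoRA merge involves. This is a sketch using the public `peft` and `transformers` APIs, not the contents of `scripts/merge_lora.py`, which may do the merge differently (for example through Unsloth). The function names are hypothetical. The size helper shows where the ~8 GB figure comes from: about 4 billion parameters at 2 bytes each in bf16.

```python
def bf16_size_gb(n_params: float) -> float:
    """Approximate checkpoint size in GB for bf16 weights (2 bytes per parameter)."""
    return n_params * 2 / 1e9


def merge_adapter(base: str, adapter_dir: str, out_dir: str) -> None:
    """Fold LoRA weights into the base model and save a standalone checkpoint."""
    # Imported lazily so this file can be loaded on a machine without the GPU stack.
    import torch
    from peft import PeftModel
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model = AutoModelForCausalLM.from_pretrained(base, torch_dtype=torch.bfloat16)
    model = PeftModel.from_pretrained(model, adapter_dir)
    model = model.merge_and_unload()  # base model with the LoRA deltas applied
    model.save_pretrained(out_dir, safe_serialization=True)
    # Save the tokenizer alongside so the output directory is self-contained.
    AutoTokenizer.from_pretrained(base).save_pretrained(out_dir)
```

On the instance, prefer the repo's own script; this sketch is only a fallback reference if the script needs debugging.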
+ ## Step 8: Upload artifacts to HF
+
+ **Before destroying the instance**, get the artifacts off the box:
+
+ ```bash
+ # Upload LoRA adapter (small, fast)
+ huggingface-cli upload prometheus04/qwen3-4b-thinking-microagent \
+   runs/v1/final \
+   adapter-v1 \
+   --token "$HF_TOKEN"
+
+ # Upload training log
+ huggingface-cli upload prometheus04/qwen3-4b-thinking-microagent \
+   runs/train.log \
+   runs/train.log \
+   --token "$HF_TOKEN"
+
+ # Optionally upload merged model (8 GB, takes 5-10 min)
+ huggingface-cli upload prometheus04/qwen3-4b-thinking-microagent \
+   runs/v1/merged \
+   merged-v1 \
+   --token "$HF_TOKEN"
+ ```
+
+ Verify in browser before telling the user it's safe to destroy the instance:
+ - https://huggingface.co/prometheus04/qwen3-4b-thinking-microagent/tree/main
+
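The same verification can be done from the instance without a browser. This sketch lists the repo's files with `huggingface_hub.HfApi.list_repo_files` and checks that the upload targets from the commands above are present. The helper names are hypothetical, and the default `required` tuple assumes the `adapter-v1` and `runs/train.log` destinations used in this step.

```python
def missing_uploads(repo_files, required=("adapter-v1/", "runs/train.log")):
    """Return the required paths or prefixes that no file in the repo matches."""
    return [
        r for r in required
        if not any(f == r or f.startswith(r) for f in repo_files)
    ]


def fetch_repo_files(repo_id="prometheus04/qwen3-4b-thinking-microagent", token=None):
    """List files in the model repo (needs network access and huggingface_hub)."""
    from huggingface_hub import HfApi

    return HfApi().list_repo_files(repo_id, token=token)
```

`missing_uploads(fetch_repo_files())` returning an empty list means the adapter and the log are visible on the Hub. Add `"merged-v1/"` to `required` if the merged model was uploaded too.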
+ ## Reference card
+
+ | Need | Command |
+ |---|---|
+ | Current GPU usage | `nvidia-smi` |
+ | Disk free | `df -h /workspace` |
+ | Reattach training | `tmux attach -t train` |
+ | Tail training log | `tail -f runs/train.log` |
+ | Kill the run cleanly | `tmux send-keys -t train C-c` |
+ | Resume after crash | re-run the same `train_v2.py` command (auto-resumes from `runs/v1/checkpoint-*`) |
+
+ ## Decision tree if things go sideways
+
+ ```
+ training not progressing?
+ ├── tok/s < 5k → packing/FA2 issue → check imports, fall back to --no-packing
+ ├── tok/s > 12k but loss not decreasing → LR too high, drop to 1e-4
+ ├── tok/s normal but mem > 35GB → drop --max-seq-len to 12288
+ ├── tokenization stalls > 10 min → disk too slow, switch instance
+ ├── flash_attn not importable → reinstall matching wheel
+ ├── unsloth import fails → reinstall: pip install "unsloth[cu124-torch250] @ git+..."
+ └── checkpoint corrupt on resume → delete latest checkpoint dir, restart
+ ```
+
+ ## Cost guardrails
+
+ - $0.80/hr × 5.5 hr = ~$4.40 total expected
+ - If we hit $8 and are still <50% through training, something is wrong; pause and investigate
+ - Always destroy the instance after upload; don't leave it running
+
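The guardrails above reduce to two small calculations. This is a sketch with hypothetical function names; the $0.80/hr default and the $8 / 50% threshold are the figures from this section.

```python
def run_cost(hours: float, rate_per_hr: float = 0.80) -> float:
    """Rental cost in USD for `hours` of instance time, rounded to cents."""
    return round(hours * rate_per_hr, 2)


def over_budget(spent_usd: float, step: int, total_steps: int) -> bool:
    """Guardrail from this section: $8 spent with under 50% of steps done."""
    return spent_usd >= 8.0 and step / total_steps < 0.5
```

For example, `run_cost(5.5)` gives the expected total of 4.4, and `over_budget(8.0, 700, 1664)` is true because step 700 is only about 42% of the run.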
+ ## Key files in this repo
+
+ | File | Purpose |
+ |---|---|
+ | `scripts/train_v2.py` | THE script: HPC training |
+ | `scripts/setup_a100.sh` | One-shot installer |
+ | `scripts/merge_lora.py` | Adapter → merged model |
+ | `data/microagent_train_v2.jsonl` | 26,627 training trajectories |
+ | `docs/HPC_PRINCIPLES.md` | Every optimization explained |
+ | `docs/VAST_AI_SETUP.md` | Generic Vast.ai workflow |
+ | `docs/INSTANCE_RUNBOOK.md` | This file (you are here) |
+
+ ## What the user wants from you on the instance
+
+ 1. **Confirm the box is good** (step 1)
+ 2. **Run the smoke test and report the must-pass checks** (step 5)
+ 3. **Start the real training run in tmux** (step 6); the user wants to see the live progress
+ 4. **Watch for the regression alert** during training
+ 5. **Merge + upload after training completes** (steps 7-8)
+ 6. **Confirm uploads are visible on HF before letting the user destroy the instance**
+
+ The user is paying ~$0.80/hr. Don't waste cycles. Don't re-derive things in
+ this runbook from first principles; just execute.