metadata
license: apache-2.0
language: en
tags:
  - draft-refine
  - block-diffusion
  - nce
  - qwen3-4b

Qwen3-4B Stage-3 NCE — beam-perround variant

Stage-3 Noise-Contrastive Estimation (NCE) training, resumed from the final Stage-2 checkpoint at step 7335. The NCE phase trains the scorer head to rank K=4 candidate completions per block, using beam-bayes proposal sampling and per-round score combination.
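As a rough illustration of ranking K candidates with an NCE-style objective, the sketch below computes a softmax cross-entropy over K scorer outputs for one block. It is a minimal sketch under assumptions: the function name `ranking_nce_loss`, the notion of a single "positive" candidate, and the example scores are all hypothetical, and the actual loss in `scripts/train_unified.py` may differ.

```python
import math

def ranking_nce_loss(scores, positive_index):
    """Softmax cross-entropy over K candidate scores (ranking-style NCE).

    scores: list of K real-valued scorer outputs, one per candidate.
    positive_index: index of the candidate treated as the positive
    (hypothetical labelling; the real recipe may define this differently).
    """
    m = max(scores)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_z - scores[positive_index]

# K=4 candidates for one block; candidate 2 is the assumed positive.
uniform = ranking_nce_loss([0.0, 0.0, 0.0, 0.0], positive_index=2)
confident = ranking_nce_loss([0.0, 0.0, 5.0, 0.0], positive_index=2)
print(f"uniform scores:   {uniform:.4f}")
print(f"confident scorer: {confident:.4f}")
```

With uniform scores the loss equals log K, and it shrinks as the scorer separates the positive from the other candidates.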

Files

| File | Size |
|---|---|
| model.pt | 8.5 GB |
| optimizer.pt | 0.45 GB |
| scheduler.pt | 1.7 KB |
| eval_batches.pt | 13 MB |
| rng_rank{0..23}.pt | 14.7 KB each |

Total: ~9.54 GB / 30 files. Full resume state for re-training.
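Before resuming, it can help to confirm that a downloaded checkpoint directory contains every file named in the table above. The following is a minimal sketch; the helper names are hypothetical, and it assumes the RNG files are named `rng_rank0.pt` through `rng_rank23.pt` without zero padding. It covers only the 28 files the table names, not anything else in the repository.

```python
from pathlib import Path

NUM_RANKS = 24  # rng_rank{0..23}.pt, one RNG state file per rank

def expected_checkpoint_files(num_ranks=NUM_RANKS):
    """Return the file names listed in the Files table."""
    core = ["model.pt", "optimizer.pt", "scheduler.pt", "eval_batches.pt"]
    rng = [f"rng_rank{rank}.pt" for rank in range(num_ranks)]
    return core + rng

def missing_checkpoint_files(ckpt_dir):
    """Return the expected file names that are absent from ckpt_dir."""
    root = Path(ckpt_dir)
    return [name for name in expected_checkpoint_files()
            if not (root / name).is_file()]

if __name__ == "__main__":
    missing = missing_checkpoint_files("./checkpoint-00008706-20260501_081812/")
    print("missing files:", missing or "none")
```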

Step + lineage

  • Resume from: Stage-2 ckpt at step 7335 (4B Qwen3-Base CPT pipeline)
  • This ckpt: step 8706 (1371 steps of NCE training)
  • Backbone: Qwen3-4B-Base (frozen during NCE phase)
  • Training script: scripts/train_unified.py
  • Config: configs/large_scale/qwen3_4b_stage3_nce_resume7335_beam_perround_6n.yaml

Eval (Stage-3 BoN, K=4, R=1, α=0.5, beam_bayes argmax)

| Benchmark | Accuracy |
|---|---|
| GSM8K (1319q) | 79.83% (1053/1319) |
| MATH-500 (500q) | 48.00% (240/500) |
| HumanEval (164q) | 56.10% (92/164) |
| GPQA-diamond (193q) | 36.27% (70/193) |
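For readers unfamiliar with best-of-N (BoN) reranking, the sketch below picks one of K=4 candidates by a combined score. It rests on a loud assumption: here α is treated as a linear interpolation weight between a proposal log-probability and the scorer output, which this card does not confirm. The function names, the tuple layout, and the example numbers are hypothetical.

```python
def combined_score(proposal_logprob, scorer_score, alpha=0.5):
    # ASSUMPTION: alpha linearly mixes the two signals. The project's
    # actual beam_bayes combination rule may be different.
    return (1.0 - alpha) * proposal_logprob + alpha * scorer_score

def best_of_n(candidates, alpha=0.5):
    """candidates: list of (text, proposal_logprob, scorer_score) tuples."""
    best = max(candidates, key=lambda c: combined_score(c[1], c[2], alpha))
    return best[0]

# K=4 made-up candidates for a single prompt.
candidates = [
    ("a", -1.0, 0.2),
    ("b", -2.0, 3.0),
    ("c", -0.5, -1.0),
    ("d", -3.0, 0.0),
]
print(best_of_n(candidates, alpha=0.5))
```

Under this assumed rule, α=0 reduces to picking the most likely proposal and α=1 to trusting the scorer alone.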

Compared to Stage-2 baseline at step 7335 (greedy uncommitted_soft):

| Benchmark | Stage-3 NCE | Stage-2 baseline | Δ |
|---|---|---|---|
| GSM8K | 79.83% | 82.64% | −2.81pp |
| MATH-500 | 48.00% | 51.00% | −3.00pp |
| HumanEval | 56.10% | 60.98% | −4.88pp |
| GPQA | 36.27% | 36.27% | 0.00pp |
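The accuracy and Δ figures above can be recomputed from the reported counts. This is a small sketch: the Stage-3 counts and the Stage-2 percentages are copied from this card, while the helper names are hypothetical.

```python
# benchmark -> (Stage-3 correct, total questions, Stage-2 baseline accuracy in %)
RESULTS = {
    "GSM8K": (1053, 1319, 82.64),
    "MATH-500": (240, 500, 51.00),
    "HumanEval": (92, 164, 60.98),
    "GPQA-diamond": (70, 193, 36.27),
}

def accuracy_pct(correct, total):
    """Accuracy as a percentage, rounded to two decimals."""
    return round(100.0 * correct / total, 2)

def delta_pp(stage3_pct, baseline_pct):
    """Difference in percentage points (Stage-3 minus baseline)."""
    return round(stage3_pct - baseline_pct, 2)

rows = {}
for name, (correct, total, baseline) in RESULTS.items():
    acc = accuracy_pct(correct, total)
    rows[name] = (acc, baseline, delta_pp(acc, baseline))
    print(f"{name:13s} {acc:6.2f}%  {baseline:6.2f}%  {rows[name][2]:+.2f}pp")
```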

The scorer rerank is currently neutral on GPQA and 2.8 to 4.9 pp below the Stage-2 baseline on the other three benchmarks; see the notes on the score_scale plateau in the project README. A learnable-scale re-train (-b2fix variant) is queued.

Loading

from draft_refine.training.checkpointing import load_full_state
ckpt = load_full_state("./checkpoint-00008706-20260501_081812/")
# ckpt.model contains the DiffusionLM with scorer head attached

Related archives

  • haotiansun014/qwen3-4b-stage3-nce-7335-lastround-archive — same training recipe but with combine=lastround (final-round score, not per-round).
  • haotiansun014/qwen3-4b-stage3-nce-7335-temp07-archive — softmax_sampling proposal at T=0.7 (vs argmax in beam-perround).
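The differences between these variants can be illustrated with a toy sketch: argmax versus temperature-scaled softmax sampling for the proposal, and per-round versus last-round score combination. The mode names `argmax`, `softmax_sampling`, and `lastround` come from this card; the name `perround`, the use of a mean over rounds, and all function names are assumptions made for illustration, not the project's confirmed implementation.

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax; lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    z = sum(exps)
    return [e / z for e in exps]

def propose(logits, mode="argmax", temperature=0.7, rng=None):
    """Pick a token index either greedily or by sampling."""
    if mode == "argmax":
        return max(range(len(logits)), key=lambda i: logits[i])
    if mode == "softmax_sampling":
        rng = rng or random.Random()
        probs = softmax(logits, temperature)
        return rng.choices(range(len(logits)), weights=probs, k=1)[0]
    raise ValueError(f"unknown proposal mode: {mode}")

def combine_rounds(round_scores, combine="perround"):
    """Reduce a candidate's per-round scorer outputs to a single score."""
    if combine == "lastround":
        return round_scores[-1]  # final-round score only
    if combine == "perround":
        # ASSUMPTION: per-round combination is modelled as a mean here.
        return sum(round_scores) / len(round_scores)
    raise ValueError(f"unknown combine mode: {combine}")

logits = [0.1, 2.0, -1.0]
greedy = propose(logits, mode="argmax")
sampled = propose(logits, mode="softmax_sampling", temperature=0.7,
                  rng=random.Random(0))
print(greedy, sampled, combine_rounds([1.0, 3.0], "lastround"))
```

With a single refinement round (R=1), the two combination modes in this sketch return the same value, so they only diverge when more than one round is scored.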