---
license: apache-2.0
language: en
tags:
- draft-refine
- block-diffusion
- nce
- qwen3-4b
---

# Qwen3-4B Stage-3 NCE: beam-perround variant

Stage-3 Noise-Contrastive-Estimation (NCE) training, resumed from the Stage-2 end ckpt at step 7335. The NCE phase trains the scorer head to rank K=4 candidate completions per block, with beam-bayes proposal sampling and per-round score combination.

## Files

| File | Size |
|---|---|
| `model.pt` | 8.5 GB |
| `optimizer.pt` | 0.45 GB |
| `scheduler.pt` | 1.7 KB |
| `eval_batches.pt` | 13 MB |
| `rng_rank{0..23}.pt` | 14.7 KB each |

Total: ~9.54 GB / 30 files. This is the full resume state for re-training.

## Step + lineage

- Resume from: Stage-2 ckpt at step 7335 (4B Qwen3-Base CPT pipeline)
- This ckpt: step 8706 (1371 steps of NCE training)
- Backbone: Qwen3-4B-Base (frozen during NCE phase)
- Training script: `scripts/train_unified.py`
- Config: `configs/large_scale/qwen3_4b_stage3_nce_resume7335_beam_perround_6n.yaml`

## Eval (Stage-3 BoN, K=4, R=1, α=0.5, beam_bayes argmax)

| benchmark | acc |
|---|---:|
| GSM8K (1319q) | 79.83% (1053/1319) |
| MATH-500 (500q) | 48.00% (240/500) |
| HumanEval (164q) | 56.10% (92/164) |
| GPQA-diamond (193q) | 36.27% (70/193) |

Compared to the Stage-2 baseline at step 7335 (greedy uncommitted_soft):

| benchmark | Stage-3 NCE | Stage-2 baseline | Δ |
|---|---:|---:|---:|
| GSM8K | 79.83% | 82.64% | −2.81pp |
| MATH-500 | 48.00% | 51.00% | −3.00pp |
| HumanEval | 56.10% | 60.98% | −4.88pp |
| GPQA | 36.27% | 36.27% | 0.00pp |

The scorer rerank is currently neutral to negative: it matches the baseline on GPQA and trails it by 2.81 to 4.88 pp on the other three benchmarks. See the notes on the `score_scale` plateau in the project README. A learnable-scale re-train (`-b2fix` variant) is queued.
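
The accuracies and deltas reported above can be re-derived from the raw counts. The following is a minimal sketch for sanity-checking the tables; the variable and function names are illustrative and are not part of the training or eval scripts.

```python
# Counts (correct, total) copied from the Stage-3 eval table.
stage3_counts = {
    "GSM8K": (1053, 1319),
    "MATH-500": (240, 500),
    "HumanEval": (92, 164),
    "GPQA": (70, 193),
}

# Stage-2 baseline accuracies in percent, copied from the comparison table.
stage2_acc = {
    "GSM8K": 82.64,
    "MATH-500": 51.00,
    "HumanEval": 60.98,
    "GPQA": 36.27,
}


def accuracy_pct(correct, total):
    """Accuracy as a percentage, rounded to two decimals like the tables."""
    return round(100.0 * correct / total, 2)


stage3_acc = {name: accuracy_pct(c, t) for name, (c, t) in stage3_counts.items()}

# Delta in percentage points: Stage-3 minus Stage-2 (negative means a regression).
deltas = {name: round(stage3_acc[name] - stage2_acc[name], 2) for name in stage3_acc}

for name in stage3_acc:
    print(f"{name:10s} {stage3_acc[name]:6.2f}% {stage2_acc[name]:6.2f}% {deltas[name]:+.2f}pp")
```

Keeping the counts next to the percentages makes it easy to recompute the comparison once the queued re-train has been evaluated.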
## Loading

```python
from draft_refine.training.checkpointing import load_full_state

ckpt = load_full_state("./checkpoint-00008706-20260501_081812/")
# ckpt.model contains the DiffusionLM with scorer head attached
```

## Related archives

- `haotiansun014/qwen3-4b-stage3-nce-7335-lastround-archive`: same training recipe but with `combine=lastround` (final-round score, not per-round).
- `haotiansun014/qwen3-4b-stage3-nce-7335-temp07-archive`: softmax_sampling proposal at T=0.7 (vs. argmax in beam-perround).
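
To illustrate the difference between the argmax proposal used here and the softmax-sampling proposal at T=0.7 in the `temp07` archive, here is a generic sketch of the two selection rules over a vector of logits. It is not the `draft_refine` implementation: the function names `propose_argmax` and `propose_softmax` and the example logits are hypothetical, and the real proposal code may differ (for example in how it operates over blocks and beams).

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax; lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def propose_argmax(logits):
    """Deterministic proposal: always pick the highest-logit index."""
    return max(range(len(logits)), key=lambda i: logits[i])


def propose_softmax(logits, temperature=0.7, rng=random):
    """Stochastic proposal: sample an index from the temperature-scaled softmax."""
    probs = softmax(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]


# Hypothetical logits for four candidate tokens (illustration only).
logits = [2.0, 1.0, 0.5, -1.0]
rng = random.Random(0)  # fixed seed so repeated runs give the same samples

print("argmax proposal:", propose_argmax(logits))
print("softmax proposals at T=0.7:", [propose_softmax(logits, 0.7, rng) for _ in range(8)])
```

With argmax, every proposal drawn from the same logits is identical, so any diversity among the K candidates has to come from elsewhere in the pipeline; sampling at T=0.7 trades some per-candidate likelihood for diversity.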