metadata
license: apache-2.0
language: en
tags:
  - draft-refine
  - block-diffusion
  - nce
  - qwen3-4b

Qwen3-4B Stage-3 NCE — temp07-perround variant

Stage-3 Noise-Contrastive Estimation (NCE) training, resumed from the Stage-2 end ckpt at step 7335. The NCE phase trains the scorer head to rank K=4 candidate completions per block. This variant differs from beam-perround in how proposals are sampled: it uses softmax_sampling at proposal_temperature=0.7 (vs argmax in beam-perround), which is closer to the proposal noise the scorer sees at training time.
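To make the difference between the two proposal modes concrete, here is a minimal sketch in plain Python. It is not the repository's implementation: the function names and the toy logits are hypothetical, and only the temperature 0.7 and K=4 come from this card. It contrasts a deterministic argmax proposal with temperature-scaled softmax sampling.

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def propose_argmax(logits):
    """Deterministic proposal: always the highest-logit index."""
    return max(range(len(logits)), key=lambda i: logits[i])

def propose_sampled(logits, temperature, rng):
    """Stochastic proposal: draw one index from the tempered softmax."""
    probs = softmax(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]

# Toy logits (hypothetical values, for illustration only).
logits = [2.0, 1.5, 0.3, -1.0]
rng = random.Random(0)

greedy = propose_argmax(logits)
# K=4 candidates drawn at temperature 0.7, as in this variant.
sampled = [propose_sampled(logits, 0.7, rng) for _ in range(4)]
```

A temperature below 1 sharpens the distribution toward the top logit while still leaving room for other candidates, so the scorer is shown a more varied candidate set than under argmax.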

Files

| File | Size |
|---|---|
| model.pt | 8.5 GB |
| optimizer.pt | 0.45 GB |
| scheduler.pt | 1.7 KB |
| eval_batches.pt | 13 MB |
| rng_rank{0..23}.pt | 14.7 KB each |

Total: ~9.54 GB / 30 files. Full resume state for re-training.
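Before resuming, it can help to confirm that a downloaded copy contains every named checkpoint file. The sketch below is a hypothetical helper, not part of the repository. It assumes rng_rank{0..23}.pt expands to rng_rank0.pt through rng_rank23.pt, and it covers only the files named in the listing above.

```python
from pathlib import Path

def expected_files(num_ranks=24):
    """File names from the listing above; the rng naming is an assumption."""
    names = ["model.pt", "optimizer.pt", "scheduler.pt", "eval_batches.pt"]
    names += [f"rng_rank{rank}.pt" for rank in range(num_ranks)]
    return names

def missing_files(ckpt_dir, num_ranks=24):
    """Return the expected file names that are absent from ckpt_dir."""
    root = Path(ckpt_dir)
    return [n for n in expected_files(num_ranks) if not (root / n).is_file()]
```

An empty list from `missing_files("path/to/ckpt")` means every named file is present; the path here is a placeholder.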

Step + lineage

  • Resume from: Stage-2 ckpt at step 7335
  • This ckpt: step 9124 (1789 NCE-phase steps)
  • Proposal: combine=softmax_sampling, proposal_temperature=0.7
  • Config: configs/large_scale/qwen3_4b_stage3_nce_resume7335_temp07_perround_6n.yaml

Note on inference reproducibility

When evaluating this ckpt, use:

combine=softmax_sampling proposal_temperature=0.7

to match the training distribution. Using combine=beam_bayes with argmax proposals (as for the perround/lastround variants) on this ckpt feeds the scorer candidates unlike those it was trained on, so its scores are out-of-distribution.
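One way to guard against evaluating with the wrong proposal settings is a small pre-flight check on the key=value overrides. This is a sketch under assumptions: the helper names are hypothetical and the evaluation entry point is not described in this card. Only the two expected values are taken from it.

```python
import math

# Proposal settings this ckpt was trained with (from this card).
TRAINING_PROPOSAL = {"combine": "softmax_sampling", "proposal_temperature": 0.7}

def parse_overrides(args):
    """Parse 'key=value' strings into a dict of raw string values."""
    overrides = {}
    for arg in args:
        key, _, value = arg.partition("=")
        overrides[key] = value
    return overrides

def proposal_mismatches(overrides):
    """Return the keys whose override is missing or differs from training."""
    bad = []
    for key, want in TRAINING_PROPOSAL.items():
        got = overrides.get(key)
        if isinstance(want, float):
            try:
                ok = got is not None and math.isclose(float(got), want)
            except ValueError:
                ok = False  # value present but not numeric
        else:
            ok = got == want
        if not ok:
            bad.append(key)
    return bad
```

For example, `proposal_mismatches(parse_overrides(["combine=beam_bayes", "proposal_temperature=0.7"]))` reports `combine` as the mismatching key.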

Related archives

  • haotiansun014/qwen3-4b-stage3-nce-7335-perround-archive
  • haotiansun014/qwen3-4b-stage3-nce-7335-lastround-archive