Reinforcement Learning via Self-Distillation
Paper • 2601.20802 • Published • 50
qwen3-8b-biology-1h is the ~1 hour wall-clock checkpoint of Qwen/Qwen3-8B trained on biology with an SDPO-style self-distillation pipeline.
This model follows the SDPO method from:
step_10safetensorswambosec/qwen3-8b-biology-1hQwen/Qwen3-8Bsciknoweval/biology (train split)k=100) + tail bucketuv run sdft @ configs/sdft/generalization.toml \
--trainer.data.dataset_name=../SDPO/datasets/sciknoweval/biology \
--trainer.ckpt.interval=10 \
--trainer.ckpt.keep-last=1 \
--trainer.ckpt.weights.save-format=safetensors \
--trainer.ckpt.weights.save-sharded
## Intended Use
Research checkpoint for:
- early training-dynamics analysis,
- biology-domain probing,
- continuation finetuning.
## Limitations
- This is an intermediate checkpoint, not a final converged model.
- No full safety/alignment evaluation is claimed here.
- Metrics are not reported as a final benchmark release.
## Usage
from transformers import AutoTokenizer, AutoModelForCausalLM
repo = "wambosec/qwen3-8b-biology-1h"
tok = AutoTokenizer.from_pretrained(repo, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
repo,
torch_dtype="auto",
device_map="auto",
trust_remote_code=True,
)
## Citation
If you use this checkpoint, please cite SDPO:
- https://arxiv.org/abs/2601.20802v1