Qwen3-8B-Illegible-DPO (LoRA Adapter)
A LoRA adapter fine-tuned on Qwen/Qwen3-8B using Direct Preference Optimization (DPO) to train the model to prefer legible chain-of-thought (CoT) reasoning over illegible (corrupted) reasoning traces.
This is a safety research artifact for studying whether models can be trained to produce reasoning that is harder for external observers to exploit, while maintaining task performance.
Model Description
- Base model: Qwen/Qwen3-8B
- Fine-tuning method: DPO with LoRA
- Training platform: Together AI
- Checkpoint: Step 1448 / 2412 (epoch 0.60)
- Purpose: Safety research -- studying CoT legibility and information leakage in reasoning traces
Intended Use
This adapter is a research artifact for studying chain-of-thought legibility. It is intended for:
- Evaluating whether DPO can reduce how much information external "reader" models can extract from reasoning traces
- Comparing reader accuracy on CoTs from this model vs. baseline generators
- Measuring the trade-off between CoT illegibility and task performance
Not intended for: Production deployment, general-purpose chat, or any application where reasoning transparency is important.
Training Details
Training Data
~20,300 DPO pairs constructed from chain-of-thought traces of three reasoning models (DeepSeek-R1-Distill-Qwen-32B, DeepSeek-R1-0528, QwQ-32B) on three benchmarks:
- GPQA Diamond -- graduate-level science questions (multiple choice)
- MATH (Levels 3-5) -- competition mathematics (open-ended)
- BIG-Bench Hard -- diverse reasoning tasks (6 subtasks, multiple choice)
Each pair consists of:
- Preferred output: The original, coherent CoT wrapped in `<think>` tags, followed by the correct answer
- Non-preferred output: A corrupted version of the same CoT (same correct answer), made illegible via one of 9 off-policy transforms
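As a rough illustration of the pair format described above, the sketch below assembles one preference record. The field names (`prompt`, `chosen`, `rejected`), the helpers `wrap_cot` and `build_pair`, and the exact whitespace around the `<think>` tags are assumptions made for illustration; the schema actually used for training is not documented in this card.

```python
def wrap_cot(cot: str, answer: str) -> str:
    """Wrap a reasoning trace in <think> tags and append the final answer."""
    return f"<think>\n{cot.strip()}\n</think>\n\n{answer.strip()}"


def build_pair(question: str, cot: str, corrupted_cot: str, answer: str) -> dict:
    """Build one preference record (hypothetical schema).

    Both sides carry the same correct answer; only the reasoning differs.
    """
    return {
        "prompt": question,
        "chosen": wrap_cot(cot, answer),              # original, coherent CoT
        "rejected": wrap_cot(corrupted_cot, answer),  # corrupted CoT
    }


# Toy example, not taken from the training set.
pair = build_pair(
    question="What is 2 + 3?",
    cot="Add the two numbers. 2 plus 3 is 5.",
    corrupted_cot="2 plus 3 is 5. Add the two numbers.",
    answer="\\boxed{5}",
)
```

Keeping the answer identical on both sides means the preference signal comes only from the reasoning trace, not from answer correctness.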
Off-Policy Transforms
| Transform | Description | Signal Strength |
|---|---|---|
| Cross-Substitution | Replace entire CoT with reasoning from a different problem | Strong |
| Donor Splice (50%) | Keep first half, splice second half from a different problem | Moderate-Strong |
| Donor Splice (25%) | Keep first quarter, splice remainder from a different problem | Strong |
| Sentence Shuffle | Randomly permute all sentences | Moderate |
| Adjacent Swap | Swap adjacent sentence pairs | Moderate |
| Sentence Deletion (30%) | Randomly drop 30% of sentences | Moderate |
| Distractor Injection | Interleave sentences from a different problem's CoT | Moderate-Strong |
| Number Substitution | Replace numbers with values from a different problem | Weak-Moderate |
| Sentence Reversal | Reverse the order of all sentences | Moderate |
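A minimal sketch of how a few of the sentence-level transforms in the table could be implemented. The sentence splitter, the function names, and details such as which adjacent pairs are swapped or how the donor trace is cut are assumptions; the original corruption code is not part of this card.

```python
import random
import re


def split_sentences(cot: str) -> list:
    """Naive splitter: break after '.', '!' or '?' when followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", cot.strip()) if s]


def sentence_reversal(sents: list) -> list:
    """Reverse the order of all sentences."""
    return sents[::-1]


def adjacent_swap(sents: list) -> list:
    """Swap sentences pairwise: (0,1), (2,3), ... A trailing odd one stays put."""
    out = list(sents)
    for i in range(0, len(out) - 1, 2):
        out[i], out[i + 1] = out[i + 1], out[i]
    return out


def sentence_shuffle(sents: list, rng: random.Random) -> list:
    """Randomly permute all sentences."""
    out = list(sents)
    rng.shuffle(out)
    return out


def sentence_deletion(sents: list, rng: random.Random, frac: float = 0.3) -> list:
    """Drop roughly `frac` of the sentences, keeping the rest in order."""
    n_drop = round(len(sents) * frac)
    drop = set(rng.sample(range(len(sents)), n_drop))
    return [s for i, s in enumerate(sents) if i not in drop]


def donor_splice(sents: list, donor: list, keep_frac: float = 0.5) -> list:
    """Keep the first `keep_frac` of the trace, take the rest from a donor trace."""
    cut = int(len(sents) * keep_frac)
    donor_cut = int(len(donor) * keep_frac)
    return sents[:cut] + donor[donor_cut:]


# Toy traces, for illustration only.
sents = split_sentences("First, set x = 2. Then square it. That gives 4. So the answer is 4.")
donor = split_sentences("Let n be even. Write n = 2k. Then n^2 = 4k^2. Hence n^2 is even.")
spliced = donor_splice(sents, donor)
```

Passing an explicit `random.Random` instance keeps the random transforms reproducible for a fixed seed.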
Training Hyperparameters
| Parameter | Value |
|---|---|
| Training method | DPO |
| DPO beta | 0.3 |
| RPO alpha | 0.5 |
| Normalize log ratios | true |
| Learning rate | 5e-6 |
| LR scheduler | Cosine (0.5 cycles) |
| Warmup ratio | 0.05 |
| Batch size | 8 |
| Epochs | 1 (checkpoint at 0.60 epochs) |
| Max gradient norm | 1.0 |
| Weight decay | 0.0 |
| Seed | 42 |
| Max sequence length | 40,960 tokens |
| Sequence packing | false |
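To make the learning-rate settings above concrete, here is a sketch of linear warmup followed by cosine decay with 0.5 cycles. It follows the formula commonly used for this schedule (for example in `transformers.get_cosine_schedule_with_warmup`); whether Together AI uses exactly this shape, and how it rounds the warmup length, are assumptions. The total of 2412 steps is taken from the checkpoint description.

```python
import math

PEAK_LR = 5e-6
TOTAL_STEPS = 2412
WARMUP_STEPS = int(0.05 * TOTAL_STEPS)  # 120, assuming the ratio is rounded down
NUM_CYCLES = 0.5


def lr_at(step: int) -> float:
    """Learning rate at a given optimizer step under warmup + cosine decay."""
    if step < WARMUP_STEPS:
        # Linear ramp from 0 to the peak learning rate.
        return PEAK_LR * step / max(1, WARMUP_STEPS)
    progress = (step - WARMUP_STEPS) / max(1, TOTAL_STEPS - WARMUP_STEPS)
    # With 0.5 cycles the cosine falls monotonically from the peak to 0.
    cosine = 0.5 * (1.0 + math.cos(math.pi * 2.0 * NUM_CYCLES * progress))
    return PEAK_LR * max(0.0, cosine)


lr_at_checkpoint = lr_at(1448)  # learning rate at the released checkpoint
```

Under these assumptions the released checkpoint (step 1448) sits past the midpoint of the decay, so its learning rate is already below half of the peak.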
LoRA Configuration
| Parameter | Value |
|---|---|
| LoRA rank (r) | 16 |
| LoRA alpha | 16 |
| LoRA dropout | 0.0 |
| Target modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
| Trainable modules | all-linear |
| PEFT type | LoRA |
| Task type | CAUSAL_LM |
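The arithmetic behind the rank and alpha settings can be sketched as follows. In standard LoRA the low-rank update is scaled by alpha / r, and each adapted linear layer gains r * (d_in + d_out) trainable parameters. The 4096 x 4096 layer shape below is a hypothetical example, not a statement about the base model's dimensions. In PEFT, the table's values correspond to the `r`, `lora_alpha`, `lora_dropout`, `target_modules`, and `task_type` arguments of `LoraConfig`.

```python
LORA_R = 16
LORA_ALPHA = 16

# Standard LoRA scales the low-rank update B @ A by alpha / r.
scaling = LORA_ALPHA / LORA_R


def lora_added_params(d_in: int, d_out: int, r: int = LORA_R) -> int:
    """Trainable parameters added to one linear layer: A is (r x d_in), B is (d_out x r)."""
    return r * (d_in + d_out)


# Hypothetical square projection, for illustration only.
example_params = lora_added_params(4096, 4096)
```

With alpha equal to the rank, the scaling factor is 1, so the adapter update is applied without extra amplification or damping.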
Training Metrics
Training loss fell from an initial DPO loss of 0.6914 at step 1 to 0.0021 at step 1448, with reward accuracy reaching 100% by step 966.
| Step | Train Loss | Eval Loss | Reward Margin | Reward Accuracy |
|---|---|---|---|---|
| 1 | 0.6914 | -- | 0.00 | 0.0% |
| 484 | ~0.02 | 0.0151 | 8.60 | 99.9% |
| 966 | ~0.007 | 0.0070 | 13.06 | 100.0% |
| 1448 | 0.0021 | 0.0056 | 14.20 | 100.0% |
W&B run: dipika-khullar/encoded-reasoning-rl/dro0ufu7
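For readers unfamiliar with the columns in the table above, the sketch below shows how the standard sigmoid DPO loss, reward margin, and reward accuracy are usually computed from sequence log-probabilities. It deliberately omits the RPO term and the log-ratio normalization listed in the hyperparameters, and the function name `dpo_stats` is an assumption, so it will not reproduce the logged values exactly.

```python
import math

BETA = 0.3  # DPO beta from the hyperparameter table


def dpo_stats(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=BETA):
    """Each argument is a list of sequence log-probabilities, one entry per pair."""
    losses, margins = [], []
    for pc, pr, rc, rr in zip(policy_chosen, policy_rejected, ref_chosen, ref_rejected):
        chosen_reward = beta * (pc - rc)      # implicit reward of the preferred output
        rejected_reward = beta * (pr - rr)    # implicit reward of the non-preferred output
        margin = chosen_reward - rejected_reward
        margins.append(margin)
        # -log(sigmoid(margin)), written in a numerically stable form.
        if margin > 0:
            losses.append(math.log1p(math.exp(-margin)))
        else:
            losses.append(-margin + math.log1p(math.exp(margin)))
    n = len(margins)
    return {
        "loss": sum(losses) / n,
        "reward_margin": sum(margins) / n,
        "reward_accuracy": sum(m > 0 for m in margins) / n,
    }


# Before any update the policy equals the reference, so every margin is 0.
initial = dpo_stats([-10.0], [-10.0], [-10.0], [-10.0])
# Toy case where the policy has moved toward the preferred output.
later = dpo_stats([-5.0], [-20.0], [-10.0], [-10.0])
```

At initialization this gives a loss of ln 2 (about 0.693), a margin of 0, and, because a zero margin is not counted as correct, an accuracy of 0. That pattern is consistent with the step 1 row of the table.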
Training Notes
- 6 out of 19,289 training samples were truncated to the 40,960 token max sequence length (0.03%)
- 2 out of 1,015 validation samples were truncated (0.20%)
- This checkpoint is an intermediate adapter at step 1448 of a 2412-step run (60% through training)
- Base model parameter count: 8.19B
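The figures in these notes can be cross-checked with a few lines of arithmetic. The only assumption is that the final partial batch counts as a step, which is what makes 19,289 samples at batch size 8 come out to 2412 steps.

```python
import math

train_samples, val_samples = 19_289, 1_015
batch_size = 8
checkpoint_step = 1448

total_pairs = train_samples + val_samples                 # matches the "~20,300 DPO pairs" figure
steps_per_epoch = math.ceil(train_samples / batch_size)   # last partial batch counted as a step
epoch_fraction = checkpoint_step / steps_per_epoch        # position of this checkpoint in the epoch

train_truncated_pct = 100 * 6 / train_samples
val_truncated_pct = 100 * 2 / val_samples
```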
How to Use
With PEFT
```python
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the base model, then attach the LoRA adapter on top of it.
base_model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-8B")
model = PeftModel.from_pretrained(base_model, "JackHopkinsGSL/Qwen3-8B-Illegible-DPO")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-8B")

messages = [
    {"role": "user", "content": "What is the integral of x^2 from 0 to 5?\n\nAfter your reasoning, state your final answer inside \\boxed{} notation."}
]
# Render the chat template to a prompt string, then tokenize it.
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=4096)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
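For evaluation it is often useful to separate the reasoning trace from the final answer in the generated text. The helpers below are a sketch under two assumptions: that the decoded text still contains literal `<think>` and `</think>` markers, and that the answer is given in `\boxed{}` notation as requested in the prompt above. The names `split_reasoning` and `extract_boxed` are illustrative, not part of any library.

```python
import re
from typing import Optional, Tuple


def split_reasoning(completion: str) -> Tuple[str, str]:
    """Return (reasoning, remainder). Reasoning is empty if no closed <think> block exists."""
    match = re.search(r"<think>(.*?)</think>", completion, flags=re.DOTALL)
    if match is None:
        return "", completion.strip()
    return match.group(1).strip(), completion[match.end():].strip()


def extract_boxed(text: str) -> Optional[str]:
    """Return the contents of the last \\boxed{...}, allowing nested braces."""
    marker = "\\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return None
    begin = start + len(marker)
    depth = 1
    for i in range(begin, len(text)):
        if text[i] == "{":
            depth += 1
        elif text[i] == "}":
            depth -= 1
            if depth == 0:
                return text[begin:i]
    return None  # unbalanced braces


# Hand-written sample completion, not real model output.
sample = "<think>\nThe antiderivative is x^3/3. At 5 this is 125/3.\n</think>\n\nThe answer is \\boxed{\\frac{125}{3}}."
reasoning, remainder = split_reasoning(sample)
answer = extract_boxed(remainder)
```

Because the decoded sequence also contains the prompt, which itself mentions `\boxed{}`, it is safer to decode only the newly generated tokens (for example `outputs[0][inputs["input_ids"].shape[1]:]`) before applying these helpers.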
With Together AI (serverless)
Once the full training run completes, the model will be available on Together AI:
```python
from together import Together

client = Together()  # reads TOGETHER_API_KEY from the environment
response = client.chat.completions.create(
    model="fellows_safety/Qwen3-8B-ft-dpo-legibility-offpolicy-8b-74cbfd61",
    messages=[{"role": "user", "content": "Your question here"}],
    max_tokens=4096,
)
print(response.choices[0].message.content)
```
Evaluation
Evaluation is ongoing. The key metrics of interest are:
- Task accuracy -- Does the DPO model still solve GPQA/MATH/BBH correctly? (Comparing against base Qwen3-8B)
- Reader accuracy -- Can external reader models extract the correct answer from this model's CoTs? (Lower = more illegible)
- Legibility classification -- How do CoTs classify under the ANSWER_LEAKED / REASONING_LEGIBLE / ILLEGIBLE taxonomy?
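The first two metrics can be computed with a simple exact-match comparison, sketched below. The record fields, the normalization rule, and the toy data are all assumptions for illustration; the actual evaluation harness and its answer-matching logic are not described in this card.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class EvalRecord:
    gold: str           # reference answer for the problem
    model_answer: str   # final answer produced by the generator
    reader_answer: str  # answer a reader model inferred from the CoT alone


def normalize(answer: str) -> str:
    """Very light normalization; real graders are usually more lenient."""
    return answer.strip().lower()


def summarize(records: List[EvalRecord]) -> Dict[str, float]:
    """Exact-match task accuracy and reader accuracy over a list of records."""
    n = len(records)
    task_hits = sum(normalize(r.model_answer) == normalize(r.gold) for r in records)
    reader_hits = sum(normalize(r.reader_answer) == normalize(r.gold) for r in records)
    return {"task_accuracy": task_hits / n, "reader_accuracy": reader_hits / n}


# Made-up records, for illustration only.
records = [
    EvalRecord(gold="B", model_answer="B", reader_answer="B"),
    EvalRecord(gold="A", model_answer="A", reader_answer="C"),
    EvalRecord(gold="42", model_answer="42", reader_answer="7"),
    EvalRecord(gold="D", model_answer="C", reader_answer="A"),
]
summary = summarize(records)
```

Reporting both numbers side by side makes the trade-off visible: a drop in reader accuracy is only informative if task accuracy stays close to the baseline.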
Limitations
- This is an intermediate checkpoint (step 1448/2412). The final checkpoint may have different properties.
- The DPO reward accuracy reached 100% early, which may indicate the corruptions were too easy to distinguish from originals. More subtle transforms or harder negatives might be needed for stronger illegibility training.
- The model was trained on CoTs from three specific reasoning models (DeepSeek-R1-Distill-Qwen-32B, DeepSeek-R1-0528, QwQ-32B), so illegibility effects may not transfer to CoTs generated by other model families.
- This is a LoRA adapter -- it must be used with the Qwen/Qwen3-8B base model.
Citation
If you use this model in your research, please cite:
```bibtex
@misc{qwen3-8b-illegible-dpo-2026,
  title={Qwen3-8B-Illegible-DPO: Training Illegible Chain-of-Thought via DPO},
  author={Safety Research Fellows},
  year={2026},
  publisher={HuggingFace},
  url={https://huggingface.co/JackHopkinsGSL/Qwen3-8B-Illegible-DPO}
}
```
Framework Versions
- PEFT: 0.15.1
- Together AI SDK: v2
- Training infrastructure: Together AI