Qwen3-8B-Illegible-DPO (LoRA Adapter)

A LoRA adapter for Qwen/Qwen3-8B, fine-tuned with Direct Preference Optimization (DPO) on preference pairs that favor legible chain-of-thought (CoT) reasoning over illegible (corrupted) reasoning traces.

This is a safety research artifact for studying whether models can be trained to produce reasoning that is harder for external observers to exploit, while maintaining task performance.

Model Description

  • Base model: Qwen/Qwen3-8B
  • Fine-tuning method: DPO with LoRA
  • Training platform: Together AI
  • Checkpoint: Step 1448 / 2412 (epoch 0.60)
  • Purpose: Safety research -- studying CoT legibility and information leakage in reasoning traces

Intended Use

This adapter is a research artifact for studying chain-of-thought legibility. It is intended for:

  • Evaluating whether DPO can reduce how much information external "reader" models can extract from reasoning traces
  • Comparing reader accuracy on CoTs from this model vs. baseline generators
  • Measuring the trade-off between CoT illegibility and task performance

Not intended for: Production deployment, general-purpose chat, or any application where reasoning transparency is important.

Training Details

Training Data

Approximately 20,300 DPO pairs (19,289 training and 1,015 validation), constructed from chain-of-thought traces of three reasoning models (DeepSeek-R1-Distill-Qwen-32B, DeepSeek-R1-0528, QwQ-32B) on three benchmarks:

  • GPQA Diamond -- graduate-level science questions (multiple choice)
  • MATH (Levels 3-5) -- competition mathematics (open-ended)
  • BIG-Bench Hard -- diverse reasoning tasks (6 subtasks, multiple choice)

Each pair consists of:

  • Preferred output: The original, coherent CoT wrapped in <think> tags, followed by the correct answer
  • Non-preferred output: A corrupted version of the same CoT (same correct answer), made illegible via one of 9 off-policy transforms
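
The pair structure described above can be sketched roughly as follows. This is a minimal illustration, not the authors' data pipeline: the field names `prompt`, `chosen`, and `rejected`, the helper names, and the exact whitespace around the <think> tags are all assumptions.

```python
def format_output(cot: str, answer: str) -> str:
    """Wrap a reasoning trace in <think> tags and append the final answer."""
    return f"<think>\n{cot.strip()}\n</think>\n\n{answer.strip()}"


def build_pair(prompt: str, original_cot: str, corrupted_cot: str, answer: str) -> dict:
    """Build one preference record (hypothetical schema).

    Both sides end with the same correct answer; only the reasoning differs.
    """
    return {
        "prompt": prompt,
        "chosen": format_output(original_cot, answer),     # coherent CoT
        "rejected": format_output(corrupted_cot, answer),  # corrupted CoT
    }


pair = build_pair(
    prompt="What is 2 + 3?",
    original_cot="Start with 2. Add 3 to get 5.",
    corrupted_cot="Add 3 to get 5. Start with 2.",
    answer="\\boxed{5}",
)
```

Because the answer is identical on both sides, any preference signal in such a pair comes from the reasoning trace alone.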

Off-Policy Transforms

| Transform | Description | Signal Strength |
| --- | --- | --- |
| Cross-Substitution | Replace entire CoT with reasoning from a different problem | Strong |
| Donor Splice (50%) | Keep first half, splice second half from a different problem | Moderate-Strong |
| Donor Splice (25%) | Keep first quarter, splice remainder from a different problem | Strong |
| Sentence Shuffle | Randomly permute all sentences | Moderate |
| Adjacent Swap | Swap adjacent sentence pairs | Moderate |
| Sentence Deletion (30%) | Randomly drop 30% of sentences | Moderate |
| Distractor Injection | Interleave sentences from a different problem's CoT | Moderate-Strong |
| Number Substitution | Replace numbers with values from a different problem | Weak-Moderate |
| Sentence Reversal | Reverse the order of all sentences | Moderate |
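
To make a few of these transforms concrete, here is a minimal sketch of the sentence-level ones. It is an illustration under stated assumptions, not the code used to build the dataset: the regex sentence splitter, the function names, and the proportional cut point in `donor_splice` are all hypothetical choices.

```python
import random
import re


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace (assumption)."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def sentence_shuffle(cot, rng):
    """Randomly permute all sentences."""
    sents = split_sentences(cot)
    rng.shuffle(sents)
    return " ".join(sents)


def adjacent_swap(cot):
    """Swap sentences pairwise: (0,1), (2,3), ...; a trailing odd sentence stays put."""
    sents = split_sentences(cot)
    for i in range(0, len(sents) - 1, 2):
        sents[i], sents[i + 1] = sents[i + 1], sents[i]
    return " ".join(sents)


def sentence_reversal(cot):
    """Reverse the order of all sentences."""
    return " ".join(reversed(split_sentences(cot)))


def sentence_deletion(cot, rng, frac=0.3):
    """Drop roughly `frac` of the sentences, chosen at random."""
    sents = split_sentences(cot)
    drop = set(rng.sample(range(len(sents)), round(frac * len(sents))))
    return " ".join(s for i, s in enumerate(sents) if i not in drop)


def donor_splice(cot, donor_cot, keep_frac):
    """Keep the first `keep_frac` of the CoT, then continue with the donor's remainder."""
    sents, donor = split_sentences(cot), split_sentences(donor_cot)
    return " ".join(sents[: int(len(sents) * keep_frac)] + donor[int(len(donor) * keep_frac):])


cot = "Let x be 2. Then x squared is 4. Add 1 to get 5. So the answer is 5."
donor = "The triangle has sides 3 and 4. Use Pythagoras. The hypotenuse is 5. Done."
rng = random.Random(42)  # seeded for reproducibility
reversed_cot = sentence_reversal(cot)
spliced_cot = donor_splice(cot, donor, keep_frac=0.5)
```

Every transform here leaves the final answer untouched, since it operates only on the reasoning text that would sit inside the <think> tags.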

Training Hyperparameters

| Parameter | Value |
| --- | --- |
| Training method | DPO |
| DPO beta | 0.3 |
| RPO alpha | 0.5 |
| Normalize log ratios | true |
| Learning rate | 5e-6 |
| LR scheduler | Cosine (0.5 cycles) |
| Warmup ratio | 0.05 |
| Batch size | 8 |
| Epochs | 1 (checkpoint at 0.60 epochs) |
| Max gradient norm | 1.0 |
| Weight decay | 0.0 |
| Seed | 42 |
| Max sequence length | 40,960 tokens |
| Sequence packing | false |
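
For intuition about where this checkpoint sits on the learning-rate curve, the sketch below implements the common linear-warmup plus cosine-decay schedule with the values listed above. It assumes the scheduler follows the usual formulation (the same shape as the cosine-with-warmup schedule in Hugging Face Transformers); Together AI's exact implementation is not documented here and may differ.

```python
import math


def cosine_lr(step, total_steps=2412, base_lr=5e-6, warmup_ratio=0.05, num_cycles=0.5):
    """Linear warmup followed by cosine decay (assumed formulation)."""
    warmup_steps = int(warmup_ratio * total_steps)
    if step < warmup_steps:
        # Linear ramp from 0 up to base_lr.
        return base_lr * step / max(1, warmup_steps)
    # Fraction of the post-warmup schedule that has elapsed, in [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    # With num_cycles=0.5 the cosine goes from 1 down to 0 exactly once.
    return base_lr * max(0.0, 0.5 * (1.0 + math.cos(math.pi * 2.0 * num_cycles * progress)))


lr_at_checkpoint = cosine_lr(1448)
```

Under these assumptions the step-1448 checkpoint was saved after the learning rate had already decayed to below half of its peak value.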

LoRA Configuration

| Parameter | Value |
| --- | --- |
| LoRA rank (r) | 16 |
| LoRA alpha | 16 |
| LoRA dropout | 0.0 |
| Target modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
| Trainable modules | all-linear |
| PEFT type | LoRA |
| Task type | CAUSAL_LM |
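
The same configuration can be written as keyword arguments using the parameter names that PEFT's `LoraConfig` accepts. This is a sketch for readers who want to reproduce a comparable setup locally; the actual run was configured through Together AI, so the mapping onto PEFT is an assumption.

```python
# Keys follow peft.LoraConfig argument names; values come from the table above.
lora_kwargs = {
    "r": 16,
    "lora_alpha": 16,
    "lora_dropout": 0.0,
    "target_modules": [
        "q_proj", "k_proj", "v_proj", "o_proj",  # attention projections
        "gate_proj", "up_proj", "down_proj",     # MLP projections
    ],
    "task_type": "CAUSAL_LM",
}

# Standard LoRA scales the low-rank update by alpha / r.
scaling = lora_kwargs["lora_alpha"] / lora_kwargs["r"]
```

With alpha equal to the rank, the scaling factor is 1.0, so the low-rank update is added to the frozen weights without extra amplification. If PEFT is installed, `LoraConfig(**lora_kwargs)` should build the equivalent config object.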

Training Metrics

The DPO training loss fell from 0.691 at step 1 to 0.002 at step 1448, and reward accuracy reached 100%.

| Step | Train Loss | Eval Loss | Reward Margin | Reward Accuracy |
| --- | --- | --- | --- | --- |
| 1 | 0.6914 | -- | 0.00 | 0.0% |
| 484 | ~0.02 | 0.0151 | 8.60 | 99.9% |
| 966 | ~0.007 | 0.0070 | 13.06 | 100.0% |
| 1448 | 0.0021 | 0.0056 | 14.20 | 100.0% |
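
As a sanity check on the step-1 loss, the sketch below computes the standard DPO loss, -log sigmoid(beta * (chosen log-ratio - rejected log-ratio)). With a zero margin it equals ln 2, about 0.693, which is close to the 0.6914 reported at step 1. This is the textbook formula only: the run also used RPO alpha 0.5 and normalized log ratios, so later values in the table should not be expected to match it.

```python
import math


def dpo_loss(chosen_logratio, rejected_logratio, beta=0.3):
    """Textbook DPO loss for a single pair.

    Each log-ratio is log pi_theta(y|x) - log pi_ref(y|x) for that response.
    """
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(margin)) == softplus(-margin), written in a numerically stable form.
    return math.log1p(math.exp(-abs(margin))) + max(-margin, 0.0)


loss_at_zero_margin = dpo_loss(0.0, 0.0)  # policy identical to the reference
```

The loss decreases monotonically as the chosen response gains log-probability relative to the rejected one.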

W&B run: dipika-khullar/encoded-reasoning-rl/dro0ufu7

Training Notes

  • 6 out of 19,289 training samples were truncated to the 40,960 token max sequence length (0.03%)
  • 2 out of 1,015 validation samples were truncated (0.20%)
  • This checkpoint is an intermediate adapter at step 1448 of a 2412-step run (60% through training)
  • Base model parameter count: 8.19B
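
The figures in these notes are mutually consistent, as the short check below shows. It assumes one optimizer step per batch of 8 with the final partial batch kept, which is an inference from the numbers rather than a documented detail.

```python
import math

train_samples, val_samples = 19_289, 1_015
batch_size, checkpoint_step = 8, 1448

total_pairs = train_samples + val_samples                # matches "~20,300 DPO pairs"
steps_per_epoch = math.ceil(train_samples / batch_size)  # matches the 2412-step run
epoch_fraction = checkpoint_step / steps_per_epoch       # matches "epoch 0.60"

train_truncated_pct = 100 * 6 / train_samples            # 6 truncated training samples
val_truncated_pct = 100 * 2 / val_samples                # 2 truncated validation samples
```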

How to Use

With PEFT

from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the base model first, then attach the LoRA adapter on top of it.
base_model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-8B")
model = PeftModel.from_pretrained(base_model, "JackHopkinsGSL/Qwen3-8B-Illegible-DPO")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-8B")

messages = [
    {"role": "user", "content": "What is the integral of x^2 from 0 to 5?\n\nAfter your reasoning, state your final answer inside \\boxed{} notation."}
]
# Render the chat template to a prompt string ending at the assistant turn.
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
# Keep the input tensors on the same device as the model.
inputs = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=4096)
# The decoded string contains the prompt followed by the generated reasoning and answer.
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
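
For legibility experiments it is usually necessary to separate the reasoning trace from the final answer. The helper below is a minimal sketch that assumes the decoded text contains a single <think>...</think> block, as in the training data; `split_reasoning` is a hypothetical name, and whether the tags survive decoding depends on the tokenizer settings.

```python
import re


def split_reasoning(text):
    """Return (reasoning, answer) from text containing one <think>...</think> block.

    If no block is found, the reasoning is empty and the whole text is the answer.
    """
    match = re.search(r"<think>(.*?)</think>", text, flags=re.DOTALL)
    if match is None:
        return "", text.strip()
    return match.group(1).strip(), text[match.end():].strip()


sample = "<think>\nIntegrate x^2 to get x^3/3.\n</think>\n\n\\boxed{125/3}"
reasoning, answer = split_reasoning(sample)
```

The reasoning string is what a reader model would be shown; the answer string is what task accuracy is scored on.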

With Together AI (serverless)

Once the full training run completes, the model will be available on Together AI:

from together import Together

# The client reads the API key from the TOGETHER_API_KEY environment variable.
client = Together()

response = client.chat.completions.create(
    model="fellows_safety/Qwen3-8B-ft-dpo-legibility-offpolicy-8b-74cbfd61",
    messages=[{"role": "user", "content": "Your question here"}],
    max_tokens=4096,
)
print(response.choices[0].message.content)

Evaluation

Evaluation is ongoing. The key metrics of interest are:

  1. Task accuracy -- Does the DPO model still solve GPQA/MATH/BBH correctly? (Comparing against base Qwen3-8B)
  2. Reader accuracy -- Can external reader models extract the correct answer from this model's CoTs? (Lower = more illegible)
  3. Legibility classification -- How do CoTs classify under the ANSWER_LEAKED / REASONING_LEGIBLE / ILLEGIBLE taxonomy?
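
The first two metrics reduce to exact-match accuracy over extracted answers. The sketch below shows one way to compute them, assuming answers are given in \boxed{} notation as in the usage prompt above; the function names, the sample data, and the exact-match comparison are illustrative assumptions, not the evaluation harness used for this model.

```python
import re


def extract_boxed(text):
    # Return the contents of the last \boxed{...} in the text, or None.
    # Nested braces are not handled in this simple version.
    matches = re.findall(r"\\boxed\{([^{}]*)\}", text)
    return matches[-1].strip() if matches else None


def accuracy(predictions, gold):
    """Fraction of predictions that exactly match the gold answer."""
    correct = sum(p is not None and p == g for p, g in zip(predictions, gold))
    return correct / len(gold)


gold = ["125/3", "B"]

# Task accuracy: score the generator's own final answers.
generator_outputs = [
    "<think>Integrate x^2.</think>\n\n\\boxed{125/3}",
    "<think>Eliminate A and D.</think>\n\n\\boxed{C}",
]
task_accuracy = accuracy([extract_boxed(o) for o in generator_outputs], gold)

# Reader accuracy: score what a reader model guessed from the CoT alone
# (hypothetical guesses; None means the reader gave no answer).
reader_guesses = ["125/3", None]
reader_accuracy = accuracy(reader_guesses, gold)
```

A model that keeps task accuracy high while reader accuracy drops would be showing the trade-off this research aims to measure.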

Limitations

  • This is an intermediate checkpoint (step 1448/2412). The final checkpoint may have different properties.
  • The DPO reward accuracy reached 99.9% by step 484 and 100% by step 966, which may indicate that the corruptions were too easy to distinguish from the originals. More subtle transforms or harder negatives might be needed for stronger illegibility training.
  • The model was trained on CoTs from specific reasoning models (DeepSeek-R1-Distill-Qwen-32B, DeepSeek-R1-0528, and QwQ-32B), so illegibility effects may not transfer to CoTs generated by other model families.
  • This is a LoRA adapter -- it must be used with the Qwen/Qwen3-8B base model.

Citation

If you use this model in your research, please cite:

@misc{qwen3-8b-illegible-dpo-2026,
  title={Qwen3-8B-Illegible-DPO: Training Illegible Chain-of-Thought via DPO},
  author={Safety Research Fellows},
  year={2026},
  publisher={HuggingFace},
  url={https://huggingface.co/JackHopkinsGSL/Qwen3-8B-Illegible-DPO}
}

Framework Versions

  • PEFT: 0.15.1
  • Together AI SDK: v2
  • Training infrastructure: Together AI