Qwen3-8B-Illegible-DPO (LoRA Adapter)

A LoRA adapter fine-tuned on Qwen/Qwen3-8B using Direct Preference Optimization (DPO) to train the model to prefer legible chain-of-thought (CoT) reasoning over illegible (corrupted) reasoning traces.

This is a safety research artifact for studying whether models can be trained to produce reasoning that is harder for external observers to exploit, while maintaining task performance.

Model Description

Base model: Qwen/Qwen3-8B
Fine-tuning method: DPO with LoRA
Training platform: Together AI
Checkpoint: Step 1448 / 2412 (epoch 0.60)
Purpose: Safety research -- studying CoT legibility and information leakage in reasoning traces

Intended Use

This adapter is a research artifact for studying chain-of-thought legibility. It is intended for:

Evaluating whether DPO can reduce how much information external "reader" models can extract from reasoning traces
Comparing reader accuracy on CoTs from this model vs. baseline generators
Measuring the trade-off between CoT illegibility and task performance

Not intended for: Production deployment, general-purpose chat, or any application where reasoning transparency is important.

Training Details

Training Data

~20,300 DPO pairs constructed from chain-of-thought traces of three reasoning models (DeepSeek-R1-Distill-Qwen-32B, DeepSeek-R1-0528, QwQ-32B) on three benchmarks:

GPQA Diamond -- graduate-level science questions (multiple choice)
MATH (Levels 3-5) -- competition mathematics (open-ended)
BIG-Bench Hard -- diverse reasoning tasks (6 subtasks, multiple choice)

Each pair consists of:

Preferred output: The original, coherent CoT wrapped in <think> tags, followed by the correct answer
Non-preferred output: A corrupted version of the same CoT (same correct answer), made illegible via one of 9 off-policy transforms
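
The pair layout above can be sketched as follows. This is a minimal illustration, not the actual data pipeline: the function name and the record keys (prompt, chosen, rejected) are hypothetical, and the exact whitespace around the <think> tags is an assumption.

```python
def build_dpo_pair(question, original_cot, corrupted_cot, answer):
    """Build one preference record: same prompt and answer, different CoT."""
    def render(cot):
        # The CoT sits inside <think> tags and is followed by the final answer.
        return f"<think>\n{cot}\n</think>\n\n{answer}"

    return {
        "prompt": question,                 # hypothetical key name
        "chosen": render(original_cot),     # preferred: original, coherent CoT
        "rejected": render(corrupted_cot),  # non-preferred: corrupted CoT
    }


pair = build_dpo_pair(
    question="What is 2 + 3?",
    original_cot="Add 2 and 3. The sum is 5.",
    corrupted_cot="The sum is 5. Add 2 and 3.",
    answer="\\boxed{5}",
)
print(pair["chosen"])
print(pair["rejected"])
```

Both outputs end in the same correct answer, so the only difference the preference signal can pick up is the reasoning trace itself.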

Off-Policy Transforms

Transform	Description	Signal Strength
Cross-Substitution	Replace entire CoT with reasoning from a different problem	Strong
Donor Splice (50%)	Keep first half, splice second half from a different problem	Moderate-Strong
Donor Splice (25%)	Keep first quarter, splice remainder from a different problem	Strong
Sentence Shuffle	Randomly permute all sentences	Moderate
Adjacent Swap	Swap adjacent sentence pairs	Moderate
Sentence Deletion (30%)	Randomly drop 30% of sentences	Moderate
Distractor Injection	Interleave sentences from a different problem's CoT	Moderate-Strong
Number Substitution	Replace numbers with values from a different problem	Weak-Moderate
Sentence Reversal	Reverse the order of all sentences	Moderate
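
A rough sketch of several of the sentence-level transforms in the table is given below. The sentence splitter is deliberately naive, the function names are hypothetical, and details such as which part of the donor CoT is spliced in are assumptions; the transforms actually used to build the dataset may differ.

```python
import random
import re


def split_sentences(cot):
    # Naive splitter: break after ., ! or ? when followed by whitespace.
    return [s for s in re.split(r"(?<=[.!?])\s+", cot.strip()) if s]


def sentence_shuffle(cot, rng):
    # Randomly permute all sentences.
    sents = split_sentences(cot)
    rng.shuffle(sents)
    return " ".join(sents)


def adjacent_swap(cot):
    # Swap sentences pairwise: (0,1), (2,3), ...; a trailing odd sentence stays put.
    sents = split_sentences(cot)
    for i in range(0, len(sents) - 1, 2):
        sents[i], sents[i + 1] = sents[i + 1], sents[i]
    return " ".join(sents)


def sentence_deletion(cot, rng, frac=0.3):
    # Drop a random subset containing `frac` of the sentences (rounded down).
    sents = split_sentences(cot)
    drop = set(rng.sample(range(len(sents)), int(len(sents) * frac)))
    return " ".join(s for i, s in enumerate(sents) if i not in drop)


def sentence_reversal(cot):
    # Reverse the order of all sentences.
    return " ".join(reversed(split_sentences(cot)))


def donor_splice(cot, donor_cot, keep_frac=0.5):
    # Keep the first `keep_frac` of the original, then continue with the donor's
    # tail from the same relative position (assumption: the card does not say
    # which part of the donor is used).
    sents, donor = split_sentences(cot), split_sentences(donor_cot)
    k = int(len(sents) * keep_frac)
    d = int(len(donor) * keep_frac)
    return " ".join(sents[:k] + donor[d:])


rng = random.Random(42)
cot = "A. B. C. D."
print(adjacent_swap(cot))
print(sentence_reversal(cot))
print(donor_splice(cot, "W. X. Y. Z.", keep_frac=0.25))
print(sentence_shuffle(cot, rng))
```

Passing an explicit random.Random instance keeps the corruptions reproducible for a given seed.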

Training Hyperparameters

Parameter	Value
Training method	DPO
DPO beta	0.3
RPO alpha	0.5
Normalize log ratios	true
Learning rate	5e-6
LR scheduler	Cosine (0.5 cycles)
Warmup ratio	0.05
Batch size	8
Epochs	1 (checkpoint at 0.60 epochs)
Max gradient norm	1.0
Weight decay	0.0
Seed	42
Max sequence length	40,960 tokens
Sequence packing	false
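
To make the scheduler settings concrete, the sketch below evaluates a cosine schedule with linear warmup using the values from the table. It mirrors the formula used by Hugging Face's get_cosine_schedule_with_warmup; whether Together AI's scheduler matches it exactly, and how the warmup step count is rounded, are assumptions.

```python
import math


def cosine_lr(step, total_steps=2412, peak_lr=5e-6, warmup_ratio=0.05, num_cycles=0.5):
    """Learning rate at `step` for linear warmup followed by cosine decay."""
    warmup_steps = int(total_steps * warmup_ratio)  # assumption: rounded down
    if step < warmup_steps:
        # Linear ramp from 0 to the peak learning rate.
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    # With num_cycles=0.5 this is half a cosine wave: peak at progress 0, zero at 1.
    return peak_lr * max(0.0, 0.5 * (1.0 + math.cos(math.pi * 2.0 * num_cycles * progress)))


for step in (0, 120, 1448, 2412):
    print(step, cosine_lr(step))
```

Under these assumptions the learning rate at the step-1448 checkpoint has already decayed well below its 5e-6 peak but has not yet reached zero.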

LoRA Configuration

Parameter	Value
LoRA rank (r)	16
LoRA alpha	16
LoRA dropout	0.0
Target modules	q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
Trainable modules	all-linear
PEFT type	LoRA
Task type	CAUSAL_LM
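
As a back-of-the-envelope check on adapter size, the sketch below counts the LoRA parameters implied by this configuration. The Qwen3-8B layer dimensions are assumptions that are not stated in this card; verify them against the base model's config.json before relying on the result.

```python
# Assumed Qwen3-8B dimensions (not stated in this card).
HIDDEN = 4096
INTERMEDIATE = 12288
NUM_LAYERS = 36
Q_OUT = 32 * 128   # attention heads * head_dim
KV_OUT = 8 * 128   # key/value heads * head_dim

# Values from the LoRA configuration table.
RANK = 16
ALPHA = 16

# (in_features, out_features) of each targeted projection in one decoder layer.
SHAPES = {
    "q_proj": (HIDDEN, Q_OUT),
    "k_proj": (HIDDEN, KV_OUT),
    "v_proj": (HIDDEN, KV_OUT),
    "o_proj": (Q_OUT, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_params(in_features, out_features, rank):
    # LoRA adds A (rank x in_features) and B (out_features x rank) per module.
    return rank * (in_features + out_features)


per_layer = sum(lora_params(i, o, RANK) for i, o in SHAPES.values())
total = per_layer * NUM_LAYERS
scaling = ALPHA / RANK  # factor applied to the low-rank update B @ A

print(f"LoRA parameters per layer: {per_layer:,}")
print(f"LoRA parameters in total:  {total:,}")
print(f"LoRA scaling (alpha / r):  {scaling}")
```

With alpha equal to the rank, the scaling factor is 1.0, so the low-rank update is added to the frozen weights without rescaling.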

Training Metrics

The DPO training loss fell from 0.691 at step 1 to 0.002 at step 1448, and reward accuracy reached 100% by step 966.

Step	Train Loss	Eval Loss	Reward Margin	Reward Accuracy
1	0.6914	--	0.00	0.0%
484	~0.02	0.0151	8.60	99.9%
966	~0.007	0.0070	13.06	100.0%
1448	0.0021	0.0056	14.20	100.0%
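
For intuition about how loss and reward margin relate, here is a sketch of the standard DPO objective for a single pair with beta = 0.3. It omits the RPO term and the log-ratio normalization listed in the hyperparameters, so it will not reproduce the logged values exactly; the function and argument names are hypothetical.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.3):
    """Return (loss, reward_margin) for one preference pair under plain DPO."""
    # Implicit rewards: scaled log-probability ratio of policy vs. reference.
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = chosen_reward - rejected_reward
    # loss = -log(sigmoid(margin)), written in a numerically stable form.
    loss = max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
    return loss, margin


# Before any update the policy equals the reference, so the margin is zero
# and the loss equals ln 2.
loss0, margin0 = dpo_loss(-12.0, -12.0, -12.0, -12.0)
print(loss0, margin0)

# After training, the chosen sequence gains and the rejected one loses
# log-probability relative to the reference (illustrative numbers only).
loss1, margin1 = dpo_loss(-10.0, -30.0, -12.0, -12.0)
print(loss1, margin1)
```

At zero margin the plain DPO loss is ln 2, about 0.693, which is close to the step-1 loss in the table; as the margin grows, the loss decays roughly exponentially toward zero.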

W&B run: dipika-khullar/encoded-reasoning-rl/dro0ufu7

Training Notes

6 out of 19,289 training samples were truncated to the 40,960-token maximum sequence length (0.03%)
2 out of 1,015 validation samples were truncated (0.20%)
This checkpoint is an intermediate adapter at step 1448 of a 2412-step run (60% through training)
Base model parameter count: 8.19B

How to Use

With PEFT

from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the base model and tokenizer, then attach the LoRA adapter on top.
base_model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-8B")
model = PeftModel.from_pretrained(base_model, "JackHopkinsGSL/Qwen3-8B-Illegible-DPO")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-8B")

messages = [
    {"role": "user", "content": "What is the integral of x^2 from 0 to 5?\n\nAfter your reasoning, state your final answer inside \\boxed{} notation."}
]
# Render the chat template to a prompt string that ends at the assistant turn.
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
# Keep the inputs on the same device as the model.
inputs = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=4096)
# The decoded string contains the prompt followed by the generated completion.
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

With Together AI (serverless)

Once the full training run completes, the model will be available on Together AI:

from together import Together

# The client reads the API key from the TOGETHER_API_KEY environment variable.
client = Together()

response = client.chat.completions.create(
    model="fellows_safety/Qwen3-8B-ft-dpo-legibility-offpolicy-8b-74cbfd61",
    messages=[{"role": "user", "content": "Your question here"}],
    max_tokens=4096,
)
print(response.choices[0].message.content)

Evaluation

Evaluation is ongoing. The key metrics of interest are:

Task accuracy -- Does the DPO model still solve GPQA/MATH/BBH correctly? (Comparing against base Qwen3-8B)
Reader accuracy -- Can external reader models extract the correct answer from this model's CoTs? (Lower = more illegible)
Legibility classification -- How do CoTs classify under the ANSWER_LEAKED / REASONING_LEGIBLE / ILLEGIBLE taxonomy?
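
Since evaluation is still in progress, the following is only a sketch of how the first two metrics could be computed, not the project's evaluation harness. It assumes completions follow the <think>...</think> plus \boxed{} format used in training; all function names are hypothetical, and answers are compared by exact string match.

```python
import re


def split_output(text):
    """Split a completion into (cot, final_text) around the <think> block."""
    match = re.search(r"<think>(.*?)</think>", text, flags=re.DOTALL)
    if match is None:
        return "", text.strip()
    return match.group(1).strip(), text[match.end():].strip()


def extract_boxed(text):
    # Return the contents of the last \boxed{...}, allowing nested braces.
    marker = "\\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return None
    depth, chars = 1, []
    for ch in text[start + len(marker):]:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(chars).strip()
        chars.append(ch)
    return None  # unbalanced braces


def accuracy(predictions, references):
    # Exact-match accuracy; a missing prediction (None) counts as wrong.
    correct = sum(p is not None and p == r for p, r in zip(predictions, references))
    return correct / len(references)


completion = "<think>\nx^3/3 from 0 to 5 is 125/3.\n</think>\n\nThe answer is \\boxed{\\frac{125}{3}}."
cot, final_text = split_output(completion)
prediction = extract_boxed(final_text)
print(cot)
print(prediction)
print(accuracy([prediction, None], ["\\frac{125}{3}", "B"]))
```

Task accuracy would apply accuracy to answers extracted from the generator's own completions, while reader accuracy would apply it to answers that a separate reader model produces when shown only the cot string.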

Limitations

This is an intermediate checkpoint (step 1448/2412). The final checkpoint may have different properties.
The DPO reward accuracy reached 99.9% by step 484 and 100% by step 966 (of 2412), which may indicate the corruptions were too easy to distinguish from originals. More subtle transforms or harder negatives might be needed for stronger illegibility training.
The model was trained on CoTs from specific reasoning models (DeepSeek-R1-Distill-Qwen-32B, DeepSeek-R1-0528, QwQ-32B), so illegibility effects may not transfer to CoTs generated by other model families.
This is a LoRA adapter -- it must be used with the Qwen/Qwen3-8B base model.

Citation

If you use this model in your research, please cite:

@misc{qwen3-8b-illegible-dpo-2026,
  title={Qwen3-8B-Illegible-DPO: Training Illegible Chain-of-Thought via DPO},
  author={Safety Research Fellows},
  year={2026},
  publisher={HuggingFace},
  url={https://huggingface.co/JackHopkinsGSL/Qwen3-8B-Illegible-DPO}
}

Framework Versions

PEFT: 0.15.1
Together AI SDK: v2
Training infrastructure: Together AI
