Tenacious-Critic-Qwen3-1.7B-SimPO-seed11

A LoRA SimPO preference critic, trained for 35.9 minutes on a free Colab T4, and honestly underperforming on the held-out partition of the bench it was trained for.

Field	Value
Repo (planned)	`nahdes/tenacious-critic-qwen3-1-7b-simpo-seed11`
Backbone	`unsloth/Qwen3-1.7B`
Adapter	LoRA, rank 16, alpha 32, 7 modules (q/k/v/o + gate/up/down)
Training method	SimPO (Meng et al., 2024) — reference-free, length-normalized
Pairs	306 (148 v2-anchored + 158 mutator-augmented)
Compute	Colab T4 (Tesla T4, 16 GB), free tier
Wall clock	35.9 min
Trainable params	17.4 M / 1.74 B (1.00 %)
Precision	fp16
Seed	11
License	CC-BY-4.0 (adapter weights)
Companion dataset	`nahdes/tenacious-bench-v0.1`
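
The trainable-parameter figure can be sanity-checked from the LoRA rank and the projection shapes. The sketch below assumes the Qwen3-1.7B dimensions (hidden size 2048, key/value width 1024, MLP width 6144, 28 layers); these are not stated in this card and should be checked against the backbone's config.json. `lora_params` is an illustrative helper, not part of the repo.

```python
# Assumed Qwen3-1.7B shapes (not stated in this card; verify against config.json).
HIDDEN = 2048
KV_DIM = 1024        # assumed: 8 KV heads x head_dim 128
INTERMEDIATE = 6144
LAYERS = 28
RANK = 16            # from the card

# (in_features, out_features) for each of the 7 adapted projections.
MODULES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_params(rank, modules, layers):
    """Count LoRA weights: each adapted matrix adds A (rank x in) and B (out x rank)."""
    per_layer = sum(rank * (d_in + d_out) for d_in, d_out in modules.values())
    return per_layer * layers


total = lora_params(RANK, MODULES, LAYERS)
print(f"{total:,} trainable LoRA parameters ({total / 1e6:.1f} M)")
```

Under those assumed shapes the count rounds to the 17.4 M reported in the table.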

Headline numbers

Two results, both honest, in tension: the critic fits its in-distribution validation pairs but adds no measurable held-out lift.

Metric	Value	Setting
In-distribution validation accuracy	96.9 %	31 held-out preference pairs (same construction pipeline as training)
Reward margin (chosen − rejected)	+6.496	Same 31 val pairs, step 280
Held-out task-level lift over baseline	+0.0025, 95% CI [−0.019, +0.023], p=0.40	52 sealed Tenacious-Bench v0.1 tasks, paired bootstrap, B=10000

The critic learned the labeling rubric (96.9 % pair acc, +6.50 margin). It does not add task-level lift over a same-backbone single-shot baseline at 95 % CI. Use only as a research artifact.

Deploy recommendation: NO-DEPLOY for the held-out partition.

Why was this trained?

The Tenacious-Bench v0.1 project audits five Tenacious-only failure modes that τ²-Bench retail cannot probe: signal over-claiming, tone drift, ICP misclassification, AI-maturity-gated mispitch, and EAT↔EU/US scheduling. Week 10 trace evidence (probe IDs probe_20260423T191351 rows 9, 11, 21 in run_log.jsonl) characterized these as inconsistency failures: the agent gets it right most of the time but cannot tell when it is wrong. That diagnosis ruled out generation-quality (Path A) and trajectory (Path C) treatments and selected Path B, a preference-tuned critic deployed as a rejection-sampling layer.

Backbone substitution

chal.md named "Qwen 3.5" (0.8B / 2B / 4B band) as the eligible backbone. As of 2026-05-01, the Qwen 3.5 family had not been released, so the backbone was substituted with unsloth/Qwen3-1.7B, the closest current open-weight match within the spec's parameter band. All other hyperparameters from the original spec are preserved, apart from the epoch count (extended from 3 to 8; see Training history).

Hyperparameters (Meng 2024 §4.3 verbatim)

Param	Value
`loss_type`	`simpo`
`beta` (β)	2.0
`simpo_gamma` (γ)	1.0
`learning_rate`	5e-6
`lr_scheduler`	cosine
`warmup_ratio`	0.1
`epochs`	8 (extended from 3 — see training history)
`total_steps`	280
`effective_batch`	8 (per_device 2 × grad_accum 4)
`max_seq_len`	2048
`max_prompt_len`	1536
`lora_r` / `lora_alpha` / `lora_dropout`	16 / 32 / 0.0
`target_modules`	q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
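
For reference, the objective that β and γ parameterise can be sketched for a single preference pair. This is a minimal plain-Python illustration of the SimPO loss as described by Meng et al. (2024), not the training code used here; the function name `simpo_loss` and the list-of-token-log-probs input are assumptions made for the example.

```python
import math


def simpo_loss(chosen_logps, rejected_logps, beta=2.0, gamma=1.0):
    """SimPO loss for one pair, given per-token log-probs under the policy."""
    # Length-normalized implicit rewards: beta times the mean token log-prob.
    r_chosen = beta * sum(chosen_logps) / len(chosen_logps)
    r_rejected = beta * sum(rejected_logps) / len(rejected_logps)
    # gamma is the target margin the chosen reward must exceed the rejected one by.
    margin = r_chosen - r_rejected - gamma
    # -log(sigmoid(margin)), written as a numerically stable softplus(-margin).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Toy pair: chosen tokens average -1.0 nats, rejected tokens average -2.0 nats.
print(simpo_loss([-1.0, -1.0], [-2.0, -2.0]))
```

No reference model appears anywhere in the computation, which is what "reference-free" means in the summary table, and dividing by sequence length is the "length-normalized" part.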

Training data

306 preference pairs, all anchored to the train/ partition only (zero held-out leak):

148 v2 style-guide-anchored pairs: 12 GOOD + 12 BAD hand-labeled drafts (Tenacious_Style_Guide_v2.md) scored against every train task with scoring_evaluator.py. Per task: best GOOD = chosen, worst BAD = rejected, margin ≥ 0.30 (median 0.477).
158 mutator-augmented pairs: DeepSeek V3.2 (non-Qwen family) writes one strong-prompt chosen per train task; deterministic task-aware mutator corrupts that chosen along all 5 rubric dimensions (must_not tokens, off-segment fingerprint, pricing TCV, banned phrases, wrong-TZ, stacked asks) → rejected at margin ≥ 0.30 (median 0.613).
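
The per-task selection rule for the v2-anchored pairs (best GOOD as chosen, worst BAD as rejected, kept only at margin ≥ 0.30) can be sketched as below. The function name `select_pair` and the dict-of-scores input are hypothetical; the actual scores come from scoring_evaluator.py, whose interface is not shown in this card.

```python
def select_pair(good_scores, bad_scores, min_margin=0.30):
    """Pick one preference pair for a task, or None if the margin is too small.

    good_scores / bad_scores map a draft id to its rubric score on this task
    (hypothetical input shape, for illustration only).
    """
    chosen = max(good_scores, key=good_scores.get)    # best GOOD draft
    rejected = min(bad_scores, key=bad_scores.get)    # worst BAD draft
    margin = good_scores[chosen] - bad_scores[rejected]
    if margin < min_margin:
        return None                                   # task yields no usable pair
    return chosen, rejected, margin


pair = select_pair({"good_01": 0.90, "good_02": 0.70},
                   {"bad_01": 0.50, "bad_02": 0.30})
print(pair)
```

Tasks whose best GOOD and worst BAD drafts score too close together simply contribute no pair, which is one plausible reason the pair count can be lower than the task count times one.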

Discarded path: organic strong-vs-weak DeepSeek synthesis. Reason: DeepSeek's safety/quality training sanitises the "weak SDR" prompt — its rejected outputs scored 0.625 median, not bad enough for clean margin separation. The mutator was the working solution. Documented in training_data/build_log.json.

Leakage policy (Li et al., 2025 four-vector audit)

Vector	Status
Generator/judge family overlap	Controlled — labeling judge is rule-based (`scoring_evaluator.py`), no LLM
Generator/rewriter base sharing	Controlled — chosens authored by human + DeepSeek V3.2; zero Qwen-family pairs
Train/test author overlap (same authors, different tasks)	Controlled — held_out partition uses same authoring modes but different task instances; SHA-256 manifest sealed
Held-out distribution drift	Partial control — 8-gram Jaccard < 0.30 + embedding cosine < 0.85; 0 violations
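
The n-gram half of the drift check in the last row can be illustrated with a small Jaccard computation. This sketch assumes word-level 8-grams on lowercased, whitespace-split text; the card does not say whether the real audit uses word or character n-grams, and the embedding-cosine half is omitted. `ngrams`, `jaccard`, and `violates` are illustrative names.

```python
def ngrams(text, n=8):
    """Set of word n-grams (assumption: whitespace tokens, lowercased)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard(a, b, n=8):
    """Jaccard similarity of the two texts' n-gram sets."""
    ga, gb = ngrams(a, n), ngrams(b, n)
    if not ga and not gb:
        return 0.0  # both texts shorter than n tokens: nothing to compare
    return len(ga & gb) / len(ga | gb)


def violates(train_text, held_out_text, threshold=0.30):
    """Flag a train/held-out pair whose overlap reaches the audit threshold."""
    return jaccard(train_text, held_out_text) >= threshold
```

A full audit would run `violates` over every train/held-out pair and report the count, which the table gives as 0.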

Training history

Loss curve (full curve in training/training_run_seed11.json):

Step	Train loss	Val loss	Reward margin	Reward acc
20	1.196	1.481	0.034	0.542
105 (3 ep)	weak	—	−0.06	0.542
200	0.060	0.220	6.170	0.969
220	0.008	0.197	6.349	0.969
280 (final)	0.107	0.187	6.496	0.969

The initial 3-epoch run produced a weak critic (margin −0.06, 54 % acc). Training was resumed for 5 more epochs because val loss was decreasing monotonically. Train loss bottomed out at step 220 (0.008) and rose to 0.107 by step 280, while val loss kept improving (0.197 → 0.187), so there is no overfit signature. Step 280 was selected.
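
The step counts in the table follow from the batch settings if the 31 validation pairs are held out of the 306, leaving 275 training pairs. That split is an inference from the reported numbers, not something this card states, and `optimizer_steps` is an illustrative helper.

```python
import math


def optimizer_steps(n_train, per_device=2, grad_accum=4, epochs=8):
    """Optimizer steps for a single-GPU run, rounding partial batches up."""
    effective_batch = per_device * grad_accum
    return math.ceil(n_train / effective_batch) * epochs


N_TRAIN = 306 - 31  # assumption: val pairs are carved out of the 306

print(optimizer_steps(N_TRAIN, epochs=3))  # initial run
print(optimizer_steps(N_TRAIN, epochs=8))  # after the 5-epoch extension
```

With 275 training pairs this gives 35 steps per epoch, matching both the 105-step row for the 3-epoch run and the final step 280.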

Held-out evaluation (52 tasks)

Three conditions, all using unsloth/Qwen3-1.7B as the base generator:

Condition	Setup	Mean ± SD	Wall/task
baseline	adapter OFF, neutral system + task `style_guide_excerpts`, k=1, T=0	0.6618 ± 0.20	19.4 s
critic_rs	adapter OFF for gen (k=4 @ T=0.7), adapter ON for ranking via length-normalized log-prob × β=2.0; top-1 selected	0.6643 ± 0.21	23.5 s
promptaug	adapter OFF, condensed v2 style guide (~900 tok) in system prompt, k=1, T=0	0.6472 ± 0.20	20.1 s
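
The critic_rs selection step can be sketched as follows: each of the k candidates is scored by its length-normalized log-probability under the adapter-ON model, scaled by β, and the top-scoring candidate is kept. The names `critic_score` and `pick_top1`, and the assumption that per-token log-probs are already available, are illustrative; the actual logic lives in ablations/run_ablations.py.

```python
def critic_score(token_logps, beta=2.0):
    """Length-normalized log-prob of one candidate, scaled by beta."""
    return beta * sum(token_logps) / len(token_logps)


def pick_top1(candidates, beta=2.0):
    """Return the text of the highest-scoring candidate.

    candidates: list of (text, token_logps), with token_logps assumed to come
    from a forward pass of the adapter-ON model over that candidate.
    """
    best_text, _ = max(candidates, key=lambda c: critic_score(c[1], beta))
    return best_text


pool = [
    ("draft a", [-2.0, -2.0]),
    ("draft b", [-0.5, -1.5, -1.0]),
    ("draft c", [-3.0]),
]
print(pick_top1(pool))
```

Because β is a positive constant it does not change which candidate wins; only the relative ordering of the mean log-probs matters for top-1 selection.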

Paired-bootstrap deltas (B = 10 000, seed=11)

Comparison	Mean Δ	95 % CI	p (one-sided)	W / L / T
Δ A (critic_rs − baseline)	+0.0025	[−0.019, +0.023]	0.40	19 / 16 / 17
Δ B (critic_rs − promptaug)	+0.0171	[−0.006, +0.042]	0.08	22 / 17 / 13
Δ C (τ²-Bench retail)	informational only — reused from `week_10/seed/baseline_numbers.md` per chal.md (no re-run)	—	—	—
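
A paired bootstrap of the kind reported above can be sketched in a few lines. This version resamples per-task score differences with replacement, takes a percentile interval, and defines the one-sided p-value as the share of resampled means at or below zero. The percentile method and that p-value convention are assumptions; the exact procedure is in ablations/compute_deltas.py. `paired_bootstrap` is an illustrative name.

```python
import random


def paired_bootstrap(a, b, n_boot=10_000, seed=11):
    """Bootstrap the mean of per-task differences a[i] - b[i]."""
    deltas = [x - y for x, y in zip(a, b)]
    n = len(deltas)
    rng = random.Random(seed)
    means = []
    for _ in range(n_boot):
        # Resample tasks with replacement, keeping each pair together.
        sample = [deltas[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lo = means[round(0.025 * n_boot)]
    hi = means[round(0.975 * n_boot) - 1]
    # Assumed convention: one-sided p = share of resampled means <= 0.
    p = sum(m <= 0 for m in means) / n_boot
    wins = sum(d > 0 for d in deltas)
    losses = sum(d < 0 for d in deltas)
    return {"mean": sum(deltas) / n, "ci": (lo, hi), "p": p,
            "wlt": (wins, losses, n - wins - losses)}
```

Calling it with the 52 per-task critic_rs and baseline scores would yield the Δ A row; a task where both conditions score identically counts as a tie and contributes a zero difference to every resample that draws it.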

Per-stratum breakdown

Stratum	n	baseline	critic_rs	promptaug	Best
ADV (hand-authored adversarial)	9	0.657	0.658	0.711	promptaug (+5.4pp)
PRG (combinatorial programmatic)	22	0.703	0.732	0.687	critic_rs (+2.9pp)
SYN (multi-LLM synthesis)	10	0.831	0.802	0.763	baseline
TRC (trace-derived)	11	0.429	0.408	0.411	baseline (all weak)

The critic concentrates its lift on the PRG stratum (mechanical rubric features it was trained to recognise). Promptaug concentrates its lift on ADV (where the v2 rule list helps the base model handle edge cases). Both interventions hurt on SYN (well-formed cases the base handles cleanly already) and TRC (adversarial trace items defeat all three).
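
As a consistency check, the per-stratum means weighted by n should reproduce the overall means in the held-out table. The short script below does that arithmetic using only numbers from this card.

```python
# stratum: (n, baseline, critic_rs, promptaug), copied from the table above
STRATA = {
    "ADV": (9, 0.657, 0.658, 0.711),
    "PRG": (22, 0.703, 0.732, 0.687),
    "SYN": (10, 0.831, 0.802, 0.763),
    "TRC": (11, 0.429, 0.408, 0.411),
}
# Overall means from the held-out evaluation table
REPORTED = {"baseline": 0.6618, "critic_rs": 0.6643, "promptaug": 0.6472}

total_n = sum(row[0] for row in STRATA.values())
pooled = {}
for col, name in enumerate(REPORTED, start=1):
    # n-weighted average of the stratum means for this condition
    pooled[name] = sum(row[0] * row[col] for row in STRATA.values()) / total_n

for name, value in pooled.items():
    print(f"{name}: pooled {value:.4f} vs reported {REPORTED[name]:.4f}")
```

The pooled values agree with the reported means to within about 0.0002, which is what three-decimal rounding of the stratum means would allow.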

Cost-Pareto

Condition	Mean fwd passes	Cost ratio vs baseline	Quality lift abs
baseline	1.0	1.00×	0.000
critic_rs	5.0	1.21×	+0.0025
promptaug	1.0	1.04×	−0.0146

At 95 % CI, critic_rs is Pareto-dominated by baseline: it costs 1.21× the compute for a lift that is statistically indistinguishable from zero. promptaug is dominated outright, costing 1.04× for a lower mean score.
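
The cost ratios match the per-task wall-clock times from the held-out table, so the sketch below assumes that is how they were derived. `dominated_by_baseline` is an illustrative helper that works on point estimates only.

```python
# Wall-clock seconds per task and mean scores, copied from the held-out table
WALL_S = {"baseline": 19.4, "critic_rs": 23.5, "promptaug": 20.1}
MEAN = {"baseline": 0.6618, "critic_rs": 0.6643, "promptaug": 0.6472}


def dominated_by_baseline(name):
    """True if `name` costs at least as much and scores no better (point estimates)."""
    return (WALL_S[name] >= WALL_S["baseline"]
            and MEAN[name] <= MEAN["baseline"])


for name in WALL_S:
    ratio = WALL_S[name] / WALL_S["baseline"]
    lift = MEAN[name] - MEAN["baseline"]
    print(f"{name}: {ratio:.2f}x cost, lift {lift:+.4f}")
```

On point estimates alone only promptaug is dominated. critic_rs is dominated in the weaker sense used above: its lift interval includes zero, so the extra compute buys nothing that can be distinguished from noise.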

Why training transfer failed (the memo p2 finding)

Three hypotheses, ordered by trace-data support:

1. Candidate-pool variance bottleneck (strongest). 17 / 52 ties on Δ A. Same-generator k=4 candidates at T=0.7 produce outputs the offline rubric scores identically: rejection sampling cannot improve a homogeneous pool. The critic was trained on pairs with margin ≥ 0.30; it never saw a pool this clustered. Future small-N work should scale candidate-pool variance (temperature sweeps, model-family mixing) rather than scaling N alone.
2. In-distribution overfit (medium). 96.9 % val acc was on 31 pairs from the same v2-anchored + mutator-augmented distribution. Held-out bench candidates are a distribution shift the critic never trained on.
3. N=306 vs Meng 2024's ~5K studied band (real, but weakest signal here). Documented limitation; the SimPO paper does not characterise behaviour at this N.
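
The first hypothesis suggests a simple diagnostic: measure how much the rubric scores vary within each k=4 pool, and how much a perfect ranker could gain at most. The sketch below is one way to do that, assuming per-candidate rubric scores are logged per task; the function names and input shape are hypothetical and not taken from the repo.

```python
def pool_spread(scores):
    """Range of rubric scores within one task's candidate pool."""
    return max(scores) - min(scores)


def flat_pool_rate(pools, eps=1e-9):
    """Share of tasks whose candidates all get (nearly) the same rubric score."""
    flat = sum(pool_spread(p) <= eps for p in pools)
    return flat / len(pools)


def oracle_headroom(pools):
    """Mean gain of a perfect ranker over always taking the first candidate."""
    return sum(max(p) - p[0] for p in pools) / len(pools)


# Toy pools: one homogeneous, one with some spread.
pools = [[0.5, 0.5, 0.5, 0.5], [0.4, 0.6, 0.5, 0.4]]
print(flat_pool_rate(pools), oracle_headroom(pools))
```

If the oracle headroom on the real pools is close to zero, no critic of any quality could produce a lift through rejection sampling, which would separate hypothesis 1 from hypotheses 2 and 3.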

This is the chal.md memo p2 "1 unresolved training failure" finding, not a project failure.

Intended use

Research artifact demonstrating a leak-controlled SimPO pipeline at small N on a B2B-sales preference task.
Educational — code, data, traces, ablation results all reproduce from seed=11.
NOT for production B2B-sales generation. The held-out lift does not clear baseline at 95 % CI on the very bench it was trained for.

Out-of-scope

General preference scoring outside Tenacious B2B-sales register.
Non-English text. Bench is English-only.
Time-zones outside {Europe/Berlin, Europe/London, America/New_York, America/Los_Angeles, Africa/Addis_Ababa} — bench coverage stops there.
Scoring outputs longer than ~300 words (training pairs were short emails).

Limitations

Limitation	Detail
Held-out null lift	Δ A 95 % CI crosses zero; do not use as a deployment gate
Single seed	Only `seed=11` evaluated; no seed-stability bound
Single backbone	Only Qwen3-1.7B; transfer to other Qwen sizes / non-Qwen families unmeasured
Single judge	Eval-tier judge is Sonnet 4.6 only; no multi-judge bias decomposition
TRC weakness	Trace-derived adversarial tasks score 0.13–0.85 across all conditions — bench harder than critic
17/52 tie rate	Candidate-pool variance is the deployment-time bottleneck

Environmental cost

Training compute: 35.9 min × 1 × Tesla T4 ≈ 25 Wh.
Held-out evaluation: ~50 min × 1 × Tesla T4 ≈ 34 Wh.
Inference per critic_rs call: ~5 forward passes at fp16 on Qwen3-1.7B.
Total project compute: well under 1 kWh.
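
The energy figures imply an average board power rather than stating one. The arithmetic below backs it out from the numbers above; the comparison with the T4's 70 W rated board power is added context, and the card does not say how the draw was measured or estimated.

```python
def implied_watts(energy_wh, minutes):
    """Average power implied by an energy figure over a given duration."""
    return energy_wh / (minutes / 60.0)


T4_TDP_W = 70  # NVIDIA's rated board power for the Tesla T4

for label, wh, minutes in [("training", 25, 35.9), ("held-out eval", 34, 50)]:
    watts = implied_watts(wh, minutes)
    print(f"{label}: ~{watts:.0f} W average ({watts / T4_TDP_W:.0%} of TDP)")
```

Both figures correspond to roughly 41 to 42 W, so the estimates appear to assume the GPU ran at about 60 % of its rated power.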

Reproducibility

# Reproduce training (Colab T4)
jupyter notebook tenacious_critic_simpo_seed11.ipynb

# Reproduce held-out ablations (Colab T4)
python ablations/run_ablations.py --condition all --out ablations/held_out_traces.jsonl
python ablations/compute_deltas.py --traces ablations/held_out_traces.jsonl --out ablations/ablation_results.json

seed=11 everywhere. TENACIOUS_BENCH_OFFLINE=1 reproduces the labeling pipeline without API calls.

Citation

@misc{nahom2026tenacious_critic,
  author       = {Nahom, A.},
  title        = {Tenacious-Critic-Qwen3-1.7B-SimPO-seed11},
  year         = {2026},
  note         = {LoRA SimPO preference critic, 10academy TRP1 Week 11 final submission},
  howpublished = {HuggingFace Hub, \url{https://huggingface.co/nahdes/tenacious-critic-qwen3-1-7b-simpo-seed11}}
}

References:

Meng, Y., Xia, M., Chen, D. (2024). SimPO: Simple Preference Optimization with a Reference-Free Reward. NeurIPS 2024.
Li et al. (2025). Preference Leakage: A Contamination Problem in LLM-as-a-Judge.
See synthesis_memos/ for opinionated commentary on each.
