Tenacious-Critic-Qwen3-1.7B-SimPO-seed11

A LoRA SimPO preference critic, trained for 35.9 minutes on a free Colab T4, and honestly underperforming on the held-out partition of the bench it was trained for.

| Field | Value |
|---|---|
| Repo (planned) | nahdes/tenacious-critic-qwen3-1-7b-simpo-seed11 |
| Backbone | unsloth/Qwen3-1.7B |
| Adapter | LoRA, rank 16, alpha 32, 7 modules (q/k/v/o + gate/up/down) |
| Training method | SimPO (Meng et al., 2024): reference-free, length-normalized |
| Pairs | 306 (148 v2-anchored + 158 mutator-augmented) |
| Compute | Colab T4 (Tesla T4, 16 GB), free tier |
| Wall clock | 35.9 min |
| Trainable params | 17.4 M / 1.74 B (1.00 %) |
| Precision | fp16 |
| Seed | 11 |
| License | CC-BY-4.0 (adapter weights) |
| Companion dataset | nahdes/tenacious-bench-v0.1 |
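
The trainable-parameter figure can be sanity-checked by hand. The sketch below counts LoRA parameters for rank 16 across the 7 adapted modules. The layer dimensions (hidden size 2048, MLP size 6144, 28 layers, 16 query heads and 8 key/value heads of dimension 128) are assumptions about the Qwen3-1.7B config, not values stated in this card; check them against the backbone's config.json.

```python
# Assumed Qwen3-1.7B dimensions (NOT stated in this card; verify in config.json).
HIDDEN = 2048
INTERMEDIATE = 6144
N_LAYERS = 28
Q_OUT = 16 * 128   # 16 query heads x head_dim 128 (assumed)
KV_OUT = 8 * 128   # 8 key/value heads x head_dim 128 (assumed)
RANK = 16          # lora_r from this card

# (in_features, out_features) for each of the 7 adapted modules in one layer.
MODULE_SHAPES = {
    "q_proj": (HIDDEN, Q_OUT),
    "k_proj": (HIDDEN, KV_OUT),
    "v_proj": (HIDDEN, KV_OUT),
    "o_proj": (Q_OUT, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_param_count(shapes, rank, n_layers):
    """LoRA adds A (rank x in) and B (out x rank) per module: rank * (in + out)."""
    per_layer = sum(rank * (fan_in + fan_out) for fan_in, fan_out in shapes.values())
    return per_layer * n_layers


total = lora_param_count(MODULE_SHAPES, RANK, N_LAYERS)
print(f"{total:,} trainable LoRA parameters (~{total / 1e6:.1f} M)")
```

Under those assumed dimensions the count comes to roughly 17.4 M, which agrees with the figure in the table.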

Headline numbers

Two numbers, both honest, in tension:

| Metric | Value | Setting |
|---|---|---|
| In-distribution validation accuracy | 96.9 % | 31 held-out preference pairs (same construction pipeline as training) |
| Reward margin (chosen − rejected) | +6.496 | Same 31 val pairs, step 280 |
| Held-out task-level lift over baseline | +0.0025, 95 % CI [−0.019, +0.023], p = 0.40 | 52 sealed Tenacious-Bench v0.1 tasks, paired bootstrap, B = 10 000 |

The critic learned the labeling rubric (96.9 % pair acc, +6.50 margin). It does not add task-level lift over a same-backbone single-shot baseline at 95 % CI. Use only as a research artifact.

Deploy recommendation: NO-DEPLOY for the held-out partition.


Why was this trained?

The Tenacious-Bench v0.1 project audits five Tenacious-only failure modes that τ²-Bench retail cannot probe (signal over-claiming, tone drift, ICP misclassification, AI-maturity-gated mispitch, EAT↔EU/US scheduling). Week 10 trace evidence (probe IDs probe_20260423T191351 rows 9, 11, 21 in run_log.jsonl) characterized these as inconsistency failures: the agent gets it right most of the time but cannot tell when it's wrong. That diagnosis ruled out generation-quality (Path A) and trajectory (Path C) treatments and selected Path B: a preference-tuned critic deployed as a rejection-sampling layer.

Backbone substitution

chal.md named "Qwen 3.5" (0.8B / 2B / 4B band) as the eligible backbone. As of 2026-05-01, the Qwen 3.5 family had not been released, so it was replaced with unsloth/Qwen3-1.7B, the closest current open-weight match within the spec's parameter band. All other hyperparameters from the original spec are preserved.

Hyperparameters (Meng 2024 §4.3 verbatim)

| Param | Value |
|---|---|
| loss_type | simpo |
| beta (β) | 2.0 |
| simpo_gamma (γ) | 1.0 |
| learning_rate | 5e-6 |
| lr_scheduler | cosine |
| warmup_ratio | 0.1 |
| epochs | 8 (extended from 3; see Training history) |
| total_steps | 280 |
| effective_batch | 8 (per_device 2 × grad_accum 4) |
| max_seq_len | 2048 |
| max_prompt_len | 1536 |
| lora_r / lora_alpha / lora_dropout | 16 / 32 / 0.0 |
| target_modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
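
To make the roles of β and γ concrete, here is a minimal sketch of the SimPO objective for a single preference pair, following the published formulation: each response's reward is β times its average per-token log-probability, and the loss is −log σ(reward margin − γ). The function name and the toy log-probabilities are illustrative only; this is not the training code used for this adapter.

```python
import math


def simpo_loss(chosen_logps, rejected_logps, beta=2.0, gamma=1.0):
    """SimPO loss for one pair, given per-token log-probs of each response.

    Returns (loss, reward_margin). Rewards are length-normalized, so a long
    response is not penalized merely for having more tokens.
    """
    chosen_reward = beta * sum(chosen_logps) / len(chosen_logps)
    rejected_reward = beta * sum(rejected_logps) / len(rejected_logps)
    margin = chosen_reward - rejected_reward
    # -log(sigmoid(margin - gamma)) == softplus(gamma - margin), computed stably.
    z = gamma - margin
    loss = max(z, 0.0) + math.log1p(math.exp(-abs(z)))
    return loss, margin


# Toy example: the chosen response is more likely per token than the rejected one.
loss, margin = simpo_loss([-0.2, -0.3, -0.1], [-1.5, -2.0])
print(f"margin={margin:.3f} loss={loss:.3f}")
```

With γ = 1.0, a pair only stops contributing meaningful loss once the chosen reward exceeds the rejected reward by well over 1.0, which is why the reward margin is tracked alongside accuracy during training.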

Training data

306 preference pairs, all anchored to train/ partition only (zero held-out leak):

  • 148 v2 style-guide-anchored pairs: 12 GOOD + 12 BAD hand-labeled drafts (Tenacious_Style_Guide_v2.md) scored against every train task with scoring_evaluator.py. Per task: best GOOD = chosen, worst BAD = rejected, margin ≥ 0.30 (median 0.477).
  • 158 mutator-augmented pairs: DeepSeek V3.2 (non-Qwen family) writes one strong-prompt chosen per train task; deterministic task-aware mutator corrupts that chosen along all 5 rubric dimensions (must_not tokens, off-segment fingerprint, pricing TCV, banned phrases, wrong-TZ, stacked asks) → rejected at margin ≥ 0.30 (median 0.613).
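
The selection rule for the v2-anchored pairs (best GOOD as chosen, worst BAD as rejected, keep only if the margin is at least 0.30) can be sketched as follows. The function name, the dict-of-scores input shape, and the draft IDs are hypothetical; the actual scoring is done by scoring_evaluator.py, whose interface is not shown in this card.

```python
def build_pair(good_scores, bad_scores, min_margin=0.30):
    """Pick one preference pair for a task from rubric scores.

    good_scores / bad_scores: dicts mapping a draft ID to its rubric score for
    this task (hypothetical shape). Returns (chosen_id, rejected_id, margin),
    or None when the score margin is below min_margin.
    """
    if not good_scores or not bad_scores:
        return None
    chosen = max(good_scores, key=good_scores.get)    # best GOOD draft
    rejected = min(bad_scores, key=bad_scores.get)    # worst BAD draft
    margin = good_scores[chosen] - bad_scores[rejected]
    if margin < min_margin:
        return None
    return chosen, rejected, margin


# Toy scores for a single task (illustrative values only).
pair = build_pair({"good_01": 0.90, "good_02": 0.70}, {"bad_01": 0.50, "bad_02": 0.30})
print(pair)
```

Tasks whose best GOOD and worst BAD drafts score too close together simply yield no pair, which keeps the training signal clean at the cost of fewer pairs.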

Discarded path: organic strong-vs-weak DeepSeek synthesis. Reason: DeepSeek's safety/quality training sanitises the "weak SDR" prompt, so its rejected outputs scored a median of 0.625, not bad enough for clean margin separation. The mutator was the working solution. Documented in training_data/build_log.json.

Leakage policy (Li et al., 2025 four-vector audit)

| Vector | Status |
|---|---|
| Generator/judge family overlap | Controlled: labeling judge is rule-based (scoring_evaluator.py), no LLM |
| Generator/rewriter base sharing | Controlled: chosens authored by human + DeepSeek V3.2; zero Qwen-family pairs |
| Train/test author overlap (same authors, different tasks) | Controlled: held_out partition uses same authoring modes but different task instances; SHA-256 manifest sealed |
| Held-out distribution drift | Partial control: 8-gram Jaccard < 0.30 + embedding cosine < 0.85; 0 violations |
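
As an illustration of the lexical half of the drift check, the sketch below computes an 8-gram Jaccard similarity between two texts and flags pairs at or above the 0.30 bound. It assumes word-level, lowercased, whitespace-tokenized n-grams, which this card does not specify; the embedding-cosine half is not covered here.

```python
def ngrams(text, n=8):
    """Set of word-level n-grams (assumption: lowercase, whitespace tokens)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def ngram_jaccard(a, b, n=8):
    """Jaccard similarity |A ∩ B| / |A ∪ B| over the two n-gram sets."""
    grams_a, grams_b = ngrams(a, n), ngrams(b, n)
    if not grams_a or not grams_b:
        return 0.0  # a text shorter than n tokens has no n-grams to share
    return len(grams_a & grams_b) / len(grams_a | grams_b)


def violates_drift_bound(train_text, held_out_text, threshold=0.30, n=8):
    """True when the pair is too lexically similar to count as held-out."""
    return ngram_jaccard(train_text, held_out_text, n) >= threshold


train = "one two three four five six seven eight nine ten"
held_out = train + " eleven"
print(ngram_jaccard(train, held_out), violates_drift_bound(train, held_out))
```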

Training history

Loss curve (full curve in training/training_run_seed11.json):

| Step | Train loss | Val loss | Reward margin | Reward acc |
|---|---|---|---|---|
| 20 | 1.196 | 1.481 | 0.034 | 0.542 |
| 105 (3 ep) | weak | n/a | −0.06 | 0.542 |
| 200 | 0.060 | 0.220 | 6.170 | 0.969 |
| 220 | 0.008 | 0.197 | 6.349 | 0.969 |
| 280 (final) | 0.107 | 0.187 | 6.496 | 0.969 |

Initial 3-epoch run produced a weak critic (margin −0.06, 54 % acc). Resumed for 5 more epochs because val loss was monotonically decreasing. Train loss bottomed at step 220 (0.008) and drifted up slightly through step 280; val loss kept improving with no overfit signature. Step 280 selected.

Held-out evaluation (52 tasks)

Three conditions, all using unsloth/Qwen3-1.7B as the base generator:

| Condition | Setup | Mean ± SD | Wall/task |
|---|---|---|---|
| baseline | adapter OFF, neutral system + task style_guide_excerpts, k=1, T=0 | 0.6618 ± 0.20 | 19.4 s |
| critic_rs | adapter OFF for gen (k=4 @ T=0.7), adapter ON for ranking via length-normalized log-prob × β=2.0; top-1 selected | 0.6643 ± 0.21 | 23.5 s |
| promptaug | adapter OFF, condensed v2 style guide (~900 tok) in system prompt, k=1, T=0 | 0.6472 ± 0.20 | 20.1 s |
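
The ranking step in critic_rs can be sketched as below: each candidate is scored by β times its mean per-token log-probability under the adapter-enabled model, and the highest-scoring candidate is kept. The per-token log-probabilities are passed in as plain lists here; how they are obtained from the model is outside this sketch, and the function names are illustrative rather than taken from the ablation code.

```python
def critic_score(token_logps, beta=2.0):
    """Length-normalized reward: beta * mean per-token log-probability."""
    if not token_logps:
        raise ValueError("candidate has no tokens to score")
    return beta * sum(token_logps) / len(token_logps)


def select_top1(candidates, beta=2.0):
    """Pick the best of k candidates.

    candidates: list of (text, token_logps) tuples, where token_logps are the
    critic's log-probs for that candidate's tokens (hypothetical input shape).
    """
    best_text, _ = max(candidates, key=lambda c: critic_score(c[1], beta))
    return best_text


# Toy pool of k=4 candidates with made-up log-probs.
pool = [
    ("draft A", [-2.0, -2.0]),
    ("draft B", [-0.5, -0.5, -0.5, -0.5]),
    ("draft C", [-1.0]),
    ("draft D", [-0.9, -1.1, -1.3]),
]
print(select_top1(pool))
```

Because the score is a per-token mean, a longer draft is not penalized for length alone; note that β rescales every score by the same factor and so does not change the ranking.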

Paired-bootstrap deltas (B = 10 000, seed=11)

| Comparison | Mean Δ | 95 % CI | p (one-sided) | W / L / T |
|---|---|---|---|---|
| Δ A (critic_rs − baseline) | +0.0025 | [−0.019, +0.023] | 0.40 | 19 / 16 / 17 |
| Δ B (critic_rs − promptaug) | +0.0171 | [−0.006, +0.042] | 0.08 | 22 / 17 / 13 |
| Δ C (τ²-Bench retail) | informational only; reused from week_10/seed/baseline_numbers.md per chal.md (no re-run) | n/a | n/a | n/a |
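
For readers unfamiliar with the procedure, here is a minimal paired-bootstrap sketch using only the standard library. It resamples per-task score differences with replacement, takes a percentile 95 % interval, and reports the one-sided p-value as the share of bootstrap means at or below zero. That p-value definition and the percentile method are assumptions; ablations/compute_deltas.py may differ in detail.

```python
import random


def paired_bootstrap(treatment, control, n_boot=10_000, seed=11):
    """Paired bootstrap over per-task score differences (treatment - control)."""
    if len(treatment) != len(control):
        raise ValueError("paired samples must have equal length")
    deltas = [t - c for t, c in zip(treatment, control)]
    n = len(deltas)
    rng = random.Random(seed)
    means = sorted(sum(rng.choices(deltas, k=n)) / n for _ in range(n_boot))
    return {
        "mean_delta": sum(deltas) / n,
        # Percentile interval (assumed method).
        "ci95": (means[int(0.025 * n_boot)], means[int(0.975 * n_boot) - 1]),
        # One-sided p: share of resampled means showing no improvement (assumed).
        "p_one_sided": sum(m <= 0 for m in means) / n_boot,
        "wins": sum(d > 0 for d in deltas),
        "losses": sum(d < 0 for d in deltas),
        "ties": sum(d == 0 for d in deltas),
    }


# Toy scores for 6 tasks (illustrative only, not the held-out traces).
critic = [0.70, 0.55, 0.80, 0.40, 0.65, 0.90]
base = [0.65, 0.55, 0.85, 0.40, 0.60, 0.85]
print(paired_bootstrap(critic, base, n_boot=2000))
```

A high tie count shrinks the spread of the resampled means toward zero, so an interval that straddles zero with many ties is what a near-null effect looks like.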

Per-stratum breakdown

| Stratum | n | baseline | critic_rs | promptaug | Best |
|---|---|---|---|---|---|
| ADV (hand-authored adversarial) | 9 | 0.657 | 0.658 | 0.711 | promptaug (+5.4pp) |
| PRG (combinatorial programmatic) | 22 | 0.703 | 0.732 | 0.687 | critic_rs (+2.9pp) |
| SYN (multi-LLM synthesis) | 10 | 0.831 | 0.802 | 0.763 | baseline |
| TRC (trace-derived) | 11 | 0.429 | 0.408 | 0.411 | baseline (all weak) |

The critic concentrates its lift on the PRG stratum (mechanical rubric features it was trained to recognise). Promptaug concentrates its lift on ADV (where the v2 rule list helps the base model handle edge cases). Both interventions hurt on SYN (well-formed cases the base handles cleanly already) and TRC (adversarial trace items defeat all three).

Cost-Pareto

| Condition | Mean fwd passes | Cost ratio vs baseline | Quality lift (abs) |
|---|---|---|---|
| baseline | 1.0 | 1.00× | 0.000 |
| critic_rs | 5.0 | 1.21× | +0.0025 |
| promptaug | 1.0 | 1.04× | −0.0146 |

critic_rs is strictly Pareto-dominated by baseline at 95 % CI: 1.21× compute for approximately zero quality lift. promptaug is also dominated.
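
The cost ratios are consistent with the ratio of per-task wall-clock times from the held-out evaluation, although the card does not state that this is how they were derived. The sketch below recomputes the ratios and lifts from the reported means and applies one possible dominance criterion: a condition is treated as dominated when it costs more than baseline and the lower bound of its lift interval does not exceed zero. That criterion is an assumption for illustration.

```python
# Reported per-task wall-clock seconds and mean scores from the held-out evaluation.
CONDITIONS = {
    "baseline": {"wall_s": 19.4, "mean_score": 0.6618},
    "critic_rs": {"wall_s": 23.5, "mean_score": 0.6643},
    "promptaug": {"wall_s": 20.1, "mean_score": 0.6472},
}


def cost_table(conditions, reference="baseline"):
    """Cost ratio (wall-clock vs reference) and absolute lift per condition."""
    ref = conditions[reference]
    return {
        name: {
            "cost_ratio": round(c["wall_s"] / ref["wall_s"], 2),
            "lift": round(c["mean_score"] - ref["mean_score"], 4),
        }
        for name, c in conditions.items()
    }


def dominated_at_ci(cost_ratio, lift_ci_low):
    """Assumed criterion: costs more, and the lift is not reliably above zero."""
    return cost_ratio > 1.0 and lift_ci_low <= 0.0


table = cost_table(CONDITIONS)
for name, row in table.items():
    print(name, row)
# critic_rs lift CI lower bound from the paired bootstrap is -0.019.
print("critic_rs dominated:", dominated_at_ci(table["critic_rs"]["cost_ratio"], -0.019))
```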

Why training transfer failed (the memo p2 finding)

Three hypotheses, ordered by trace-data support:

  1. Candidate-pool variance bottleneck (strongest). 17 / 52 ties on Δ A. Same-generator k=4 candidates at T=0.7 produce outputs the offline rubric scores identically, so rejection sampling cannot improve a homogeneous pool. The critic was trained on pairs with margin ≥ 0.30; it never saw a pool this clustered. Future small-N work should scale candidate-pool variance (temperature sweeps, model-family mixing) rather than scaling N alone.
  2. In-distribution overfit (medium). 96.9 % val acc was on 31 pairs from the same v2-anchored + mutator-augmented distribution. Held-out bench candidates are a distribution shift the critic never trained on.
  3. N=306 vs Meng 2024's ~5K studied band (real, but weakest signal here). Documented limitation; the SimPO paper does not characterise behaviour at this N.
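
The first hypothesis suggests a simple diagnostic: measure how much room each candidate pool leaves for any ranker. The sketch below computes the oracle headroom of a pool (best rubric score minus the pool mean) and the fraction of pools whose candidates all score the same. The function names and toy scores are hypothetical; this analysis is not part of the reported ablations.

```python
def pool_headroom(pool_scores):
    """Best achievable gain over picking a candidate at random from the pool."""
    return max(pool_scores) - sum(pool_scores) / len(pool_scores)


def homogeneous_fraction(pools, tol=1e-9):
    """Share of pools whose candidates all receive the same rubric score."""
    flat = sum(1 for scores in pools if max(scores) - min(scores) <= tol)
    return flat / len(pools)


# Toy rubric scores for k=4 candidates on three tasks (illustrative only).
pools = [
    [0.50, 0.50, 0.50, 0.50],  # homogeneous: no ranker can help here
    [0.20, 0.40, 0.60, 0.80],  # diverse: a perfect ranker gains 0.30
    [0.70, 0.70, 0.70, 0.70],
]
print([round(pool_headroom(p), 3) for p in pools])
print(homogeneous_fraction(pools))
```

If most pools have zero headroom, even a perfect critic cannot beat the baseline on those tasks, which is the situation the tie rate points to.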

This is the chal.md memo p2 "1 unresolved training failure" finding, not a project failure.

Intended use

  • Research artifact demonstrating a leak-controlled SimPO pipeline at small N on a B2B-sales preference task.
  • Educational: code, data, traces, and ablation results all reproduce from seed=11.
  • NOT for production B2B-sales generation. The held-out lift does not clear baseline at 95 % CI on the very bench it was trained for.

Out-of-scope

  • General preference scoring outside Tenacious B2B-sales register.
  • Non-English text. Bench is English-only.
  • Time-zones outside {Europe/Berlin, Europe/London, America/New_York, America/Los_Angeles, Africa/Addis_Ababa}; bench coverage stops there.
  • Scoring outputs longer than ~300 words (training pairs were short emails).

Limitations

| Limitation | Detail |
|---|---|
| Held-out null lift | Δ A 95 % CI crosses zero; do not use as a deployment gate |
| Single seed | Only seed=11 evaluated; no seed-stability bound |
| Single backbone | Only Qwen3-1.7B; transfer to other Qwen sizes / non-Qwen families unmeasured |
| Single judge | Eval-tier judge is Sonnet 4.6 only; no multi-judge bias decomposition |
| TRC weakness | Trace-derived adversarial tasks score 0.13–0.85 across all conditions; bench harder than critic |
| 17/52 tie rate | Candidate-pool variance is the deployment-time bottleneck |

Environmental cost

  • Training compute: 35.9 min × 1 × Tesla T4 ≈ 25 Wh.
  • Held-out evaluation: ~50 min × 1 × Tesla T4 ≈ 34 Wh.
  • Inference per critic_rs call: ~5 forward passes at fp16 on Qwen3-1.7B.
  • Total project compute: well under 1 kWh.

Reproducibility

# Reproduce training (Colab T4)
jupyter notebook tenacious_critic_simpo_seed11.ipynb

# Reproduce held-out ablations (Colab T4)
python ablations/run_ablations.py --condition all --out ablations/held_out_traces.jsonl
python ablations/compute_deltas.py --traces ablations/held_out_traces.jsonl --out ablations/ablation_results.json

seed=11 everywhere. TENACIOUS_BENCH_OFFLINE=1 reproduces the labeling pipeline without API calls.

Citation

@misc{nahom2026tenacious_critic,
  author       = {Nahom, A.},
  title        = {Tenacious-Critic-Qwen3-1.7B-SimPO-seed11},
  year         = {2026},
  note         = {LoRA SimPO preference critic, 10academy TRP1 Week 11 final submission},
  howpublished = {HuggingFace Hub, \url{https://huggingface.co/nahdes/tenacious-critic-qwen3-1-7b-simpo-seed11}}
}

References:

  • Meng, Y., Xia, M., Chen, D. (2024). SimPO: Simple Preference Optimization with a Reference-Free Reward. NeurIPS 2024.
  • Li et al. (2025). Preference Leakage: A Contamination Problem in LLM-as-a-Judge.
  • See synthesis_memos/ for opinionated commentary on each.