How to use nahdes/tenacious-critic-qwen3-1-7b-simpo-seed11 with PEFT:

```python
from peft import PeftModel
from transformers import AutoModelForCausalLM

# Load the base model, then attach the LoRA adapter on top of it.
base_model = AutoModelForCausalLM.from_pretrained("unsloth/Qwen3-1.7B")
model = PeftModel.from_pretrained(base_model, "nahdes/tenacious-critic-qwen3-1-7b-simpo-seed11")
```
Tenacious-Critic-Qwen3-1.7B-SimPO-seed11
A LoRA SimPO preference critic, trained for 35.9 minutes on a free Colab T4, and honestly underperforming on the held-out partition of the bench it was trained for.
| Field | Value |
|---|---|
| Repo (planned) | nahdes/tenacious-critic-qwen3-1-7b-simpo-seed11 |
| Backbone | unsloth/Qwen3-1.7B |
| Adapter | LoRA, rank 16, alpha 32, 7 modules (q/k/v/o + gate/up/down) |
| Training method | SimPO (Meng et al., 2024), reference-free, length-normalized |
| Pairs | 306 (148 v2-anchored + 158 mutator-augmented) |
| Compute | Colab T4 (Tesla T4, 16 GB), free tier |
| Wall clock | 35.9 min |
| Trainable params | 17.4 M / 1.74 B (1.00 %) |
| Precision | fp16 |
| Seed | 11 |
| License | CC-BY-4.0 (adapter weights) |
| Companion dataset | nahdes/tenacious-bench-v0.1 |
Headline numbers
Two numbers, both honest, in tension:
| Metric | Value | Setting |
|---|---|---|
| In-distribution validation accuracy | 96.9 % | 31 held-out preference pairs (same construction pipeline as training) |
| Reward margin (chosen − rejected) | +6.496 | Same 31 val pairs, step 280 |
| Held-out task-level lift over baseline | +0.0025, 95% CI [−0.019, +0.023], p=0.40 | 52 sealed Tenacious-Bench v0.1 tasks, paired bootstrap, B=10000 |
The critic learned the labeling rubric (96.9 % pair acc, +6.50 margin). It does not add task-level lift over a same-backbone single-shot baseline at 95 % CI. Use only as a research artifact.
Deploy recommendation: NO-DEPLOY for the held-out partition.
Why was this trained?
The Tenacious-Bench v0.1 project audits five Tenacious-only failure modes that τ²-Bench retail cannot probe (signal over-claiming, tone drift, ICP misclassification, AI-maturity-gated mispitch, EAT-EU/US scheduling). Week 10 trace evidence (probe IDs probe_20260423T191351 rows 9, 11, 21 in run_log.jsonl) characterized these as inconsistency failures: the agent gets it right most of the time but cannot tell when it's wrong. That diagnosis ruled out generation-quality (Path A) and trajectory (Path C) treatments and selected Path B: a preference-tuned critic deployed as a rejection-sampling layer.
Backbone substitution
chal.md named "Qwen 3.5" (0.8B / 2B / 4B band) as the eligible backbone. As of 2026-05-01, the Qwen 3.5 family had not been released. Substituted with unsloth/Qwen3-1.7B, the closest current open-weight match within the spec's parameter band. All other hyperparameters from the original spec are preserved.
Hyperparameters (Meng 2024 §4.3 verbatim)
| Param | Value |
|---|---|
| loss_type | simpo |
| beta (β) | 2.0 |
| simpo_gamma (γ) | 1.0 |
| learning_rate | 5e-6 |
| lr_scheduler | cosine |
| warmup_ratio | 0.1 |
| epochs | 8 (extended from 3; see training history) |
| total_steps | 280 |
| effective_batch | 8 (per_device 2 × grad_accum 4) |
| max_seq_len | 2048 |
| max_prompt_len | 1536 |
| lora_r / lora_alpha / lora_dropout | 16 / 32 / 0.0 |
| target_modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
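To make the roles of β and γ concrete, the SimPO objective of Meng et al. (2024) for a single preference pair can be sketched in plain Python. This is an illustration only, not the training code used for this adapter: the function name and its arguments (summed sequence log-probs and token counts) are hypothetical, and the defaults simply mirror the table above.

```python
import math


def simpo_loss(logp_chosen, len_chosen, logp_rejected, len_rejected,
               beta=2.0, gamma=1.0):
    """SimPO loss for one pair (illustrative sketch, hypothetical helper).

    logp_*: summed token log-probs of each response under the policy.
    len_*:  number of response tokens, used for length normalization.
    """
    # Implicit reward = beta * average per-token log-prob (no reference model).
    reward_chosen = beta * logp_chosen / len_chosen
    reward_rejected = beta * logp_rejected / len_rejected
    margin = reward_chosen - reward_rejected

    # Loss = -log(sigmoid(margin - gamma)), written as a stable softplus.
    z = margin - gamma
    loss = max(-z, 0.0) + math.log1p(math.exp(-abs(z)))
    return loss, margin
```

When the reward margin equals γ exactly, the loss is log 2; it falls toward zero as the margin grows past γ.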
Training data
306 preference pairs, all anchored to train/ partition only (zero held-out leak):
- 148 v2 style-guide-anchored pairs: 12 GOOD + 12 BAD hand-labeled drafts (Tenacious_Style_Guide_v2.md) scored against every train task with scoring_evaluator.py. Per task: best GOOD = chosen, worst BAD = rejected, margin ≥ 0.30 (median 0.477).
- 158 mutator-augmented pairs: DeepSeek V3.2 (non-Qwen family) writes one strong-prompt chosen per train task; a deterministic task-aware mutator corrupts that chosen along all 5 rubric dimensions (must_not tokens, off-segment fingerprint, pricing TCV, banned phrases, wrong-TZ, stacked asks) → rejected at margin ≥ 0.30 (median 0.613).
Discarded path: organic strong-vs-weak DeepSeek synthesis. Reason: DeepSeek's safety/quality training sanitises the "weak SDR" prompt, so its rejected outputs scored 0.625 median, not bad enough for clean margin separation. The mutator was the working solution. Documented in training_data/build_log.json.
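The per-task selection rule for the style-guide-anchored pairs (best GOOD as chosen, worst BAD as rejected, kept only at margin ≥ 0.30) can be sketched as follows. The function name, the dict-of-scores input, and the draft ids are hypothetical; the real pipeline scores drafts with scoring_evaluator.py, which is not reproduced here.

```python
def build_pair(good_scores, bad_scores, min_margin=0.30):
    """Pick one preference pair for a task, or None if the margin is too small.

    good_scores / bad_scores: dict mapping draft id -> rubric score for
    this task (hypothetical input shape).
    """
    chosen = max(good_scores, key=good_scores.get)    # best GOOD draft
    rejected = min(bad_scores, key=bad_scores.get)    # worst BAD draft
    margin = good_scores[chosen] - bad_scores[rejected]
    if margin < min_margin:
        return None  # not enough separation for a clean preference signal
    return {"chosen": chosen, "rejected": rejected, "margin": round(margin, 3)}
```

The same margin filter explains why the organic DeepSeek synthesis was discarded: a rejected draft that still scores around 0.625 rarely leaves a 0.30 gap below the chosen one.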
Leakage policy (Li et al., 2025 four-vector audit)
| Vector | Status |
|---|---|
| Generator/judge family overlap | Controlled: labeling judge is rule-based (scoring_evaluator.py), no LLM |
| Generator/rewriter base sharing | Controlled: chosens authored by human + DeepSeek V3.2; zero Qwen-family pairs |
| Train/test author overlap (same authors, different tasks) | Controlled: held_out partition uses same authoring modes but different task instances; SHA-256 manifest sealed |
| Held-out distribution drift | Partial control: 8-gram Jaccard < 0.30 + embedding cosine < 0.85; 0 violations |
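The 8-gram Jaccard half of the drift check can be illustrated with a short sketch. The tokenization here (lowercasing and whitespace splitting) is an assumption, since the card does not state how n-grams were formed, and the embedding-cosine half is omitted because it depends on an encoder that is not named. All function names are hypothetical.

```python
def word_ngrams(text, n=8):
    """Set of word n-grams (assumed tokenization: lowercase + whitespace)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard(a, b, n=8):
    """Jaccard similarity between the n-gram sets of two texts."""
    grams_a, grams_b = word_ngrams(a, n), word_ngrams(b, n)
    if not grams_a and not grams_b:
        return 0.0  # both texts shorter than n tokens: nothing to compare
    return len(grams_a & grams_b) / len(grams_a | grams_b)


def is_violation(train_text, held_out_text, threshold=0.30):
    """True when a train/held-out pair overlaps at or above the threshold."""
    return jaccard(train_text, held_out_text) >= threshold
```

A full audit would apply `is_violation` to every train/held-out pair and count the hits; the table above reports zero.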
Training history
Loss curve (full curve in training/training_run_seed11.json):
| Step | Train loss | Val loss | Reward margin | Reward acc |
|---|---|---|---|---|
| 20 | 1.196 | 1.481 | 0.034 | 0.542 |
| 105 (3 ep) | weak | - | ≈0.06 | 0.542 |
| 200 | 0.060 | 0.220 | 6.170 | 0.969 |
| 220 | 0.008 | 0.197 | 6.349 | 0.969 |
| 280 (final) | 0.107 | 0.187 | 6.496 | 0.969 |
Initial 3-epoch run produced a weak critic (margin ≈0.06, 54 % acc). Resumed for 5 more epochs because val loss was monotonically decreasing. Train loss bottomed at step 220 (0.008) and drifted up slightly through step 280; val loss kept improving with no overfit signature. Step 280 selected.
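As a consistency check on the step counts, the arithmetic below relates pairs, effective batch size, and epochs. It assumes the 31 validation pairs were carved out of the 306 total, leaving 275 for training; the card does not state the split, so treat that figure as a guess. It also assumes a partial final batch counts as one optimizer step, which can differ between trainer versions.

```python
import math


def total_steps(n_train_pairs, per_device_batch, grad_accum, epochs):
    """Optimizer steps for a run, counting a partial last batch as one step."""
    effective_batch = per_device_batch * grad_accum
    steps_per_epoch = math.ceil(n_train_pairs / effective_batch)
    return steps_per_epoch * epochs


# Assumption: 306 total pairs minus 31 validation pairs.
N_TRAIN_ASSUMED = 306 - 31

steps_after_3_epochs = total_steps(N_TRAIN_ASSUMED, 2, 4, 3)
steps_after_8_epochs = total_steps(N_TRAIN_ASSUMED, 2, 4, 8)
```

Under these assumptions the two values match the step-105 and step-280 rows of the table.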
Held-out evaluation (52 tasks)
Three conditions, all using unsloth/Qwen3-1.7B as the base generator:
| Condition | Setup | Mean ± SD | Wall/task |
|---|---|---|---|
| baseline | adapter OFF, neutral system + task style_guide_excerpts, k=1, T=0 | 0.6618 ± 0.20 | 19.4 s |
| critic_rs | adapter OFF for gen (k=4 @ T=0.7), adapter ON for ranking via length-normalized log-prob × β=2.0; top-1 selected | 0.6643 ± 0.21 | 23.5 s |
| promptaug | adapter OFF, condensed v2 style guide (~900 tok) in system prompt, k=1, T=0 | 0.6472 ± 0.20 | 20.1 s |
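The ranking step of the critic_rs condition (length-normalized log-prob × β, top-1 selected) can be sketched as below. It assumes the per-token log-probs of each candidate have already been computed with the adapter enabled; the function names and the (text, log-probs) tuple layout are hypothetical, not the repository's API.

```python
def critic_score(token_logprobs, beta=2.0):
    """Length-normalized reward: beta times the mean per-token log-prob."""
    return beta * sum(token_logprobs) / len(token_logprobs)


def select_top1(candidates, beta=2.0):
    """Return the text of the highest-scoring candidate.

    candidates: list of (text, token_logprobs) tuples, where token_logprobs
    are assumed to come from the adapter-enabled model.
    """
    best_text, _ = max(candidates, key=lambda c: critic_score(c[1], beta))
    return best_text
```

Because the score is a per-token mean, a longer candidate is not penalized merely for having more tokens.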
Paired-bootstrap deltas (B = 10 000, seed=11)
| Comparison | Mean Δ | 95 % CI | p (one-sided) | W / L / T |
|---|---|---|---|---|
| Δ A (critic_rs − baseline) | +0.0025 | [−0.019, +0.023] | 0.40 | 19 / 16 / 17 |
| Δ B (critic_rs − promptaug) | +0.0171 | [−0.006, +0.042] | 0.08 | 22 / 17 / 13 |
| Δ C (τ²-Bench retail) | informational only: reused from week_10/seed/baseline_numbers.md per chal.md (no re-run) | - | - | - |
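For readers who want to recompute deltas of this kind, a paired bootstrap over per-task score differences can be sketched as follows. This is not the code in ablations/compute_deltas.py: the percentile interval and the one-sided p definition (share of resampled means at or below zero) are assumptions, since the card does not spell out either, and the function name is hypothetical.

```python
import random


def paired_bootstrap(treat, base, n_boot=10_000, seed=11):
    """Paired bootstrap on per-task score differences (illustrative sketch)."""
    diffs = [t - b for t, b in zip(treat, base)]
    n = len(diffs)
    rng = random.Random(seed)

    # Resample tasks with replacement, keeping each task's pair intact.
    means = []
    for _ in range(n_boot):
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()

    ci_low = means[int(0.025 * n_boot)]
    ci_high = means[int(0.975 * n_boot) - 1]
    # Assumed one-sided p: fraction of resampled means showing no improvement.
    p_one_sided = sum(m <= 0 for m in means) / n_boot
    return {"mean_delta": sum(diffs) / n, "ci95": (ci_low, ci_high), "p": p_one_sided}
```

Called with the 52 per-task scores of two conditions, it returns a mean delta, an interval, and a p-value in the same shape as the rows above.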
Per-stratum breakdown
| Stratum | n | baseline | critic_rs | promptaug | Best |
|---|---|---|---|---|---|
| ADV (hand-authored adversarial) | 9 | 0.657 | 0.658 | 0.711 | promptaug (+5.4pp) |
| PRG (combinatorial programmatic) | 22 | 0.703 | 0.732 | 0.687 | critic_rs (+2.9pp) |
| SYN (multi-LLM synthesis) | 10 | 0.831 | 0.802 | 0.763 | baseline |
| TRC (trace-derived) | 11 | 0.429 | 0.408 | 0.411 | baseline (all weak) |
The critic concentrates its lift on the PRG stratum (mechanical rubric features it was trained to recognise). Promptaug concentrates its lift on ADV (where the v2 rule list helps the base model handle edge cases). Both interventions hurt on SYN (well-formed cases the base handles cleanly already) and TRC (adversarial trace items defeat all three).
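As a sanity check, the per-stratum means weighted by n should recover the overall condition means reported in the held-out evaluation table. The sketch below uses only numbers from the two tables; the dictionary layout and function name are hypothetical, and small differences are expected because the stratum means are rounded to three decimals.

```python
# stratum -> (n, baseline, critic_rs, promptaug), copied from the table above
STRATA = {
    "ADV": (9, 0.657, 0.658, 0.711),
    "PRG": (22, 0.703, 0.732, 0.687),
    "SYN": (10, 0.831, 0.802, 0.763),
    "TRC": (11, 0.429, 0.408, 0.411),
}


def pooled_mean(column):
    """n-weighted mean across strata; column 1=baseline, 2=critic_rs, 3=promptaug."""
    total_n = sum(row[0] for row in STRATA.values())
    weighted = sum(row[0] * row[column] for row in STRATA.values())
    return weighted / total_n


pooled = {name: pooled_mean(i) for i, name in
          enumerate(["baseline", "critic_rs", "promptaug"], start=1)}
```

The pooled values land within rounding distance of 0.6618, 0.6643, and 0.6472, so the stratum table and the headline means are mutually consistent.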
Cost-Pareto
| Condition | Mean fwd passes | Cost ratio vs baseline | Quality lift abs |
|---|---|---|---|
| baseline | 1.0 | 1.00× | 0.000 |
| critic_rs | 5.0 | 1.21× | +0.0025 |
| promptaug | 1.0 | 1.04× | −0.0146 |
critic_rs is strictly Pareto-dominated by baseline at 95 % CI: 1.21× compute for ≈ zero quality lift. promptaug is also dominated.
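The dominance call can be made explicit with a small sketch. Two things here are inferences rather than statements from the card: the cost ratios appear to be wall-clock ratios (they match the Wall/task column of the held-out evaluation table), and a lift whose 95 % CI spans zero is treated as no gain. The function names are hypothetical.

```python
def cost_ratio(wall_s, baseline_wall_s):
    """Wall-clock cost relative to baseline, rounded as in the table."""
    return round(wall_s / baseline_wall_s, 2)


def dominated_by_baseline(ratio, lift, ci=None):
    """True if a condition costs more than baseline without a credible gain.

    A lift counts as 'no gain' when it is non-positive, or when a supplied
    95 % CI (low, high) contains zero (assumed decision rule).
    """
    no_gain = lift <= 0 or (ci is not None and ci[0] <= 0.0 <= ci[1])
    return ratio > 1.0 and no_gain


critic_ratio = cost_ratio(23.5, 19.4)
promptaug_ratio = cost_ratio(20.1, 19.4)
```

On point estimates alone critic_rs has a slightly positive lift, so the dominance verdict rests on reading the interval, not the mean.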
Why training transfer failed (the memo p2 finding)
Three hypotheses, ordered by trace-data support:
- Candidate-pool variance bottleneck (strongest). 17 / 52 ties on Δ A. Same-generator k=4 candidates at T=0.7 produce outputs the offline rubric scores identically, so rejection sampling cannot improve a homogeneous pool. The critic was trained on pairs with margin ≥ 0.30; it never saw a pool this clustered. Future small-N work should scale candidate-pool variance (temperature sweeps, model-family mixing) rather than scaling N alone.
- In-distribution overfit (medium). 96.9 % val acc was on 31 pairs from the same v2-anchored + mutator-augmented distribution. Held-out bench candidates are a distribution shift the critic never trained on.
- N=306 vs Meng 2024's ~5K studied band (real, but weakest signal here). Documented limitation; the SimPO paper does not characterise behaviour at this N.
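The first hypothesis suggests a simple diagnostic: measure how often a candidate pool has no spread in rubric score, since no selector can beat a random pick in that case. The sketch below is a hypothetical diagnostic, not something the card reports running; the function names, the tolerance, and the example input shape are all assumptions.

```python
def pool_spread(scores):
    """Gap between the best and worst rubric score in one candidate pool."""
    return max(scores) - min(scores)


def flat_pool_rate(pools, tol=1e-9):
    """Fraction of tasks whose candidate pool is effectively homogeneous.

    pools: list of per-task score lists (one rubric score per candidate).
    """
    flat = sum(1 for scores in pools if pool_spread(scores) <= tol)
    return flat / len(pools)
```

The spread is also an upper bound on what any ranker could gain over the worst candidate for that task, so a high flat-pool rate caps the achievable lift regardless of critic quality.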
This is the chal.md memo p2 "1 unresolved training failure" finding, not a project failure.
Intended use
- Research artifact demonstrating a leak-controlled SimPO pipeline at small N on a B2B-sales preference task.
- Educational: code, data, traces, ablation results all reproduce from seed=11.
- NOT for production B2B-sales generation. The held-out lift does not clear baseline at 95 % CI on the very bench it was trained for.
Out-of-scope
- General preference scoring outside Tenacious B2B-sales register.
- Non-English text. Bench is English-only.
- Time-zones outside {Europe/Berlin, Europe/London, America/New_York, America/Los_Angeles, Africa/Addis_Ababa}: bench coverage stops there.
- Scoring outputs longer than ~300 words (training pairs were short emails).
Limitations
| Limitation | Detail |
|---|---|
| Held-out null lift | Δ A 95 % CI crosses zero; do not use as a deployment gate |
| Single seed | Only seed=11 evaluated; no seed-stability bound |
| Single backbone | Only Qwen3-1.7B; transfer to other Qwen sizes / non-Qwen families unmeasured |
| Single judge | Eval-tier judge is Sonnet 4.6 only; no multi-judge bias decomposition |
| TRC weakness | Trace-derived adversarial tasks score 0.13–0.85 across all conditions; bench harder than critic |
| 17/52 tie rate | Candidate-pool variance is the deployment-time bottleneck |
Environmental cost
- Training compute: 35.9 min × 1 × Tesla T4 ≈ 25 Wh.
- Held-out evaluation: ~50 min × 1 × Tesla T4 ≈ 34 Wh.
- Inference per critic_rs call: ~5 forward passes at fp16 on Qwen3-1.7B.
- Total project compute: well under 1 kWh.
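The card does not state the power draw behind these estimates, but it can be backed out from the listed durations and energies. The sketch below does only that arithmetic; the function name is hypothetical.

```python
def implied_power_w(energy_wh, minutes):
    """Average power (watts) implied by an energy figure and a duration."""
    return energy_wh / (minutes / 60.0)


training_w = implied_power_w(25, 35.9)    # training run
evaluation_w = implied_power_w(34, 50)    # held-out evaluation
```

Both figures imply an average draw of roughly 41 to 42 W, below the Tesla T4's 70 W board power, so the estimates assume a partially loaded GPU.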
Reproducibility
```bash
# Reproduce training (Colab T4)
jupyter notebook tenacious_critic_simpo_seed11.ipynb

# Reproduce held-out ablations (Colab T4)
python ablations/run_ablations.py --condition all --out ablations/held_out_traces.jsonl
python ablations/compute_deltas.py --traces ablations/held_out_traces.jsonl --out ablations/ablation_results.json
```
seed=11 everywhere. TENACIOUS_BENCH_OFFLINE=1 reproduces the labeling pipeline without API calls.
Citation
```bibtex
@misc{nahom2026tenacious_critic,
  author       = {Nahom, A.},
  title        = {Tenacious-Critic-Qwen3-1.7B-SimPO-seed11},
  year         = {2026},
  note         = {LoRA SimPO preference critic, 10academy TRP1 Week 11 final submission},
  howpublished = {HuggingFace Hub, \url{https://huggingface.co/nahdes/tenacious-critic-qwen3-1-7b-simpo-seed11}}
}
```
References:
- Meng, Y., Xia, M., Chen, D. (2024). SimPO: Simple Preference Optimization with a Reference-Free Reward. NeurIPS 2024.
- Li et al. (2025). Preference Leakage: A Contamination Problem in LLM-as-a-Judge.
- See synthesis_memos/ for opinionated commentary on each.