---
license: cc-by-4.0
language:
  - en
base_model: unsloth/Qwen3-1.7B
library_name: peft
pipeline_tag: text-generation
tags:
  - lora
  - simpo
  - preference-optimization
  - rejection-sampling
  - sales
  - b2b
  - qwen3
  - tenacious
datasets:
  - nahdes/tenacious-bench-v0.1
---

# Tenacious-Critic-Qwen3-1.7B-SimPO-seed11

A LoRA SimPO preference critic, trained for 35.9 minutes on a free-tier Colab T4, that **does not beat its baseline** on the held-out partition of the bench it was trained for.

| Field | Value |
|---|---|
| Repo (planned) | `nahdes/tenacious-critic-qwen3-1-7b-simpo-seed11` |
| Backbone | `unsloth/Qwen3-1.7B` |
| Adapter | LoRA, rank 16, alpha 32, 7 modules (q/k/v/o + gate/up/down) |
| Training method | SimPO (Meng et al., 2024) — reference-free, length-normalized |
| Pairs | 306 (148 v2-anchored + 158 mutator-augmented) |
| Compute | Colab T4 (Tesla T4, 16 GB), free tier |
| Wall clock | 35.9 min |
| Trainable params | 17.4 M / 1.74 B (1.00 %) |
| Precision | fp16 |
| Seed | 11 |
| License | CC-BY-4.0 (adapter weights) |
| Companion dataset | `nahdes/tenacious-bench-v0.1` |

---

## Headline numbers

Two results, in tension: strong in-distribution accuracy (first two rows) and a null held-out lift (third row).

| Metric | Value | Setting |
|---|---|---|
| **In-distribution validation accuracy** | **96.9 %** | 31 held-out preference pairs (same construction pipeline as training) |
| **Reward margin (chosen − rejected)** | **+6.496** | Same 31 val pairs, step 280 |
| **Held-out task-level lift over baseline** | **+0.0025**, 95% CI [−0.019, +0.023], p=0.40 | 52 sealed Tenacious-Bench v0.1 tasks, paired bootstrap, B=10000 |

The critic learned the labeling rubric (96.9 % pair acc, +6.50 margin). It does **not** add measurable task-level lift over a same-backbone single-shot baseline: the 95 % CI on the lift spans zero. Use only as a research artifact.

**Deploy recommendation: NO-DEPLOY** for the held-out partition.

---

## Why was this trained?

The Tenacious-Bench v0.1 project audits five Tenacious-only failure modes that τ²-Bench retail cannot probe (signal over-claiming, tone drift, ICP misclassification, AI-maturity-gated mispitch, EAT↔EU/US scheduling). Week 10 trace evidence (probe IDs `probe_20260423T191351` rows 9, 11, 21 in `run_log.jsonl`) characterized these as **inconsistency** failures — the agent gets it right most of the time but cannot tell when it's wrong. That diagnosis ruled out generation-quality (Path A) and trajectory (Path C) treatments and selected Path B: a preference-tuned critic deployed as a rejection-sampling layer.

## Backbone substitution

`chal.md` named "Qwen 3.5" (0.8B / 2B / 4B band) as the eligible backbone. As of 2026-05-01, **the Qwen 3.5 family had not been released**, so it was replaced with `unsloth/Qwen3-1.7B`, the closest current open-weight match within the spec's parameter band. All other hyperparameters from the original spec are preserved.

## Hyperparameters (Meng et al., 2024 §4.3, except `epochs`)

| Param | Value |
|---|---|
| `loss_type` | `simpo` |
| `beta` (β) | 2.0 |
| `simpo_gamma` (γ) | 1.0 |
| `learning_rate` | 5e-6 |
| `lr_scheduler` | cosine |
| `warmup_ratio` | 0.1 |
| `epochs` | 8 (extended from 3 — see training history) |
| `total_steps` | 280 |
| `effective_batch` | 8 (per_device 2 × grad_accum 4) |
| `max_seq_len` | 2048 |
| `max_prompt_len` | 1536 |
| `lora_r` / `lora_alpha` / `lora_dropout` | 16 / 32 / 0.0 |
| `target_modules` | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
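
To make the roles of β and γ concrete, here is a minimal pure-Python sketch of the SimPO objective for a single pair, following the loss in Meng et al. (2024). The function name `simpo_loss` and its arguments are illustrative assumptions; this is not the training code from the notebook.

```python
import math

def simpo_loss(logp_chosen, len_chosen, logp_rejected, len_rejected,
               beta=2.0, gamma=1.0):
    """SimPO loss for one preference pair (illustrative helper, not the repo's code).

    logp_* : summed token log-probabilities of each response under the policy
    len_*  : response lengths in tokens
    """
    # Length-normalized implicit rewards: r = beta * (sum of log-probs) / |y|
    reward_chosen = beta * logp_chosen / len_chosen
    reward_rejected = beta * logp_rejected / len_rejected
    margin = reward_chosen - reward_rejected
    # loss = -log(sigmoid(margin - gamma)), written as a stable softplus
    z = margin - gamma
    loss = math.log1p(math.exp(-abs(z))) + max(-z, 0.0)
    return loss, margin

# When the reward margin equals gamma, the loss is exactly log(2).
loss, margin = simpo_loss(-10.0, 10, -15.0, 10)
print(f"margin={margin:.3f} loss={loss:.4f}")
```

The loss only falls below log 2 once the reward margin exceeds the target margin γ, which is why the reward-margin column in the training history is the quantity to watch.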

## Training data

306 preference pairs, all anchored to the `train/` partition only (zero held-out leak):

- **148 v2 style-guide-anchored pairs**: 12 GOOD + 12 BAD hand-labeled drafts (`Tenacious_Style_Guide_v2.md`) scored against every train task with `scoring_evaluator.py`. Per task: best GOOD = chosen, worst BAD = rejected, margin ≥ 0.30 (median 0.477).
- **158 mutator-augmented pairs**: DeepSeek V3.2 (non-Qwen family) writes one strong-prompt chosen per train task; deterministic task-aware mutator corrupts that chosen along all 5 rubric dimensions (must_not tokens, off-segment fingerprint, pricing TCV, banned phrases, wrong-TZ, stacked asks) → rejected at margin ≥ 0.30 (median 0.613).

**Discarded path**: organic strong-vs-weak DeepSeek synthesis. Reason: DeepSeek's safety/quality training sanitises the "weak SDR" prompt — its rejected outputs scored 0.625 median, not bad enough for clean margin separation. The mutator was the working solution. Documented in `training_data/build_log.json`.
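
The selection rule for the v2-anchored pairs (best GOOD as chosen, worst BAD as rejected, margin ≥ 0.30) can be sketched as follows. The function name, the draft ids, and the scores are hypothetical; in the real pipeline the scores come from `scoring_evaluator.py`, whose interface is not shown here.

```python
def build_pair(good_scores, bad_scores, min_margin=0.30):
    """Pick one (chosen, rejected) pair for a single task.

    good_scores / bad_scores map a draft id to its rubric score for that task.
    Returns (chosen_id, rejected_id, margin), or None when the margin is too small.
    """
    chosen_id = max(good_scores, key=good_scores.get)    # best GOOD draft
    rejected_id = min(bad_scores, key=bad_scores.get)    # worst BAD draft
    margin = good_scores[chosen_id] - bad_scores[rejected_id]
    if margin < min_margin:
        return None  # drop the task rather than train on a blurry pair
    return chosen_id, rejected_id, margin

# Hypothetical scores for one train task.
good = {"good_03": 0.91, "good_07": 0.84}
bad = {"bad_02": 0.40, "bad_09": 0.55}
print(build_pair(good, bad))
```

The same margin filter applies to the mutator-augmented pairs, where the rejected side is the corrupted copy of the chosen draft.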

### Leakage policy (Li et al., 2025 four-vector audit)

| Vector | Status |
|---|---|
| Generator/judge family overlap | **Controlled** — labeling judge is rule-based (`scoring_evaluator.py`), no LLM |
| Generator/rewriter base sharing | **Controlled** — chosens authored by human + DeepSeek V3.2; zero Qwen-family pairs |
| Train/test author overlap (same authors, different tasks) | **Controlled** — held_out partition uses same authoring modes but different task instances; SHA-256 manifest sealed |
| Held-out distribution drift | **Partial control** — 8-gram Jaccard < 0.30 + embedding cosine < 0.85; 0 violations |
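
The 8-gram Jaccard half of the drift check can be sketched in a few lines. This assumes lowercased whitespace tokens; the tokenization actually used for the audit is not stated in this card, and the embedding-cosine half is omitted because it needs an embedding model.

```python
def ngrams(text, n=8):
    """Set of word n-grams (assumption: lowercased, whitespace-tokenized)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def ngram_jaccard(a, b, n=8):
    """Jaccard similarity between the n-gram sets of two texts."""
    grams_a, grams_b = ngrams(a, n), ngrams(b, n)
    if not grams_a or not grams_b:
        return 0.0  # texts shorter than n tokens share no n-grams
    return len(grams_a & grams_b) / len(grams_a | grams_b)

def violates(held_out_text, train_texts, threshold=0.30, n=8):
    """True if the held-out text is too close to any train text."""
    return any(ngram_jaccard(held_out_text, t, n) >= threshold for t in train_texts)
```

A held-out task passes this part of the audit when `violates` returns `False` against every train task.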

## Training history

Loss curve (full curve in `training/training_run_seed11.json`):

| Step | Train loss | Val loss | Reward margin | Reward acc |
|---|---|---|---|---|
| 20 | 1.196 | 1.481 | 0.034 | 0.542 |
| 105 (3 ep) | weak | — | −0.06 | 0.542 |
| 200 | 0.060 | 0.220 | 6.170 | 0.969 |
| 220 | 0.008 | 0.197 | 6.349 | 0.969 |
| 280 (final) | 0.107 | 0.187 | **6.496** | **0.969** |

The initial 3-epoch run produced a weak critic (margin −0.06, 54 % acc). Training was resumed for 5 more epochs because val loss was still decreasing monotonically. Train loss bottomed at step 220 (0.008) and rose to 0.107 by step 280, while val loss kept improving (0.197 → 0.187), so val loss shows no overfit signature. Step 280 was selected.

## Held-out evaluation (52 tasks)

Three conditions, all using `unsloth/Qwen3-1.7B` as the base generator:

| Condition | Setup | Mean score ± SD | Wall time / task |
|---|---|---|---|
| **baseline** | adapter OFF, neutral system + task `style_guide_excerpts`, k=1, T=0 | 0.6618 ± 0.20 | 19.4 s |
| **critic_rs** | adapter OFF for gen (k=4 @ T=0.7), adapter ON for ranking via length-normalized log-prob × β=2.0; top-1 selected | 0.6643 ± 0.21 | 23.5 s |
| **promptaug** | adapter OFF, condensed v2 style guide (~900 tok) in system prompt, k=1, T=0 | 0.6472 ± 0.20 | 20.1 s |
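
The `critic_rs` ranking step can be sketched as below, assuming the per-token log-probabilities of each candidate under the adapter-ON model are already available. The function names and the toy log-probs are illustrative assumptions, not the code in `ablations/run_ablations.py`.

```python
def critic_score(token_logprobs, beta=2.0):
    """Length-normalized log-prob scaled by beta (the SimPO implicit reward)."""
    return beta * sum(token_logprobs) / len(token_logprobs)

def select_top1(candidates, beta=2.0):
    """candidates: list of (text, token_logprobs); returns the highest-scoring text."""
    best_text, _ = max(candidates, key=lambda c: critic_score(c[1], beta))
    return best_text

# Toy pool of k=4 drafts with made-up token log-probs.
pool = [
    ("draft A", [-0.9, -1.1, -1.0]),
    ("draft B", [-0.4, -0.6, -0.5, -0.5]),
    ("draft C", [-2.0, -1.0]),
    ("draft D", [-0.7, -0.8, -0.9]),
]
print(select_top1(pool))
```

Because β is a positive constant, it rescales the scores without changing which candidate ranks first; length normalization is what keeps longer drafts from being penalized simply for having more tokens.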

### Paired-bootstrap deltas (B = 10 000, seed=11)

| Comparison | Mean Δ | 95 % CI | p (one-sided) | W / L / T |
|---|---|---|---|---|
| **Δ A** (critic_rs − baseline) | +0.0025 | [−0.019, +0.023] | 0.40 | 19 / 16 / **17** |
| **Δ B** (critic_rs − promptaug) | +0.0171 | [−0.006, +0.042] | 0.08 | 22 / 17 / 13 |
| Δ C (τ²-Bench retail) | informational only — reused from `week_10/seed/baseline_numbers.md` per chal.md (no re-run) | — | — | — |
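
A paired bootstrap over per-task score differences can be sketched as follows. This is a generic percentile bootstrap under stated assumptions (resample tasks with replacement, one-sided p as the share of bootstrap means at or below zero); the exact procedure in `ablations/compute_deltas.py` may differ, and the scores in the example are made up.

```python
import random

def paired_bootstrap(scores_a, scores_b, n_boot=10_000, seed=11):
    """Bootstrap the mean per-task difference (a - b) by resampling tasks."""
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    means = []
    for _ in range(n_boot):
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lo = means[int(0.025 * n_boot)]          # 2.5th percentile
    hi = means[int(0.975 * n_boot) - 1]      # 97.5th percentile
    p_one_sided = sum(m <= 0 for m in means) / n_boot
    wins = sum(d > 0 for d in diffs)
    losses = sum(d < 0 for d in diffs)
    return {
        "mean": sum(diffs) / n,
        "ci95": (lo, hi),
        "p": p_one_sided,
        "wlt": (wins, losses, n - wins - losses),
    }

# Made-up per-task scores for two conditions on the same six tasks.
result = paired_bootstrap([0.7, 0.6, 0.5, 0.9, 0.6, 0.4],
                          [0.6, 0.6, 0.6, 0.5, 0.6, 0.5])
print(result["wlt"], result["ci95"])
```

Pairing matters here: both conditions are scored on the same 52 tasks, so resampling task-level differences removes between-task variance that would otherwise swamp a small lift.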

### Per-stratum breakdown

| Stratum | n | baseline | critic_rs | promptaug | Best |
|---|---|---|---|---|---|
| ADV (hand-authored adversarial) | 9 | 0.657 | 0.658 | **0.711** | promptaug (+5.4pp) |
| PRG (combinatorial programmatic) | 22 | 0.703 | **0.732** | 0.687 | critic_rs (+2.9pp) |
| SYN (multi-LLM synthesis) | 10 | **0.831** | 0.802 | 0.763 | baseline |
| TRC (trace-derived) | 11 | **0.429** | 0.408 | 0.411 | baseline (all weak) |

The critic concentrates its lift on the **PRG** stratum (mechanical rubric features it was trained to recognise). Promptaug concentrates its lift on **ADV** (where the v2 rule list helps the base model handle edge cases). Both interventions hurt on **SYN** (well-formed cases the base handles cleanly already) and **TRC** (adversarial trace items defeat all three).
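
As a consistency check, the stratum means weighted by `n` should recover the overall means in the three-condition table. The sketch below uses only numbers from this card; small differences in the fourth decimal are expected because the stratum means are rounded to three places.

```python
# stratum: (n, baseline, critic_rs, promptaug), copied from the table above
strata = {
    "ADV": (9, 0.657, 0.658, 0.711),
    "PRG": (22, 0.703, 0.732, 0.687),
    "SYN": (10, 0.831, 0.802, 0.763),
    "TRC": (11, 0.429, 0.408, 0.411),
}
reported = {"baseline": 0.6618, "critic_rs": 0.6643, "promptaug": 0.6472}

total_n = sum(row[0] for row in strata.values())

def pooled_mean(column):
    """n-weighted mean of one condition column across strata."""
    return sum(row[0] * row[column] for row in strata.values()) / total_n

pooled = {name: pooled_mean(i) for i, name in enumerate(reported, start=1)}
for name, value in pooled.items():
    print(f"{name}: pooled={value:.4f} reported={reported[name]:.4f}")
```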

### Cost-Pareto

| Condition | Mean fwd passes | Cost ratio vs baseline | Absolute quality lift vs baseline |
|---|---|---|---|
| baseline | 1.0 | 1.00× | 0.000 |
| critic_rs | 5.0 | 1.21× | +0.0025 |
| promptaug | 1.0 | 1.04× | −0.0146 |

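`critic_rs` spends 1.21× the compute for a lift whose 95 % CI spans zero, so it is **effectively dominated** by baseline even though its point estimate is marginally higher. `promptaug` is **strictly Pareto-dominated**: higher cost and a lower mean score.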
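
The cost ratios in the table match the wall-clock time per task from the held-out evaluation table divided by the baseline's; that is an inference from the numbers, not something the ablation scripts are confirmed to do. A minimal sketch that recomputes the ratios and checks dominance on point estimates only (the CI-based reading of `critic_rs` comes from the bootstrap, not from this check):

```python
# name: (wall seconds per task, mean held-out score), copied from this card
conditions = {
    "baseline": (19.4, 0.6618),
    "critic_rs": (23.5, 0.6643),
    "promptaug": (20.1, 0.6472),
}
base_cost, base_score = conditions["baseline"]

def cost_ratio(name):
    """Wall-clock cost relative to baseline."""
    return conditions[name][0] / base_cost

def lift(name):
    """Absolute mean-score difference vs baseline."""
    return conditions[name][1] - base_score

def dominated_by_baseline(name):
    """Point-estimate dominance: no cheaper, no better, and worse on at least one axis."""
    cost, score = conditions[name]
    no_better = cost >= base_cost and score <= base_score
    strictly_worse = cost > base_cost or score < base_score
    return no_better and strictly_worse

for name in ("critic_rs", "promptaug"):
    print(name, f"{cost_ratio(name):.2f}x", f"{lift(name):+.4f}", dominated_by_baseline(name))
```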

## Why training transfer failed (the memo p2 finding)

Three hypotheses, ordered by trace-data support:

1. **Candidate-pool variance bottleneck (strongest).** 17 / 52 ties on Δ A. Same-generator k=4 candidates at T=0.7 produce outputs the offline rubric scores identically — rejection sampling cannot improve a homogeneous pool. The critic was trained on pairs with margin ≥ 0.30; it never saw a pool this clustered. Future small-N work should scale candidate-pool variance (temperature sweeps, model-family mixing) rather than scaling N alone.
2. **In-distribution overfit (medium).** 96.9 % val acc was on 31 pairs from the same v2-anchored + mutator-augmented distribution. Held-out *bench* candidates are a distribution shift the critic never trained on.
3. **N=306 vs Meng 2024's ~5K studied band (real, but weakest signal here).** Documented limitation; the SimPO paper does not characterise behaviour at this N.

This is the chal.md memo p2 "1 unresolved training failure" finding, not a project failure.

## Intended use

- **Research artifact** demonstrating a leak-controlled SimPO pipeline at small N on a B2B-sales preference task.
- **Educational** — code, data, traces, ablation results all reproduce from `seed=11`.
- **NOT for production B2B-sales generation.** The held-out lift does not clear baseline at 95 % CI on the very bench it was trained for.

## Out-of-scope

- General preference scoring outside Tenacious B2B-sales register.
- Non-English text. Bench is English-only.
- Time-zones outside `{Europe/Berlin, Europe/London, America/New_York, America/Los_Angeles, Africa/Addis_Ababa}` — bench coverage stops there.
- Scoring outputs longer than ~300 words (training pairs were short emails).

## Limitations

| Limitation | Detail |
|---|---|
| Held-out null lift | Δ A 95 % CI crosses zero; do not use as a deployment gate |
| Single seed | Only `seed=11` evaluated; no seed-stability bound |
| Single backbone | Only Qwen3-1.7B; transfer to other Qwen sizes / non-Qwen families unmeasured |
| Single judge | Eval-tier judge is Sonnet 4.6 only; no multi-judge bias decomposition |
| TRC weakness | Trace-derived adversarial tasks score 0.13–0.85 across all conditions — bench harder than critic |
| 17/52 tie rate | Candidate-pool variance is the deployment-time bottleneck |

## Environmental cost

- Training compute: 35.9 min × 1 × Tesla T4 ≈ **25 Wh**.
- Held-out evaluation: ~50 min × 1 × Tesla T4 ≈ **34 Wh**.
- Inference per critic_rs call: ~5 forward passes at fp16 on Qwen3-1.7B.
- Total project compute: well under 1 kWh.
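
The energy figures above are time multiplied by average power. The sketch below back-computes the average draw they imply; the card does not state the power figure it used, so the implied wattage is an inference. For scale, NVIDIA lists the T4's maximum power at 70 W.

```python
def energy_wh(minutes, avg_watts):
    """Energy in watt-hours for a run of the given length and average draw."""
    return minutes / 60 * avg_watts

def implied_watts(minutes, watt_hours):
    """Average draw implied by a reported energy figure."""
    return watt_hours / (minutes / 60)

train_watts = implied_watts(35.9, 25)   # training run
eval_watts = implied_watts(50, 34)      # held-out evaluation (duration is approximate)
print(f"training: {train_watts:.1f} W, evaluation: {eval_watts:.1f} W")

# Upper bound if the GPU had run at its 70 W maximum for the whole training run.
print(f"training at 70 W: {energy_wh(35.9, 70):.1f} Wh")
```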

## Reproducibility

```bash
# Reproduce training (Colab T4)
jupyter notebook tenacious_critic_simpo_seed11.ipynb

# Reproduce held-out ablations (Colab T4)
python ablations/run_ablations.py --condition all --out ablations/held_out_traces.jsonl
python ablations/compute_deltas.py --traces ablations/held_out_traces.jsonl --out ablations/ablation_results.json
```

`seed=11` everywhere. `TENACIOUS_BENCH_OFFLINE=1` reproduces the labeling pipeline without API calls.

## Citation

```bibtex
@misc{nahom2026tenacious_critic,
  author       = {Nahom, A.},
  title        = {Tenacious-Critic-Qwen3-1.7B-SimPO-seed11},
  year         = {2026},
  note         = {LoRA SimPO preference critic, 10academy TRP1 Week 11 final submission},
  howpublished = {HuggingFace Hub, \url{https://huggingface.co/nahdes/tenacious-critic-qwen3-1-7b-simpo-seed11}}
}
```

References:

- Meng, Y., Xia, M., Chen, D. (2024). *SimPO: Simple Preference Optimization with a Reference-Free Reward.* NeurIPS 2024.
- Li et al. (2025). *Preference Leakage: A Contamination Problem in LLM-as-a-Judge.*
- See `synthesis_memos/` for opinionated commentary on each.