---
license: cc-by-4.0
language:
- en
base_model: unsloth/Qwen3-1.7B
library_name: peft
pipeline_tag: text-generation
tags:
- lora
- simpo
- preference-optimization
- rejection-sampling
- sales
- b2b
- qwen3
- tenacious
datasets:
- nahdes/tenacious-bench-v0.1
---

# Tenacious-Critic-Qwen3-1.7B-SimPO-seed11

A LoRA SimPO preference critic, trained for 35.9 minutes on a free Colab T4, and **honestly underperforming** on the held-out partition of the bench it was trained for.

| Field | Value |
|---|---|
| Repo (planned) | `nahdes/tenacious-critic-qwen3-1-7b-simpo-seed11` |
| Backbone | `unsloth/Qwen3-1.7B` |
| Adapter | LoRA, rank 16, alpha 32, 7 modules (q/k/v/o + gate/up/down) |
| Training method | SimPO (Meng et al., 2024): reference-free, length-normalized |
| Pairs | 306 (148 v2-anchored + 158 mutator-augmented) |
| Compute | Colab T4 (Tesla T4, 16 GB), free tier |
| Wall clock | 35.9 min |
| Trainable params | 17.4 M / 1.74 B (1.00 %) |
| Precision | fp16 |
| Seed | 11 |
| License | CC-BY-4.0 (adapter weights) |
| Companion dataset | `nahdes/tenacious-bench-v0.1` |

---

## Headline numbers

Two numbers, both honest, in tension:

| Metric | Value | Setting |
|---|---|---|
| **In-distribution validation accuracy** | **96.9 %** | 31 held-out preference pairs (same construction pipeline as training) |
| **Reward margin (chosen − rejected)** | **+6.496** | Same 31 val pairs, step 280 |
| **Held-out task-level lift over baseline** | **+0.0025**, 95% CI [−0.019, +0.023], p=0.40 | 52 sealed Tenacious-Bench v0.1 tasks, paired bootstrap, B=10000 |

The critic learned the labeling rubric (96.9 % pair acc, +6.50 margin). It does **not** add task-level lift over a same-backbone single-shot baseline at 95 % CI. Use only as a research artifact.

**Deploy recommendation: NO-DEPLOY** for the held-out partition.

---

## Why was this trained?
The Tenacious-Bench v0.1 project audits five Tenacious-only failure modes that τ²-Bench retail cannot probe (signal over-claiming, tone drift, ICP misclassification, AI-maturity-gated mispitch, EAT↔EU/US scheduling). Week 10 trace evidence (probe IDs `probe_20260423T191351` rows 9, 11, 21 in `run_log.jsonl`) characterized these as **inconsistency** failures: the agent gets it right most of the time but cannot tell when it's wrong. That diagnosis ruled out generation-quality (Path A) and trajectory (Path C) treatments and selected Path B: a preference-tuned critic deployed as a rejection-sampling layer.

## Backbone substitution

`chal.md` named "Qwen 3.5" (0.8B / 2B / 4B band) as the eligible backbone. As of 2026-05-01, **the Qwen 3.5 family had not been released**. Substituted with `unsloth/Qwen3-1.7B`, the closest current open-weight match within the spec's parameter band. All other hyperparameters from the original spec are preserved.

## Hyperparameters (Meng 2024 §4.3 verbatim)

| Param | Value |
|---|---|
| `loss_type` | `simpo` |
| `beta` (β) | 2.0 |
| `simpo_gamma` (γ) | 1.0 |
| `learning_rate` | 5e-6 |
| `lr_scheduler` | cosine |
| `warmup_ratio` | 0.1 |
| `epochs` | 8 (extended from 3; see training history) |
| `total_steps` | 280 |
| `effective_batch` | 8 (per_device 2 × grad_accum 4) |
| `max_seq_len` | 2048 |
| `max_prompt_len` | 1536 |
| `lora_r` / `lora_alpha` / `lora_dropout` | 16 / 32 / 0.0 |
| `target_modules` | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |

## Training data

306 preference pairs, all anchored to `train/` partition only (zero held-out leak):

- **148 v2 style-guide-anchored pairs**: 12 GOOD + 12 BAD hand-labeled drafts (`Tenacious_Style_Guide_v2.md`) scored against every train task with `scoring_evaluator.py`. Per task: best GOOD = chosen, worst BAD = rejected, margin ≥ 0.30 (median 0.477).
- **158 mutator-augmented pairs**: DeepSeek V3.2 (non-Qwen family) writes one strong-prompt chosen per train task; deterministic task-aware mutator corrupts that chosen along all 5 rubric dimensions (must_not tokens, off-segment fingerprint, pricing TCV, banned phrases, wrong-TZ, stacked asks) → rejected at margin ≥ 0.30 (median 0.613).

**Discarded path**: organic strong-vs-weak DeepSeek synthesis. Reason: DeepSeek's safety/quality training sanitises the "weak SDR" prompt, so its rejected outputs scored 0.625 median, not bad enough for clean margin separation. The mutator was the working solution. Documented in `training_data/build_log.json`.

### Leakage policy (Li et al., 2025 four-vector audit)

| Vector | Status |
|---|---|
| Generator/judge family overlap | **Controlled**: labeling judge is rule-based (`scoring_evaluator.py`), no LLM |
| Generator/rewriter base sharing | **Controlled**: chosens authored by human + DeepSeek V3.2; zero Qwen-family pairs |
| Train/test author overlap (same authors, different tasks) | **Controlled**: held_out partition uses same authoring modes but different task instances; SHA-256 manifest sealed |
| Held-out distribution drift | **Partial control**: 8-gram Jaccard < 0.30 + embedding cosine < 0.85; 0 violations |

## Training history

Loss curve (full curve in `training/training_run_seed11.json`):

| Step | Train loss | Val loss | Reward margin | Reward acc |
|---|---|---|---|---|
| 20 | 1.196 | 1.481 | 0.034 | 0.542 |
| 105 (3 ep) | weak | n/a | −0.06 | 0.542 |
| 200 | 0.060 | 0.220 | 6.170 | 0.969 |
| 220 | 0.008 | 0.197 | 6.349 | 0.969 |
| 280 (final) | 0.107 | 0.187 | **6.496** | **0.969** |

Initial 3-epoch run produced a weak critic (margin −0.06, 54 % acc). Resumed for 5 more epochs because val loss was monotonically decreasing. Train loss bottomed at step 220 (0.008) and drifted up slightly through step 280; val loss kept improving with no overfit signature. Step 280 selected.
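
To relate the reward-margin column in the training history to the SimPO objective, here is a minimal sketch of the loss as formulated in Meng et al. (2024), using this card's β = 2.0 and γ = 1.0. It assumes the logged reward margin is β times the difference in length-normalized log-probabilities; the function name `simpo_loss_and_margin` and its list-of-floats inputs are hypothetical, and this is not the project's training code.

```python
import math


def simpo_loss_and_margin(chosen_token_logps, rejected_token_logps,
                          beta=2.0, gamma=1.0):
    """Return (loss, reward_margin) for one preference pair.

    Inputs are per-token log-probabilities of each response under the
    policy. Hypothetical helper for illustration only.
    """
    # Length-normalized (average) log-probability of each response.
    avg_chosen = sum(chosen_token_logps) / len(chosen_token_logps)
    avg_rejected = sum(rejected_token_logps) / len(rejected_token_logps)

    # Implicit rewards are beta * average log-prob; no reference model.
    margin = beta * (avg_chosen - avg_rejected)

    # SimPO loss: -log(sigmoid(margin - gamma)), in a numerically stable form.
    z = margin - gamma
    if z >= 0:
        loss = math.log1p(math.exp(-z))
    else:
        loss = -z + math.log1p(math.exp(z))
    return loss, margin
```

Under this formulation the loss only pushes the margin past γ, so pairs whose margin already sits well above 1.0 contribute very little gradient.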
## Held-out evaluation (52 tasks)

Three conditions, all using `unsloth/Qwen3-1.7B` as the base generator:

| Condition | Setup | Mean ± SD | Wall/task |
|---|---|---|---|
| **baseline** | adapter OFF, neutral system + task `style_guide_excerpts`, k=1, T=0 | 0.6618 ± 0.20 | 19.4 s |
| **critic_rs** | adapter OFF for gen (k=4 @ T=0.7), adapter ON for ranking via length-normalized log-prob × β=2.0; top-1 selected | 0.6643 ± 0.21 | 23.5 s |
| **promptaug** | adapter OFF, condensed v2 style guide (~900 tok) in system prompt, k=1, T=0 | 0.6472 ± 0.20 | 20.1 s |

### Paired-bootstrap deltas (B = 10 000, seed=11)

| Comparison | Mean Δ | 95 % CI | p (one-sided) | W / L / T |
|---|---|---|---|---|
| **Δ A** (critic_rs − baseline) | +0.0025 | [−0.019, +0.023] | 0.40 | 19 / 16 / **17** |
| **Δ B** (critic_rs − promptaug) | +0.0171 | [−0.006, +0.042] | 0.08 | 22 / 17 / 13 |
| Δ C (τ²-Bench retail) | informational only; reused from `week_10/seed/baseline_numbers.md` per chal.md (no re-run) | n/a | n/a | n/a |

### Per-stratum breakdown

| Stratum | n | baseline | critic_rs | promptaug | Best |
|---|---|---|---|---|---|
| ADV (hand-authored adversarial) | 9 | 0.657 | 0.658 | **0.711** | promptaug (+5.4pp) |
| PRG (combinatorial programmatic) | 22 | 0.703 | **0.732** | 0.687 | critic_rs (+2.9pp) |
| SYN (multi-LLM synthesis) | 10 | **0.831** | 0.802 | 0.763 | baseline |
| TRC (trace-derived) | 11 | **0.429** | 0.408 | 0.411 | baseline (all weak) |

The critic concentrates its lift on the **PRG** stratum (mechanical rubric features it was trained to recognise). Promptaug concentrates its lift on **ADV** (where the v2 rule list helps the base model handle edge cases). Both interventions hurt on **SYN** (well-formed cases the base handles cleanly already) and **TRC** (adversarial trace items defeat all three).
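
The paired-bootstrap deltas in this section can be approximated with a few lines of standard-library Python. This is a sketch under assumptions, not the project's `ablations/compute_deltas.py`: the function name `paired_bootstrap`, the percentile confidence interval, and the one-sided p-value definition (share of resampled mean deltas at or below zero) are illustrative choices that the card does not specify.

```python
import random


def paired_bootstrap(treatment, control, n_resamples=10_000, seed=11):
    """Paired bootstrap over per-task score differences.

    `treatment` and `control` are per-task scores in the same task order.
    Hypothetical helper; the CI and p-value conventions are assumptions.
    """
    if len(treatment) != len(control) or not treatment:
        raise ValueError("need two equal-length, non-empty score lists")

    # Pairing: work on per-task differences, not on the two score lists.
    diffs = [t - c for t, c in zip(treatment, control)]
    n = len(diffs)
    rng = random.Random(seed)

    # Resample tasks with replacement and record each resample's mean delta.
    boot_means = sorted(
        sum(rng.choice(diffs) for _ in range(n)) / n
        for _ in range(n_resamples)
    )

    # Percentile 95 % interval over the sorted bootstrap means.
    lo = boot_means[int(0.025 * (n_resamples - 1))]
    hi = boot_means[int(0.975 * (n_resamples - 1))]

    return {
        "mean_delta": sum(diffs) / n,
        "ci95": (lo, hi),
        # Assumed one-sided p: fraction of resamples with no positive lift.
        "p_one_sided": sum(m <= 0 for m in boot_means) / n_resamples,
        # Exact float comparison: a tie means identical rubric scores.
        "wins": sum(d > 0 for d in diffs),
        "losses": sum(d < 0 for d in diffs),
        "ties": sum(d == 0 for d in diffs),
    }
```

Because resampling operates on differences, tied tasks contribute exact zeros to every resample, which is one way a high tie count pulls the interval toward zero.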
### Cost-Pareto

| Condition | Mean fwd passes | Cost ratio vs baseline (wall-clock) | Quality lift abs |
|---|---|---|---|
| baseline | 1.0 | 1.00× | 0.000 |
| critic_rs | 5.0 | 1.21× | +0.0025 |
| promptaug | 1.0 | 1.04× | −0.0146 |

`critic_rs` is **strictly Pareto-dominated** by baseline at 95 % CI: 1.21× wall-clock cost for ≈zero quality lift. `promptaug` is also dominated.

## Why training transfer failed (the memo p2 finding)

Three hypotheses, ordered by trace-data support:

1. **Candidate-pool variance bottleneck (strongest).** 17 / 52 ties on Δ A. Same-generator k=4 candidates at T=0.7 produce outputs the offline rubric scores identically, and rejection sampling cannot improve a homogeneous pool. The critic was trained on pairs with margin ≥ 0.30; it never saw a pool this clustered. Future small-N work should scale candidate-pool variance (temperature sweeps, model-family mixing) rather than scaling N alone.
2. **In-distribution overfit (medium).** 96.9 % val acc was on 31 pairs from the same v2-anchored + mutator-augmented distribution. Held-out *bench* candidates are a distribution shift the critic never trained on.
3. **N=306 vs Meng 2024's ~5K studied band (real, but weakest signal here).** Documented limitation; the SimPO paper does not characterise behaviour at this N.

This is the chal.md memo p2 "1 unresolved training failure" finding, not a project failure.

## Intended use

- **Research artifact** demonstrating a leak-controlled SimPO pipeline at small N on a B2B-sales preference task.
- **Educational**: code, data, traces, ablation results all reproduce from `seed=11`.
- **NOT for production B2B-sales generation.** The held-out lift does not clear baseline at 95 % CI on the very bench it was trained for.

## Out-of-scope

- General preference scoring outside Tenacious B2B-sales register.
- Non-English text. Bench is English-only.
- Time-zones outside `{Europe/Berlin, Europe/London, America/New_York, America/Los_Angeles, Africa/Addis_Ababa}`; bench coverage stops there.
- Scoring outputs longer than ~300 words (training pairs were short emails).

## Limitations

| Limitation | Detail |
|---|---|
| Held-out null lift | Δ A 95 % CI crosses zero; do not use as a deployment gate |
| Single seed | Only `seed=11` evaluated; no seed-stability bound |
| Single backbone | Only Qwen3-1.7B; transfer to other Qwen sizes / non-Qwen families unmeasured |
| Single judge | Eval-tier judge is Sonnet 4.6 only; no multi-judge bias decomposition |
| TRC weakness | Trace-derived adversarial tasks score 0.13–0.85 across all conditions; bench harder than critic |
| 17/52 tie rate | Candidate-pool variance is the deployment-time bottleneck |

## Environmental cost

- Training compute: 35.9 min × 1 × Tesla T4 ≈ **25 Wh**.
- Held-out evaluation: ~50 min × 1 × Tesla T4 ≈ **34 Wh**.
- Inference per critic_rs call: ~5 forward passes at fp16 on Qwen3-1.7B.
- Total project compute: well under 1 kWh.

## Reproducibility

```bash
# Reproduce training (Colab T4)
jupyter notebook tenacious_critic_simpo_seed11.ipynb

# Reproduce held-out ablations (Colab T4)
python ablations/run_ablations.py --condition all --out ablations/held_out_traces.jsonl
python ablations/compute_deltas.py --traces ablations/held_out_traces.jsonl --out ablations/ablation_results.json
```

`seed=11` everywhere. `TENACIOUS_BENCH_OFFLINE=1` reproduces the labeling pipeline without API calls.

## Citation

```bibtex
@misc{nahom2026tenacious_critic,
  author = {Nahom, A.},
  title = {Tenacious-Critic-Qwen3-1.7B-SimPO-seed11},
  year = {2026},
  note = {LoRA SimPO preference critic, 10academy TRP1 Week 11 final submission},
  howpublished = {HuggingFace Hub, \url{https://huggingface.co/nahdes/tenacious-critic-qwen3-1-7b-simpo-seed11}}
}
```

References:

- Meng, Y., Xia, M., Chen, D. (2024). *SimPO: Simple Preference Optimization with a Reference-Free Reward.* NeurIPS 2024.
- Li et al. (2025). *Preference Leakage: A Contamination Problem in LLM-as-a-Judge.*
- See `synthesis_memos/` for opinionated commentary on each.