# HellaSwag Benchmark Audit: heretic-cerebellum-v1

- **Date:** 2026-06-11
- **Auditor:** Adversarial automated audit (adversarial-hellaswag-audit)
- **Benchmark run timestamp:** 2026-06-11 16:02
- **File audited:** `heretic-cerebellum-v1_hellaswag_detailed.jsonl`
- **Reported score:** 91.78%

---

## Verdict: TRUSTWORTHY

All six check categories pass. The reported score of 91.78% is correct. No artifacts, fallbacks, transport failures, or cache contamination were detected.

---

## Section 1: Schema Reconnaissance

**Fields present in every entry (10042/10042):**

| Field | Type | Sample values |
|-------|------|---------------|
| `context` | str | "A man is sitting on a roof. he" |
| `endings` | list[str] | 4 continuation options |
| `expected` | str | "D", "C", "B", "A" |
| `predicted` | str | "D", "C", "B", "A" |
| `raw_response` | str | "D" (exactly 1 char, always) |
| `correct` | bool | true / false |
| `error` | null | always null |

**No timestamp field.** Entries are in sequential dataset order.

**raw_response length distribution:** All 10042 entries have `len(raw_response) == 1`. The model returned a single letter for every question, so there was no verbose output to parse.

---

## Section 2: Accuracy Recount

Two independent recounts were compared against the reported figure:

| Method | Correct | Total | Accuracy |
|--------|---------|-------|----------|
| `correct == True` field | 9217 | 10042 | 91.7845% |
| `predicted == expected` comparison | 9217 | 10042 | 91.7845% |
| Reported in `results.json` | n/a | 10042 | 91.78% |

**Delta from reported: 0.0045 percentage points**, which is within two-decimal rounding. The reported 91.78% is accurate.

There are zero disagreements between the `correct` boolean field and the computed `predicted == expected` result across all 10042 entries. The bookkeeping is internally consistent.

---

## Section 3: Empty / Garbage Response Audit

All 10042 entries were scanned.
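The Section 2 recount and the Section 3 scan can be reproduced in a single pass over the JSONL. The following is a minimal sketch, not the audit script itself: the function names `load_jsonl` and `audit_entries` and the keys of the returned dictionary are hypothetical, and only the field names (`raw_response`, `predicted`, `expected`, `correct`, `error`) come from the schema in Section 1.

```python
import json


def load_jsonl(path):
    """Read one JSON object per non-blank line."""
    with open(path, encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]


def audit_entries(entries):
    """Recount accuracy two ways and count the Section 3 failure signatures."""
    stats = {
        "total": 0,
        "flag_correct": 0,        # entries whose `correct` field is true
        "recomputed_correct": 0,  # entries where predicted == expected
        "disagreements": 0,       # the two methods above differ
        "empty_raw": 0,           # empty or whitespace-only raw_response
        "error_set": 0,           # non-null error field
        "letter_absent": 0,       # predicted letter not found in raw_response
    }
    for entry in entries:
        raw = entry.get("raw_response") or ""
        predicted = entry.get("predicted") or ""
        flagged = entry.get("correct") is True
        recomputed = predicted != "" and predicted == entry.get("expected")

        stats["total"] += 1
        stats["flag_correct"] += flagged
        stats["recomputed_correct"] += recomputed
        stats["disagreements"] += flagged != recomputed
        if not raw.strip():
            stats["empty_raw"] += 1
        if entry.get("error") is not None:
            stats["error_set"] += 1
        # An empty raw_response also counts here, so the two counters can overlap.
        if not predicted or predicted not in raw:
            stats["letter_absent"] += 1
    return stats


# Hypothetical usage against the audited file:
# stats = audit_entries(load_jsonl("heretic-cerebellum-v1_hellaswag_detailed.jsonl"))
# accuracy_pct = 100 * stats["recomputed_correct"] / stats["total"]
```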

| Check | Count | Status |
|-------|-------|--------|
| Empty or whitespace `raw_response` | **0** | CLEAN |
| Non-null `error` field | **0** | CLEAN |
| Predicted letter absent from `raw_response` (silent parse-fallback signature) | **0** | CLEAN |

The historical "empty-response fallback to option A" bug that cost 108 entries in a prior run is **not present here**. Every entry has a single-char raw response matching the predicted answer exactly.

---

## Section 4: Wrong-Answer Audit

- **Total wrong entries:** 825
- **Sample:** 80 entries (`random.seed(42)`)

| Classification | Count | Pct of sample |
|----------------|-------|---------------|
| REAL (genuine model error) | 80 | 100% |
| ARTIFACT (script/transport bug) | 0 | 0% |

### 3 Representative REAL examples

**Entry #4663**

- Context: `[header] How to beat a "tough" person in a fight [title] Make the first move...`
- Model predicted: **A** | Gold: **B**
- Raw response: `'A'`
- Assessment: Model picked a plausible but wrong continuation.

**Entry #618**

- Context: `People are riding camels in a desert area. two individuals that are leading the...`
- Model predicted: **C** | Gold: **D**
- Raw response: `'C'`
- Assessment: Semantically close wrong answer; genuine comprehension error.

**Entry #130**

- Context: `A shot of a cyclist is shown and then it cuts back to the same man in white spea...`
- Model predicted: **C** | Gold: **B**
- Raw response: `'C'`
- Assessment: Genuine model error on an ambiguous video-description question.

**No artifacts found in the 80-entry sample.** All wrong answers show single-letter responses that are plausible wrong choices from the option set.

---

## Section 5: Answer Distribution

No fallback-to-first-option bias detected. Expected ~25% per option; all within normal variance.
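The distribution check in this section can be sketched as follows. This is an illustrative version under stated assumptions: the helper names `option_shares` and `breaching_options` are hypothetical, and the only values taken from the audit are the field names and the 35% threshold.

```python
from collections import Counter


def option_shares(entries, field="predicted", options="ABCD"):
    """Return the fraction of entries that carry each option letter in `field`."""
    if not entries:
        raise ValueError("no entries to summarise")
    counts = Counter(entry[field] for entry in entries)
    total = len(entries)
    return {opt: counts.get(opt, 0) / total for opt in options}


def breaching_options(shares, threshold=0.35):
    """List options whose share exceeds the fallback-bias threshold."""
    return [opt for opt, share in shares.items() if share > threshold]


# Hypothetical usage, once for model picks and once for gold labels:
# model_shares = option_shares(entries, field="predicted")
# gold_shares = option_shares(entries, field="expected")
# flagged = breaching_options(model_shares)
```

Running the same function on `predicted` and `expected` keeps the two tables comparable, since both are computed over the same denominator.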
### Model picks

| Option | Count | Percentage | Flag |
|--------|-------|------------|------|
| A | 2492 | 24.82% | none |
| B | 2403 | 23.93% | none |
| C | 2543 | 25.32% | none |
| D | 2604 | 25.93% | none |

### Gold distribution

| Option | Count | Percentage |
|--------|-------|------------|
| A | 2515 | 25.04% |
| B | 2485 | 24.75% |
| C | 2584 | 25.73% |
| D | 2458 | 24.48% |

**The 35% threshold was not breached by any option.** The model's answer distribution stays within about 1.5 percentage points of the gold distribution for every option, which is the expected signature of genuine per-question answers rather than a fallback pattern.

---

## Section 6: Timing / Contiguous Failure Check

No timestamp field is present in the JSONL; sequential position is the only ordering available.

| Metric | Value |
|--------|-------|
| Longest consecutive wrong-answer streak | **5** (entries #967–971) |
| Any streak >= 10 (transport-failure threshold) | **None** |

**The streak of 5 was manually inspected.** All five entries show valid single-letter predictions for distinct questions, each a wrong but plausible answer:

- Entry 967: predicted B, gold D (newscast/gymnastics)
- Entry 968: predicted D, gold B (same topic as entry 967, different question)
- Entry 969: predicted C, gold B (toothbrush/bathroom)
- Entry 970: predicted A, gold B (harmonica player)
- Entry 971: predicted B, gold C (skateboarding)

No contexts or endings are repeated across the five entries; they are independent questions. The streak is noise, not a transport stall.

---

## Section 7: Meta / Cache Verification

Contents of `heretic-cerebellum-v1_meta.json`:

```json
{
  "model_size": "unknown",
  "model_name": "heretic-cerebellum-v1",
  "port": 7890
}
```

**Observations:**

- `model_name` matches the filename prefix (`heretic-cerebellum-v1`). No cache contamination from a different model identity.
- `model_size: "unknown"` is incomplete but not a contradiction; the GGUF filename would be the authoritative source.
- `port: 7890` is consistent with a dedicated bench server (not the default 8080). This port is distinct from the production inference server (7800), which is correct practice.
- No model path or SHA fingerprint is stored, so hardware-level fingerprint verification is not possible from this file alone. This is a metadata weakness but does not contradict the run data.

**Cache contamination check:** The JSONL filename, results JSON, and meta JSON all reference `heretic-cerebellum-v1` consistently. The run timestamp (16:02) matches the JSONL mtime. No evidence of a stale cache from a different model.

---

## Summary Table

| Check | Result | Detail |
|-------|--------|--------|
| Entry count | PASS | 10042 entries, 0 parse errors |
| Accuracy recount | PASS | 91.7845% computed vs 91.78% reported (delta 0.0045 percentage points) |
| `correct` field vs computed | PASS | 0 disagreements across all 10042 entries |
| Empty `raw_response` | PASS | 0 empty entries |
| Error field set | PASS | 0 error-flagged entries |
| Silent parse fallback | PASS | 0 entries where the predicted letter is absent from `raw_response` |
| Wrong-answer artifacts (80-entry sample) | PASS | 0 artifacts, 80 REAL |
| Answer distribution (model) | PASS | A: 24.82%, B: 23.93%, C: 25.32%, D: 25.93% (max 25.93%, below the 35% threshold) |
| Answer distribution (gold) | PASS | Near-uniform, no anomalies |
| Contiguous wrong streak | PASS | Max streak = 5, below the 10-entry threshold |
| Meta model identity | PASS | `model_name` matches filename |
| Cache contamination | PASS | No cross-model identity mismatch across JSONL, results, and meta files |

---

## Final Verdict

**TRUSTWORTHY**

- **Recount accuracy:** 91.7845%, which rounds to the reported 91.78%
- **Empty response count:** 0
- **Artifact count in 80-entry wrong-answer sample:** 0
- **Cache contamination flag:** None
- **Corrected score:** Not needed; the reported score is accurate

The score can be published as-is.
The only metadata gap is the absence of a model path or SHA in `meta.json`; future runs should record the GGUF path for traceability. This is a bookkeeping recommendation, not a validity concern.
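
One way to close this gap is to write the model path and a content hash into the meta file when the run starts. The sketch below is an assumption about how that could look, not part of the existing bench tooling: the function names and the `model_path` and `model_sha256` keys are hypothetical, and SHA-256 is chosen here only because it is available in the Python standard library.

```python
import hashlib
import json
from pathlib import Path


def sha256_of_file(path, chunk_size=1 << 20):
    """Hash a file in chunks so a multi-gigabyte GGUF is never loaded whole."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def add_model_fingerprint(meta_path, model_path):
    """Add the model's resolved path and SHA-256 to an existing meta JSON file."""
    meta_path = Path(meta_path)
    meta = json.loads(meta_path.read_text(encoding="utf-8"))
    meta["model_path"] = str(Path(model_path).resolve())  # hypothetical key
    meta["model_sha256"] = sha256_of_file(model_path)     # hypothetical key
    meta_path.write_text(json.dumps(meta, indent=2) + "\n", encoding="utf-8")
    return meta
```

With both fields recorded, a later audit could recompute the hash of the file at `model_path` and compare it to the stored value instead of relying on the model name alone.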