# Benchmark Audit: ARC-Challenge + MMLU-Redux
**Model:** heretic-cerebellum-v1  
**Auditor:** adversarial / automated  
**Audit date:** 2026-06-11  
**Audited files:**
- `heretic-cerebellum-v1_arc_detailed.jsonl` (1172 entries)
- `heretic-cerebellum-v1_mmlu_redux_detailed.jsonl` (2400 entries)

---

## Verdict

| Benchmark | Verdict | Reported | Recount | Artifact errors |
|-----------|---------|----------|---------|-----------------|
| ARC-Challenge | **TRUSTWORTHY** | 95.48% | 95.48% | 0 |
| MMLU-Redux | **TRUSTWORTHY** | 75.42% | 75.42% | 0 |

No artifacts, no parse failures, no label-format bugs, no truncation signature detected. All 1172 ARC and 2400 MMLU entries are internally consistent.

---

## 1. Schema Reconnaissance

Both files use the same schema per line:

```json
{
  "question": "...",
  "choices": ["A text", "B text", "C text", "D text"],
  "expected": "C",
  "predicted": "C",
  "raw_response": "C",
  "correct": true,
  "error": null
}
```

MMLU adds a `"subject"` field. The `raw_response` field stores exactly what the model returned — in every entry across both benchmarks this is a single uppercase letter (A/B/C/D). There are no multi-token completions, no reasoning traces, no chain-of-thought artifacts.
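A minimal sketch of how the schema claim above could be checked per entry. The helper names (`schema_problems`, `load_jsonl`) are hypothetical and not part of the audited harness; the key set is taken from the example entry shown above.

```python
import json

# Keys taken from the example entry in this section.
REQUIRED_KEYS = {"question", "choices", "expected", "predicted",
                 "raw_response", "correct", "error"}
VALID_LETTERS = {"A", "B", "C", "D"}


def schema_problems(entry, require_subject=False):
    """Return a list of human-readable schema problems for one entry."""
    problems = []
    missing = REQUIRED_KEYS - entry.keys()
    if missing:
        problems.append(f"missing keys: {sorted(missing)}")
    if require_subject and "subject" not in entry:
        problems.append("missing key: subject")
    raw = entry.get("raw_response")
    if not (isinstance(raw, str) and raw in VALID_LETTERS):
        problems.append("raw_response is not a single letter A-D")
    if not isinstance(entry.get("correct"), bool):
        problems.append("correct is not a boolean")
    return problems


def load_jsonl(path):
    """Read one JSON object per non-blank line (hypothetical loader)."""
    with open(path, encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]


sample = json.loads(
    '{"question": "...", "choices": ["A text", "B text", "C text", "D text"],'
    ' "expected": "C", "predicted": "C", "raw_response": "C",'
    ' "correct": true, "error": null}'
)
print(schema_problems(sample))  # prints []
```

For the MMLU file, passing `require_subject=True` would additionally flag entries without a `"subject"` field.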

---

## 2. Aggregate Verification

Recount performed by independently summing `correct == true` flags:

| Benchmark | Total | Correct | Wrong | Recount acc | Reported acc | Match |
|-----------|-------|---------|-------|-------------|--------------|-------|
| ARC | 1172 | 1119 | 53 | **95.48%** | 95.48% | YES |
| MMLU | 2400 | 1810 | 590 | **75.42%** | 75.42% | YES |

Both match to 2 decimal places. The summary JSONs are not lying.
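The recount can be reproduced along these lines. This is a sketch that assumes entries are already loaded as dicts; the `recount` helper is illustrative, and the synthetic lists only mirror the counts in the table above rather than the real files.

```python
def recount(entries):
    """Return (total, correct, wrong, accuracy_pct) from the `correct` flags."""
    total = len(entries)
    correct = sum(1 for entry in entries if entry["correct"] is True)
    wrong = total - correct
    accuracy = round(100 * correct / total, 2) if total else 0.0
    return total, correct, wrong, accuracy


# Synthetic stand-ins with the same correct/wrong counts as the table above.
arc_like = [{"correct": True}] * 1119 + [{"correct": False}] * 53
mmlu_like = [{"correct": True}] * 1810 + [{"correct": False}] * 590

print(recount(arc_like))   # prints (1172, 1119, 53, 95.48)
print(recount(mmlu_like))  # prints (2400, 1810, 590, 75.42)
```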

---

## 3. Wrong-Answer Classification

### ARC-Challenge: all 53 wrong entries

| Class | Count |
|-------|-------|
| REAL_ERROR (model chose wrong letter) | **53** |
| ARTIFACT_EMPTY | 0 |
| ARTIFACT_UNPARSEABLE | 0 |
| ARTIFACT_PARSE_MISMATCH | 0 |
| ARTIFACT_NUMERIC_LABEL | 0 |

**0 artifacts out of 53 wrong answers.**
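The report does not spell out the exact rule behind each class, so the following is only one plausible reading, written as a sketch. The function name `classify_wrong` and every threshold in it are assumptions; in particular, treating a digit in `raw_response` or `expected` as `ARTIFACT_NUMERIC_LABEL` is a guess at what the historical numeric-label bug looked like.

```python
import re


def classify_wrong(entry):
    """Assign one wrong entry to a class (assumed definitions, see lead-in)."""
    raw = (entry.get("raw_response") or "").strip()
    if not raw:
        return "ARTIFACT_EMPTY"
    # Assumption: a numeric label on either side signals the label-format bug.
    if raw.isdigit() or str(entry.get("expected", "")).isdigit():
        return "ARTIFACT_NUMERIC_LABEL"
    # Standalone letters A-D that appear anywhere in the response.
    letters = re.findall(r"\b[A-D]\b", raw.upper())
    if not letters:
        return "ARTIFACT_UNPARSEABLE"
    if entry.get("predicted") not in letters:
        return "ARTIFACT_PARSE_MISMATCH"
    return "REAL_ERROR"


examples = [
    {"raw_response": "B", "predicted": "B", "expected": "C"},
    {"raw_response": "", "predicted": "A", "expected": "C"},
    {"raw_response": "2", "predicted": "A", "expected": "B"},
    {"raw_response": "???", "predicted": "A", "expected": "B"},
    {"raw_response": "I think the answer is C", "predicted": "A", "expected": "C"},
]
for example in examples:
    print(classify_wrong(example))
```

Under these assumed rules, an entry whose raw response is a single letter A-D that differs from `expected` always lands in `REAL_ERROR`, which is the situation reported for all 53 entries.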

### MMLU-Redux: 60-entry random sample (seed=42) of 590 wrong entries

| Class | Count |
|-------|-------|
| REAL_ERROR (model chose wrong letter) | **60** |
| ARTIFACT_EMPTY | 0 |
| ARTIFACT_UNPARSEABLE | 0 |
| ARTIFACT_PARSE_MISMATCH | 0 |
| ARTIFACT_NUMERIC_LABEL | 0 |

**0 artifacts out of 60 sampled wrong answers.** With 0/60 artifacts observed, the 95% Wilson interval for artifact prevalence among the 590 wrong answers is 0–6%. In the most pessimistic reading, about 35 of the 590 wrong answers could be artifacts. If every one of those would have been answered correctly, the corrected score would be 75.42% + (35/2400)*100 ≈ 76.9%, which is an upper bound. Correcting an artifact among the wrong answers can only raise the score, so the reported 75.42% is the floor.
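The interval arithmetic above can be re-derived with the standard Wilson score formula. This is a sketch: `wilson_interval` is an illustrative helper, and `z=1.96` is the usual normal quantile for a two-sided 95% interval.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion, clamped to [0, 1]."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)


low, high = wilson_interval(0, 60)       # 0 artifacts in 60 sampled entries
max_artifacts = high * 590               # pessimistic count among wrong answers
upper_score = 100 * (1810 + max_artifacts) / 2400

print(f"interval: {low:.3f} to {high:.3f}")          # interval: 0.000 to 0.060
print(f"max artifacts: {max_artifacts:.1f}")          # max artifacts: 35.5
print(f"upper bound on score: {upper_score:.1f}%")    # upper bound on score: 76.9%
```

The upper bound assumes every hypothetical artifact hides a correct answer, so it is deliberately generous.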

---

## 4. Distribution Checks

### 4a. Choice distribution (predicted vs gold)

**ARC:**

| Choice | Predicted | Gold | Delta |
|--------|-----------|------|-------|
| A | 269 | 266 | +3 |
| B | 312 | 311 | +1 |
| C | 304 | 310 | -6 |
| D | 287 | 285 | +2 |

All absolute deltas are ≤6 out of 1172 entries. No evidence of the parser defaulting to any single choice.

**MMLU:**

| Choice | Predicted | Gold | Delta |
|--------|-----------|------|-------|
| A | 497 | 537 | -40 |
| B | 614 | 600 | +14 |
| C | 613 | 606 | +7 |
| D | 676 | 657 | +19 |

The model under-picks A and over-picks D relative to gold distribution. This is a model-level tendency, not a parser artifact — A-defaulting (the known parser-fallback bug) would produce the opposite signature (over-picking A).
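The predicted-vs-gold tables can be produced with two counters. A minimal sketch, assuming entries are dicts with `predicted` and `expected` fields as in the schema; `choice_deltas` is an illustrative name and the four demo entries are made up.

```python
from collections import Counter


def choice_deltas(entries, letters="ABCD"):
    """Map each letter to (predicted count, gold count, predicted - gold)."""
    predicted = Counter(entry["predicted"] for entry in entries)
    gold = Counter(entry["expected"] for entry in entries)
    return {
        letter: (predicted[letter], gold[letter], predicted[letter] - gold[letter])
        for letter in letters
    }


demo = [
    {"predicted": "A", "expected": "A"},
    {"predicted": "D", "expected": "A"},
    {"predicted": "B", "expected": "B"},
    {"predicted": "D", "expected": "C"},
]
for letter, (pred, gold, delta) in choice_deltas(demo).items():
    print(f"{letter}: predicted={pred} gold={gold} delta={delta:+d}")
```

When every prediction is one of the four letters, the deltas sum to zero, which is a quick sanity check on tables like the ones above.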

### 4b. Empty/whitespace raw responses across ALL entries

- ARC: **0 / 1172**
- MMLU: **0 / 2400**

No empty responses anywhere.

### 4c. Parsed choice absent from raw_response (all entries with long responses)

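This check targets long responses, where a parser could extract a letter that never appears in the text. No such responses exist: all 3572 entries have single-character raw responses, and the parsed `predicted` field equals `raw_response` in 100% of entries (0 mismatches in either benchmark).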

### 4d. `correct` flag internal consistency

- ARC entries where `correct=True` but `predicted != expected`: **0**
- MMLU entries where `correct=True` but `predicted != expected`: **0**
- ARC entries where `correct=False` but `predicted == expected`: **0**
- MMLU entries where `correct=False` but `predicted == expected`: **0**

The `correct` flag is computed correctly from `predicted == expected` with no exceptions.
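Checks 4b through 4d, plus the `error` field check, can be run in a single pass. This is a sketch under the schema shown in Section 1; `consistency_report` and its counter names are hypothetical.

```python
def consistency_report(entries):
    """Count entries that fail each internal-consistency check."""
    report = {"empty_raw": 0, "parsed_absent": 0, "flag_mismatch": 0, "api_error": 0}
    for entry in entries:
        raw = entry.get("raw_response") or ""
        predicted = entry.get("predicted") or ""
        if not raw.strip():
            report["empty_raw"] += 1
        # Parsed choice missing, or not present anywhere in the raw response.
        if not predicted or predicted not in raw:
            report["parsed_absent"] += 1
        # The flag should equal the comparison it is derived from.
        if entry.get("correct") != (entry.get("predicted") == entry.get("expected")):
            report["flag_mismatch"] += 1
        if entry.get("error") is not None:
            report["api_error"] += 1
    return report


clean = {"raw_response": "C", "predicted": "C", "expected": "C",
         "correct": True, "error": None}
broken = {"raw_response": "", "predicted": "A", "expected": "A",
          "correct": False, "error": "timeout"}
print(consistency_report([clean, broken]))
```

A file matching the findings above would return zero for every counter.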

### 4e. First-option bias among wrong answers

- ARC wrong answers predicted as 'A': 12/53 = **22.6%** (expected if random: 25%)
- MMLU wrong answers predicted as 'A': 112/590 = **19.0%** (expected if random: 25%)

If anything, the model slightly under-picks 'A' when wrong — no first-option parser bias.
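The first-option share is a simple ratio over wrong entries. A short sketch, with `first_option_share` as an illustrative helper and synthetic entries that only mirror the ARC counts quoted above:

```python
def first_option_share(entries, first="A"):
    """Return (hits, wrong_total, percent) for wrong answers predicted as `first`."""
    wrong = [entry for entry in entries if not entry["correct"]]
    if not wrong:
        return 0, 0, 0.0
    hits = sum(1 for entry in wrong if entry["predicted"] == first)
    return hits, len(wrong), round(100 * hits / len(wrong), 1)


# Synthetic wrong answers with the same A-count as the ARC figure above.
arc_wrong_like = (
    [{"correct": False, "predicted": "A"}] * 12
    + [{"correct": False, "predicted": "B"}] * 41
)
print(first_option_share(arc_wrong_like))  # prints (12, 53, 22.6)
```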

---

## 5. Truncation Analysis (MMLU)

Prompt length was approximated as `len(question) + sum(len(choice) for choice in choices)`. This is a character-count proxy for the actual tokenized prompt (it ignores tokenization and any prompt template), but it should preserve the long-vs-short ordering that a truncation check depends on.

**Wrong rate by prompt-length decile:**

| Decile | Len range (chars) | Wrong rate |
|--------|-------------------|------------|
| 1 (shortest) | 17–101 | 24.2% |
| 2 | 101–129 | 23.8% |
| 3 | 129–154 | 29.2% |
| 4 | 154–184 | 25.4% |
| 5 | 184–219 | 25.4% |
| 6 | 219–259 | 22.1% |
| 7 | 259–315 | 25.0% |
| 8 | 316–382 | 25.0% |
| 9 | 382–489 | 25.0% |
| 10 (longest) | 489–4872 | **20.8%** |

**No truncation signature.** The longest decile (489–4872 chars, including a 4872-char outlier) has the *lowest* wrong rate (20.8%), not the highest. If context truncation were occurring, deciles 9–10 would show elevated error rates. Wrong rates are roughly flat across deciles (20.8% to 29.2%), with decile 3 as the minor high point (29.2%). That bump is more plausibly driven by subject difficulty than by length, though this was not verified here.

**Mean prompt length of correct vs wrong answers:**
- Correct: 277 chars
- Wrong: 259 chars

Wrong answers are marginally *shorter* in prompt length on average, the opposite of what truncation would produce.
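The decile table can be approximated as follows. This is a sketch: it uses the character-count proxy defined at the top of this section, splits the sorted entries into ten equal-sized bins, and may break ties at bin boundaries differently from the original audit script. `wrong_rate_by_decile` is an illustrative name.

```python
def prompt_len(entry):
    """Character-count proxy for prompt length, as defined in this section."""
    return len(entry["question"]) + sum(len(choice) for choice in entry["choices"])


def wrong_rate_by_decile(entries, bins=10):
    """Return rows of (decile, min_len, max_len, wrong_rate_pct)."""
    ordered = sorted(entries, key=prompt_len)
    n = len(ordered)
    rows = []
    for i in range(bins):
        chunk = ordered[i * n // bins:(i + 1) * n // bins]
        if not chunk:
            continue  # fewer entries than bins
        wrong = sum(1 for entry in chunk if not entry["correct"])
        rows.append((i + 1, prompt_len(chunk[0]), prompt_len(chunk[-1]),
                     round(100 * wrong / len(chunk), 1)))
    return rows


# Synthetic data: 20 entries with lengths 1..20, only the two shortest wrong.
demo = [
    {"question": "x" * length, "choices": [], "correct": length > 2}
    for length in range(1, 21)
]
for row in wrong_rate_by_decile(demo):
    print(row)
```

A truncation problem would show up as rising wrong rates in the last rows; the synthetic data here deliberately shows the opposite.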

**ARC truncation check:**
- Mean prompt length of wrong answers (217 chars) is below that of correct answers (249 chars), the same anti-truncation pattern.

No context-per-slot truncation artifacts in either benchmark.

---

## 6. Known Historical Bug Cross-Check

| Bug | Check | Status |
|-----|-------|--------|
| Numeric-vs-letter label mismatch (cost 19 ARC questions) | ARTIFACT_NUMERIC_LABEL count | **0 in both** |
| Empty responses counted as wrong | empty raw_response | **0 in both** |
| Parser fallback picking first option (A-bias) | wrong-answer A% vs expected 25% | **ARC 22.6%, MMLU 19.0% — no bias** |
| API errors counted as wrong | `error` field non-null | **0 in both** |
| Context-per-slot truncation | prompt-length decile wrong rate | **Flat; longest decile lowest error** |

All five known historical bugs: **not present**.

---

## 7. Corrected Scores

No correction needed. The recount matches the reported scores exactly, and no artifact errors were found in the 53 exhaustively audited ARC wrong answers or the 60 sampled MMLU wrong answers.

**Final scores:**
- ARC-Challenge: **95.48%** (1119/1172) — confirmed
- MMLU-Redux: **75.42%** (1810/2400) — confirmed