
Benchmark Audit: ARC-Challenge + MMLU-Redux

Model: heretic-cerebellum-v1
Auditor: adversarial / automated
Audit date: 2026-06-11
Audited files:

  • heretic-cerebellum-v1_arc_detailed.jsonl (1172 entries)
  • heretic-cerebellum-v1_mmlu_redux_detailed.jsonl (2400 entries)

Verdict

| Benchmark | Verdict | Reported | Recount | Artifact errors |
| --- | --- | --- | --- | --- |
| ARC-Challenge | TRUSTWORTHY | 95.48% | 95.48% | 0 |
| MMLU-Redux | TRUSTWORTHY | 75.42% | 75.42% | 0 |

No artifacts, no parse failures, no label-format bugs, no truncation signature detected. All 1172 ARC and 2400 MMLU entries are internally consistent.


1. Schema Reconnaissance

Both files use the same schema per line:

{
  "question": "...",
  "choices": ["A text", "B text", "C text", "D text"],
  "expected": "C",
  "predicted": "C",
  "raw_response": "C",
  "correct": true,
  "error": null
}

MMLU adds a "subject" field. The raw_response field stores exactly what the model returned: in every entry across both benchmarks this is a single uppercase letter (A/B/C/D). There are no multi-token completions, no reasoning traces, no chain-of-thought artifacts.
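A minimal sketch of how the schema above could be checked per line. The helper names (`load_jsonl`, `schema_problems`) are hypothetical and are not taken from the audit tooling; the A-D letter set is assumed from the description above.

```python
import json

# Field names taken from the schema shown above; "subject" is MMLU-only,
# so it is deliberately not required here.
REQUIRED_FIELDS = {"question", "choices", "expected", "predicted",
                   "raw_response", "correct", "error"}
VALID_LETTERS = {"A", "B", "C", "D"}

def load_jsonl(path):
    """Read one JSON object per non-empty line."""
    with open(path, encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]

def schema_problems(entry):
    """Return a list of human-readable schema problems for one entry."""
    problems = []
    missing = REQUIRED_FIELDS - entry.keys()
    if missing:
        problems.append(f"missing fields: {sorted(missing)}")
    for field in ("expected", "predicted", "raw_response"):
        value = entry.get(field)
        if not (isinstance(value, str) and value in VALID_LETTERS):
            problems.append(f"{field} is not a single letter A-D")
    return problems
```

An entry that passes returns an empty list, so counting entries with a non-empty result gives the number of schema violations per file.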


2. Aggregate Verification

Recount performed by independently summing correct == true flags:

| Benchmark | Total | Correct | Wrong | Recount acc | Reported acc | Match |
| --- | --- | --- | --- | --- | --- | --- |
| ARC | 1172 | 1119 | 53 | 95.48% | 95.48% | YES |
| MMLU | 2400 | 1810 | 590 | 75.42% | 75.42% | YES |

Both match to 2 decimal places. The summary JSONs are not lying.
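The recount can be reproduced with a few lines. This is an illustrative sketch: `recount` is a hypothetical helper, and it assumes each entry is a dict with a boolean `correct` field as in the schema above.

```python
def recount(entries):
    """Recompute accuracy by summing the per-entry `correct` flags.

    Returns (total, n_correct, n_wrong, accuracy_percent), with the
    accuracy rounded to 2 decimal places to match the reported figures.
    """
    total = len(entries)
    n_correct = sum(1 for entry in entries if entry["correct"] is True)
    accuracy = round(100.0 * n_correct / total, 2) if total else 0.0
    return total, n_correct, total - n_correct, accuracy
```

With the counts in the table, 1119/1172 rounds to 95.48 and 1810/2400 rounds to 75.42.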


3. Wrong-Answer Classification

ARC-Challenge: all 53 wrong entries

| Class | Count |
| --- | --- |
| REAL_ERROR (model chose wrong letter) | 53 |
| ARTIFACT_EMPTY | 0 |
| ARTIFACT_UNPARSEABLE | 0 |
| ARTIFACT_PARSE_MISMATCH | 0 |
| ARTIFACT_NUMERIC_LABEL | 0 |

0 artifacts out of 53 wrong answers.
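The report does not spell out the rule behind each class, so the following is only one plausible reading, with every rule an assumption: a numeric gold label counts as ARTIFACT_NUMERIC_LABEL, a blank response as ARTIFACT_EMPTY, a response with no standalone A-D letter as ARTIFACT_UNPARSEABLE, and a parsed letter that does not appear in the response as ARTIFACT_PARSE_MISMATCH. `classify_wrong` is a hypothetical name.

```python
import re

def classify_wrong(entry):
    """Assign one class to an entry already known to be wrong.

    The rules below are assumptions for illustration, not the
    auditor's documented definitions.
    """
    raw = (entry.get("raw_response") or "").strip()
    expected = str(entry.get("expected", ""))
    predicted = entry.get("predicted")

    if expected.isdigit():
        # Gold label is "1".."4" while the model answers in letters.
        return "ARTIFACT_NUMERIC_LABEL"
    if not raw:
        return "ARTIFACT_EMPTY"
    letters = re.findall(r"\b[A-D]\b", raw)
    if not letters:
        return "ARTIFACT_UNPARSEABLE"
    if predicted not in letters:
        # The parser reported a letter the model never wrote.
        return "ARTIFACT_PARSE_MISMATCH"
    return "REAL_ERROR"
```

Tallying the results with `collections.Counter` over the wrong entries would yield a table of the shape shown above.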

MMLU-Redux: 60-entry random sample (seed=42) of 590 wrong entries

| Class | Count |
| --- | --- |
| REAL_ERROR (model chose wrong letter) | 60 |
| ARTIFACT_EMPTY | 0 |
| ARTIFACT_UNPARSEABLE | 0 |
| ARTIFACT_PARSE_MISMATCH | 0 |
| ARTIFACT_NUMERIC_LABEL | 0 |

0 artifacts out of 60 sampled wrong answers. With 0/60 artifacts observed, the 95% Wilson interval for artifact prevalence among all 590 wrong answers is 0–6%. In the most pessimistic reading, ~35 of the 590 wrong answers could be artifacts; even then the corrected score would be 75.42% + (35/2400)*100 = ~76.9%. Because correcting an artifact among the wrong answers can only raise the score, the reported figure is a floor.
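The interval and the pessimistic score bound can be checked with the standard Wilson score formula. This is a sketch with hypothetical helper names; z = 1.96 is the usual two-sided 95% value.

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion."""
    if n <= 0:
        raise ValueError("n must be positive")
    p_hat = successes / n
    denom = 1 + z * z / n
    centre = (p_hat + z * z / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)

def pessimistic_score(reported_pct, recoverable, total):
    """Score if `recoverable` wrong answers turned out to be correct."""
    return reported_pct + 100.0 * recoverable / total

low, high = wilson_interval(0, 60)        # upper bound is about 0.06
bound = pessimistic_score(75.42, 35, 2400)  # about 76.9
```

For zero observed artifacts the upper bound reduces to z²/(n + z²), which is why it depends only on the sample size of 60.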


4. Distribution Checks

4a. Choice distribution (predicted vs gold)

ARC:

| Choice | Predicted | Gold | Delta |
| --- | --- | --- | --- |
| A | 269 | 266 | +3 |
| B | 312 | 311 | +1 |
| C | 304 | 310 | -6 |
| D | 287 | 285 | +2 |

All deltas are within ±6. No evidence of the parser defaulting to any single choice.

MMLU:

| Choice | Predicted | Gold | Delta |
| --- | --- | --- | --- |
| A | 497 | 537 | -40 |
| B | 614 | 600 | +14 |
| C | 613 | 606 | +7 |
| D | 676 | 657 | +19 |

The model under-picks A and over-picks D relative to the gold distribution. This is a model-level tendency, not a parser artifact: A-defaulting (the known parser-fallback bug) would produce the opposite signature (over-picking A).
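The Predicted, Gold, and Delta columns can be derived as follows. This is an illustrative sketch; `choice_deltas` is a hypothetical helper and the A-D letter set is assumed from the schema.

```python
from collections import Counter

def choice_deltas(entries, letters="ABCD"):
    """Return {letter: predicted_count - gold_count}.

    A large positive delta on a single letter is the signature a
    parser that falls back to one option would leave behind.
    """
    predicted = Counter(entry["predicted"] for entry in entries)
    gold = Counter(entry["expected"] for entry in entries)
    return {letter: predicted[letter] - gold[letter] for letter in letters}
```

When every prediction and every gold label is one of the listed letters, the deltas sum to zero, which is a quick sanity check on the tables above.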

4b. Empty/whitespace raw responses across ALL entries

  • ARC: 0 / 1172
  • MMLU: 0 / 2400

No empty responses anywhere.

4c. Parsed choice absent from raw_response (all entries with long responses)

All 3572 entries have single-character raw responses. The parsed predicted field equals raw_response in 100% of entries (0 mismatches in either benchmark).

4d. correct flag internal consistency

  • ARC entries where correct=True but predicted != expected: 0
  • MMLU entries where correct=True but predicted != expected: 0
  • ARC entries where correct=False but predicted == expected: 0
  • MMLU entries where correct=False but predicted == expected: 0

The correct flag is computed correctly from predicted == expected with no exceptions.
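The checks in 4c and 4d amount to three counters over all entries. A minimal sketch, assuming the schema above; `consistency_report` and its key names are hypothetical.

```python
def consistency_report(entries):
    """Count entries whose fields disagree with each other."""
    report = {
        "predicted_differs_from_raw": 0,   # check 4c
        "true_but_mismatch": 0,            # check 4d: correct, predicted != expected
        "false_but_match": 0,              # check 4d: not correct, predicted == expected
    }
    for entry in entries:
        raw = (entry.get("raw_response") or "").strip()
        matches = entry["predicted"] == entry["expected"]
        if entry["predicted"] != raw:
            report["predicted_differs_from_raw"] += 1
        if entry["correct"] and not matches:
            report["true_but_mismatch"] += 1
        if not entry["correct"] and matches:
            report["false_but_match"] += 1
    return report
```

A clean file yields zeros for all three keys.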

4e. First-option bias among wrong answers

  • ARC wrong answers predicted as 'A': 12/53 = 22.6% (expected if random: 25%)
  • MMLU wrong answers predicted as 'A': 112/590 = 19.0% (expected if random: 25%)

If anything, the model slightly under-picks 'A' when wrong, so there is no first-option parser bias.
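One way to quantify this check is to compute the share of 'A' among wrong answers and, as an addition not present in the audit, a one-sided binomial tail against the 25% baseline. Both helper names are hypothetical.

```python
import math

def first_option_share(entries, option="A"):
    """Return (picked, n_wrong, share) for `option` among wrong answers."""
    wrong = [entry for entry in entries if not entry["correct"]]
    picked = sum(1 for entry in wrong if entry["predicted"] == option)
    share = picked / len(wrong) if wrong else 0.0
    return picked, len(wrong), share

def binom_upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p); small values suggest over-picking."""
    return sum(math.comb(n, i) * p ** i * (1 - p) ** (n - i)
               for i in range(k, n + 1))
```

For example, `binom_upper_tail(12, 53, 0.25)` gives the probability of seeing at least 12 'A' picks among 53 wrong answers if wrong picks landed on 'A' a quarter of the time. A parser fallback to 'A' would push this probability toward zero; an observed share below 25% keeps it large.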


5. Truncation Analysis (MMLU)

Prompt length was approximated as len(question) + sum(len(choice) for choice in choices). This character count is only a proxy for the tokenized prompt (it ignores any prompt template and counts characters, not tokens), but it should preserve the relative ordering of long versus short prompts.

Wrong rate by prompt-length decile:

| Decile | Len range (chars) | Wrong rate |
| --- | --- | --- |
| 1 (shortest) | 17–101 | 24.2% |
| 2 | 101–129 | 23.8% |
| 3 | 129–154 | 29.2% |
| 4 | 154–184 | 25.4% |
| 5 | 184–219 | 25.4% |
| 6 | 219–259 | 22.1% |
| 7 | 259–315 | 25.0% |
| 8 | 316–382 | 25.0% |
| 9 | 382–489 | 25.0% |
| 10 (longest) | 489–4872 | 20.8% |

No truncation signature. The longest decile (489–4872 chars, including a 4872-char outlier) has the lowest wrong rate (20.8%), not the highest. If context truncation were occurring, deciles 9–10 would show elevated error rates. The distribution is roughly flat across deciles, with decile 3 as the minor high point (29.2%), most likely driven by subject difficulty rather than length.

Mean prompt length of correct vs wrong answers:

  • Correct: 277 chars
  • Wrong: 259 chars

Wrong answers are marginally shorter in prompt length on average, the opposite of what truncation would produce.
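The decile table and the mean-length comparison can be sketched as follows, using the same character-count proxy. The function names are hypothetical, and the rank-based binning is an assumption; the audit does not state how ties at decile boundaries were handled.

```python
def prompt_length(entry):
    """Character-count proxy: question plus all choice texts."""
    return len(entry["question"]) + sum(len(c) for c in entry["choices"])

def wrong_rate_by_decile(entries, bins=10):
    """Rank entries by proxy length and split them into equal-size bins.

    Returns rows of (bin_number, min_len, max_len, wrong_rate).
    """
    ranked = sorted(entries, key=prompt_length)
    n = len(ranked)
    rows = []
    for i in range(bins):
        chunk = ranked[i * n // bins:(i + 1) * n // bins]
        if not chunk:
            continue
        wrong = sum(1 for entry in chunk if not entry["correct"])
        rows.append((i + 1, prompt_length(chunk[0]),
                     prompt_length(chunk[-1]), wrong / len(chunk)))
    return rows

def mean_length(entries, correct):
    """Mean proxy length over entries whose `correct` flag equals `correct`."""
    lengths = [prompt_length(e) for e in entries if bool(e["correct"]) == correct]
    return sum(lengths) / len(lengths) if lengths else 0.0
```

A truncation problem would show up as a wrong rate that climbs in the last rows and as `mean_length(entries, False)` exceeding `mean_length(entries, True)`.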

ARC truncation check:

  • Mean prompt length of wrong answers (217 chars) is below that of correct answers (249 chars). Same anti-truncation pattern.

No context-per-slot truncation artifacts in either benchmark.


6. Known Historical Bug Cross-Check

| Bug | Check | Status |
| --- | --- | --- |
| Numeric-vs-letter label mismatch (cost 19 ARC questions) | ARTIFACT_NUMERIC_LABEL count | 0 in both |
| Empty responses counted as wrong | empty raw_response | 0 in both |
| Parser fallback picking first option (A-bias) | wrong-answer A% vs expected 25% | ARC 22.6%, MMLU 19.0%; no bias |
| API errors counted as wrong | error field non-null | 0 in both |
| Context-per-slot truncation | prompt-length decile wrong rate | Flat; longest decile lowest error |

All five known historical bugs: not present.


7. Corrected Scores

No correction needed. Recount matches reported scores exactly. Zero artifact errors detected in all sampled and exhaustively audited wrong answers.

Final scores:

  • ARC-Challenge: 95.48% (1119/1172), confirmed
  • MMLU-Redux: 75.42% (1810/2400), confirmed