HellaSwag Benchmark Audit: heretic-cerebellum-v1

Date: 2026-06-11
Auditor: Adversarial automated audit (adversarial-hellaswag-audit)
Benchmark run timestamp: 2026-06-11 16:02
File audited: heretic-cerebellum-v1_hellaswag_detailed.jsonl
Reported score: 91.78%


Verdict: TRUSTWORTHY

All six check categories pass. The reported score of 91.78% is correct. No artifacts, no fallbacks, no transport failures, no cache contamination detected.


Section 1: Schema Reconnaissance

Fields present in every entry (10042/10042):

| Field | Type | Sample values |
|---|---|---|
| context | str | "A man is sitting on a roof. he" |
| endings | list[str] | 4 continuation options |
| expected | str | "D", "C", "B", "A" |
| predicted | str | "D", "C", "B", "A" |
| raw_response | str | "D" (exactly 1 char, always) |
| correct | bool | true / false |
| error | null | always null |

No timestamp field. Entries are in sequential dataset order.

raw_response length distribution: All 10042 entries have len(raw_response) == 1. The model returned a single letter for every question with no verbose output to parse.
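The schema checks above can be reproduced with a few lines of Python. This is a minimal sketch, not the auditor's actual script: the field names come from the table above, while `load_jsonl`, `field_coverage`, and `raw_response_lengths` are illustrative names introduced here.

```python
import json
from collections import Counter

# Field names as listed in the schema table; everything else is hypothetical.
EXPECTED_FIELDS = ("context", "endings", "expected", "predicted",
                   "raw_response", "correct", "error")

def load_jsonl(path):
    """Read one JSON object per non-blank line."""
    with open(path, encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]

def field_coverage(entries):
    """For each expected field, count how many entries contain it."""
    return {field: sum(1 for entry in entries if field in entry)
            for field in EXPECTED_FIELDS}

def raw_response_lengths(entries):
    """Histogram of len(raw_response); a missing or null value counts as length 0."""
    return Counter(len(entry.get("raw_response") or "") for entry in entries)
```

For a file matching this report, every value in `field_coverage(load_jsonl(path))` would equal the entry count, and `raw_response_lengths` would contain the single key `1`.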


Section 2: Accuracy Recount

Three independent counting methods were applied:

| Method | Correct | Total | Accuracy |
|---|---|---|---|
| correct == True field | 9217 | 10042 | 91.7845% |
| predicted == expected comparison | 9217 | 10042 | 91.7845% |
| Reported in results.json | not listed | 10042 | 91.78% |

Delta from reported: 0.0045 percentage points, which is within two-decimal rounding. The reported 91.78% is accurate.

Zero disagreements between the correct boolean field and the predicted == expected computed result across all 10042 entries. The bookkeeping is internally consistent.
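The recount and the consistency check between the two counting methods could look like the sketch below. It assumes each entry is a dict with the `correct`, `predicted`, and `expected` fields from the schema; the function names are illustrative and not taken from the audit tooling.

```python
def accuracy_pct(correct, total):
    """Accuracy as a percentage, unrounded."""
    return 100.0 * correct / total

def recount(entries):
    """Count correct answers two independent ways and report any mismatch."""
    by_flag = sum(1 for e in entries if e["correct"] is True)
    by_compare = sum(1 for e in entries if e["predicted"] == e["expected"])
    # An entry disagrees when its stored flag differs from the recomputed one.
    disagreements = sum(
        1 for e in entries
        if e["correct"] != (e["predicted"] == e["expected"])
    )
    return {
        "total": len(entries),
        "by_flag": by_flag,
        "by_compare": by_compare,
        "disagreements": disagreements,
    }
```

With the counts in the table, `accuracy_pct(9217, 10042)` is about 91.7845, which rounds to the reported 91.78 at two decimals.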


Section 3: Empty / Garbage Response Audit

All 10042 entries were scanned.

| Check | Count | Status |
|---|---|---|
| Empty or whitespace raw_response | 0 | CLEAN |
| Non-null error field | 0 | CLEAN |
| Predicted letter absent from raw_response (silent parse-fallback signature) | 0 | CLEAN |

The historical "empty-response fallback to option A" bug that cost 108 entries in a prior run is not present here. Every entry has a single-char raw response matching the predicted answer exactly.
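The three checks in the table can be expressed as simple predicates over each entry. The following is a sketch under the assumption that a silent fallback shows up as a `predicted` letter that never appears in `raw_response`; `response_audit` is a hypothetical name.

```python
def response_audit(entries):
    """Count empty responses, error-flagged entries, and silent parse fallbacks."""
    empty = 0
    errored = 0
    silent_fallback = 0
    for entry in entries:
        raw = entry.get("raw_response") or ""
        if not raw.strip():
            empty += 1
        if entry.get("error") is not None:
            errored += 1
        # A predicted letter that the model never emitted suggests the
        # harness substituted a default answer.
        if entry["predicted"] not in raw:
            silent_fallback += 1
    return {"empty": empty, "errored": errored, "silent_fallback": silent_fallback}
```

An empty response scored as option A would be counted under both `empty` and `silent_fallback`, which is the signature of the earlier bug described above.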


Section 4: Wrong-Answer Audit

Total wrong entries: 825
Sample: 80 entries (random.seed(42))

| Classification | Count | Pct of sample |
|---|---|---|
| REAL (genuine model error) | 80 | 100% |
| ARTIFACT (script/transport bug) | 0 | 0% |

3 Representative REAL examples

Entry #4663

  • Context: [header] How to beat a "tough" person in a fight [title] Make the first move...
  • Model predicted: A | Gold: B
  • Raw response: 'A'
  • Assessment: Model picked a plausible but wrong continuation.

Entry #618

  • Context: People are riding camels in a desert area. two individuals that are leading the...
  • Model predicted: C | Gold: D
  • Raw response: 'C'
  • Assessment: Semantically close wrong answer; genuine comprehension error.

Entry #130

  • Context: A shot of a cyclist is shown and then it cuts back to the same man in white spea...
  • Model predicted: C | Gold: B
  • Raw response: 'C'
  • Assessment: Genuine model error on an ambiguous video-description question.

No artifacts found in the 80-entry sample. All wrong answers show single-letter responses that are plausible wrong choices from the option set.
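A reproducible draw of wrong answers for manual review could be sketched as follows. The report only states that `random.seed(42)` was used; the exact sampling call is not shown, so the use of `random.Random(seed).sample` here is an assumption and will not necessarily select the same 80 entries.

```python
import random

def sample_wrong(entries, k=80, seed=42):
    """Return up to k (index, entry) pairs drawn from the wrong answers.

    A private Random instance keeps the draw reproducible without
    touching the global random state.
    """
    wrong = [(i, e) for i, e in enumerate(entries) if not e["correct"]]
    rng = random.Random(seed)
    return rng.sample(wrong, min(k, len(wrong)))
```

Calling `sample_wrong(entries)` twice on the same data returns the same pairs in the same order, so a second reviewer can inspect exactly the same sample.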


Section 5: Answer Distribution

No fallback-to-first-option bias detected. Expected ~25% per option; all within normal variance.

Model picks

| Option | Count | Percentage | Flag |
|---|---|---|---|
| A | 2492 | 24.82% | none |
| B | 2403 | 23.93% | none |
| C | 2543 | 25.32% | none |
| D | 2604 | 25.93% | none |

Gold distribution

| Option | Count | Percentage |
|---|---|---|
| A | 2515 | 25.04% |
| B | 2485 | 24.75% |
| C | 2584 | 25.73% |
| D | 2458 | 24.48% |

No option breached the 35% threshold; the largest model share is 25.93% (option D). The model's answer distribution closely tracks the gold distribution, which is what one expects from a model choosing on content rather than from a harness falling back to a fixed option.
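The distribution check amounts to computing each option's share and comparing it with the 35% threshold. A minimal sketch, with `option_shares` and `over_threshold` as illustrative names:

```python
from collections import Counter

def option_shares(letters, options="ABCD"):
    """Percentage of answers falling on each option."""
    counts = Counter(letters)
    total = sum(counts[opt] for opt in options)
    if total == 0:
        return {opt: 0.0 for opt in options}
    return {opt: 100.0 * counts[opt] / total for opt in options}

def over_threshold(shares, threshold=35.0):
    """Options whose share exceeds the fallback-bias threshold."""
    return [opt for opt, pct in shares.items() if pct > threshold]
```

Passing `[e["predicted"] for e in entries]` gives the model distribution and `[e["expected"] for e in entries]` gives the gold distribution; an empty result from `over_threshold` corresponds to the "no flag" outcome in the tables.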


Section 6: Timing / Contiguous Failure Check

No timestamp field is present in the JSONL; sequential position is the only ordering available.

| Metric | Value |
|---|---|
| Longest consecutive wrong-answer streak | 5 (entries #967–971) |
| Any streak >= 10 (transport-failure threshold) | None |

The streak-of-5 was manually inspected. All five entries show valid single-letter predictions for distinct questions with legitimate wrong but plausible answers:

  • Entry 967: predicted B, gold D (newscast/gymnastics)
  • Entry 968: predicted D, gold B (same context topic, different question)
  • Entry 969: predicted C, gold B (toothbrush/bathroom)
  • Entry 970: predicted A, gold B (harmonica player)
  • Entry 971: predicted B, gold C (skateboarding)

No identical contexts or repeated endings appear in the streak. Entries 967 and 968 share a topic but are distinct questions, so all five are independent items. The streak is noise, not a transport stall.
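Finding the longest run of consecutive wrong answers is a single pass over the `correct` flags in file order. The sketch below returns a zero-based start position; whether the entry numbers quoted above are zero-based or one-based is not stated in the report, so treat the mapping as an assumption.

```python
def longest_wrong_streak(correct_flags):
    """Return (length, start_index) of the longest run of False values.

    start_index is None when there are no wrong answers at all.
    """
    best_len, best_start = 0, None
    run_len, run_start = 0, None
    for i, ok in enumerate(correct_flags):
        if ok:
            run_len = 0
            continue
        if run_len == 0:
            run_start = i
        run_len += 1
        if run_len > best_len:
            best_len, best_start = run_len, run_start
    return best_len, best_start
```

A run is then compared with the 10-entry transport-failure threshold, for example `longest_wrong_streak([e["correct"] for e in entries])[0] >= 10`.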


Section 7: Meta / Cache Verification

Contents of heretic-cerebellum-v1_meta.json:

{
  "model_size": "unknown",
  "model_name": "heretic-cerebellum-v1",
  "port": 7890
}

Observations:

  • model_name matches the filename prefix (heretic-cerebellum-v1). No cache contamination from a different model identity.
  • model_size: "unknown" is incomplete but not a contradiction; the GGUF filename would be the authoritative source.
  • port: 7890 is consistent with a dedicated bench server (not the default 8080). This port is distinct from the production inference server (7800), which is correct practice.
  • No model path or SHA fingerprint is stored, so file-level fingerprint verification of the model weights is not possible from this file alone. This is a metadata weakness but does not contradict the run data.

Cache contamination check: The JSONL filename, results JSON, and meta JSON all reference heretic-cerebellum-v1 consistently. The run timestamp (16:02) matches the JSONL mtime. No evidence of a stale cache from a different model.
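The identity part of this check can be sketched as a comparison between `model_name` in the meta file and the prefix of each artifact filename. The suffixes below are taken from the two filenames named in this report; the helper names are hypothetical, and the mtime comparison is left out.

```python
from pathlib import Path

def model_prefix(filename, suffix):
    """Strip a known suffix from a filename and return the model prefix."""
    name = Path(filename).name
    if not name.endswith(suffix):
        raise ValueError(f"{name!r} does not end with {suffix!r}")
    return name[: -len(suffix)]

def identity_consistent(meta, jsonl_name, meta_name):
    """True when model_name matches the prefix of both artifact filenames."""
    jsonl_prefix = model_prefix(jsonl_name, "_hellaswag_detailed.jsonl")
    meta_prefix = model_prefix(meta_name, "_meta.json")
    return meta.get("model_name") == jsonl_prefix == meta_prefix
```

For the files in this audit, `identity_consistent` would be called with the parsed meta JSON, `heretic-cerebellum-v1_hellaswag_detailed.jsonl`, and `heretic-cerebellum-v1_meta.json`.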


Summary Table

| Check | Result | Detail |
|---|---|---|
| Entry count | PASS | 10042 entries, 0 parse errors |
| Accuracy recount | PASS | 91.7845% computed vs 91.78% reported (delta 0.0045 percentage points) |
| correct field vs computed | PASS | 0 disagreements across all 10042 entries |
| Empty raw_response | PASS | 0 empty entries |
| Error field set | PASS | 0 error-flagged entries |
| Silent parse fallback | PASS | 0 entries where letter absent from raw |
| Wrong-answer artifacts (80-sample) | PASS | 0 artifacts / 80 REAL |
| Answer distribution (model) | PASS | A:24.82% B:23.93% C:25.32% D:25.93% (max 25.93%, below 35% threshold) |
| Answer distribution (gold) | PASS | Near-uniform (24.48% to 25.73%), no anomalies |
| Contiguous wrong streak | PASS | Max streak = 5, below 10-entry threshold |
| Meta model identity | PASS | model_name matches filename |
| Cache contamination | PASS | No cross-model fingerprint mismatch |

Final Verdict

TRUSTWORTHY

  • Recount accuracy: 91.7845% (rounds to 91.78%), matching the reported score
  • Empty response count: 0
  • Artifact count in 80-entry wrong-answer sample: 0
  • Cache contamination flag: None
  • Corrected score: Not needed; the reported score is accurate

The score can be published as-is. The only metadata gap is the absence of a model path or SHA in meta.json; future runs should record the GGUF path for traceability. This is a bookkeeping recommendation, not a validity concern.
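One way to close the metadata gap is to hash the model file when the run starts and store the result next to the path. This is a sketch of that recommendation only: the `model_path` and `model_sha256` keys are hypothetical additions to meta.json and are not part of the current schema.

```python
import hashlib

def sha256_chunks(chunks):
    """SHA-256 hex digest of an iterable of byte chunks."""
    digest = hashlib.sha256()
    for chunk in chunks:
        digest.update(chunk)
    return digest.hexdigest()

def sha256_file(path, chunk_size=1 << 20):
    """Hash a file in 1 MiB chunks so large GGUF files are never fully in memory."""
    with open(path, "rb") as handle:
        return sha256_chunks(iter(lambda: handle.read(chunk_size), b""))

def with_fingerprint(meta, model_path, digest):
    """Return a copy of meta extended with the hypothetical traceability keys."""
    return {**meta, "model_path": str(model_path), "model_sha256": digest}
```

A later audit could then recompute `sha256_file` on the GGUF file and compare it with the stored `model_sha256` value.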