---
title: Combining LLMs Rarely Beats the Single Best Model
emoji: 🎲
colorFrom: gray
colorTo: red
sdk: static
app_file: index.html
pinned: false
license: cc-by-4.0
short_description: "beta=P(all wrong): the co-failure ceiling on LLM ensembles"
datasets:
  - josefchen/co-failure-67-models
---

# Combining LLMs Rarely Beats the Single Best Model — interactive companion

A single-page, self-contained interactive companion to the arXiv paper
*Combining LLMs Rarely Beats the Single Best Model: A Provable Co-Failure Ceiling Across 67 Frontier Models* (Josef Chen, KAIKAKU). Paper: https://arxiv.org/abs/2606.27288

Every figure is recomputed from the paper's committed outcome matrices over 67 frontier models. The page
has four interactive instruments:

1. **$0 realizability certificate**: enter an observed all-wrong count K/n and the single-best accuracy;
   get the Clopper–Pearson-certified ceiling 1−β and the maximum gain any policy could deliver
   (the incomplete beta function is computed client-side and verified against the paper's intervals).
2. **Pool-size divergence** — a slider over pool size k showing the underpricing ratio widen with a
   composition-bootstrapped band.
3. **Four-domain regime map**: co-failure persists (β > 0) on two math benchmarks and execution-graded
   code, and vanishes on multiple-choice.
4. **Content-controlled format flip** — the same GPQA-Diamond questions, multiple-choice vs free-response.
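
As a rough illustration of the first instrument, the sketch below computes a one-sided Clopper–Pearson lower bound on β from K all-wrong items out of n, then reports the ceiling 1−β and the largest gain over the single best model. It is not the page's code: the function names, the default α = 0.05, the one-sided form of the interval, and the bisection on the binomial tail (used here in place of an inverse incomplete beta function) are all assumptions made for illustration.

```javascript
// Illustrative sketch only; names and defaults are hypothetical, not the page's API.

// log of the binomial coefficient C(n, k), built from a running sum of logs.
function logChoose(n, k) {
  let s = 0;
  for (let i = 1; i <= k; i++) s += Math.log(n - k + i) - Math.log(i);
  return s;
}

// P(X >= k) for X ~ Binomial(n, p), summed term by term in log space.
function binomUpperTail(k, n, p) {
  if (k <= 0) return 1;
  if (p <= 0) return 0;
  if (p >= 1) return 1;
  const logP = Math.log(p);
  const logQ = Math.log(1 - p);
  let logC = logChoose(n, k);
  let total = 0;
  for (let i = k; i <= n; i++) {
    total += Math.exp(logC + i * logP + (n - i) * logQ);
    if (i < n) logC += Math.log(n - i) - Math.log(i + 1); // C(n, i) -> C(n, i + 1)
  }
  return Math.min(1, total);
}

// One-sided Clopper-Pearson lower bound on a proportion:
// the p at which P(X >= k) equals alpha, found by bisection
// (the tail probability is increasing in p).
function clopperPearsonLower(k, n, alpha) {
  if (k <= 0) return 0;
  let lo = 0;
  let hi = 1;
  for (let iter = 0; iter < 60; iter++) {
    const mid = (lo + hi) / 2;
    if (binomUpperTail(k, n, mid) < alpha) lo = mid;
    else hi = mid;
  }
  return lo; // conservative side of the root
}

// Assumed reading of the certificate: the ceiling uses the lower bound on beta,
// and the maximum gain is the gap between that ceiling and the single-best accuracy.
function certificate(allWrong, n, bestAccuracy, alpha = 0.05) {
  const betaLower = clopperPearsonLower(allWrong, n, alpha);
  const ceiling = 1 - betaLower;
  const maxGain = Math.max(0, ceiling - bestAccuracy);
  return { betaLower, ceiling, maxGain };
}
```

With K = 0 the lower bound on β is 0, so `certificate(0, 100, 0.9)` reports a ceiling of 1: the data cannot rule out a perfect combiner. Any K > 0 pulls the ceiling below 1 and shrinks the available gain accordingly.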

No build step; pure HTML/CSS/JS. Scope limitations (an LLM-judged open-ended panel; a strict but
unofficial code grader) are stated in the page and the paper.