---
title: Combining LLMs Rarely Beats the Single Best Model
emoji: 🎲
colorFrom: gray
colorTo: red
sdk: static
app_file: index.html
pinned: false
license: cc-by-4.0
short_description: "beta=P(all wrong): the co-failure ceiling on LLM ensembles"
datasets:
  - josefchen/co-failure-67-models
---

# Combining LLMs Rarely Beats the Single Best Model: interactive companion

A single-page, self-contained interactive companion to the arXiv paper *Combining LLMs Rarely Beats the Single Best Model: A Provable Co-Failure Ceiling Across 67 Frontier Models* (Josef Chen, KAIKAKU).

Paper: https://arxiv.org/abs/2606.27288

Every figure recomputes from the paper's committed outcome matrices over 67 frontier models. Four interactive instruments:

1. **$0 realizability certificate**: enter an observed all-wrong count K/n and single-best accuracy; get the Clopper–Pearson-certified ceiling 1−β and the maximum gain any policy could deliver (the incomplete-beta is computed client-side; verified against the paper's intervals).
2. **Pool-size divergence**: a slider over pool size k showing the underpricing ratio widen with a composition-bootstrapped band.
3. **Four-domain regime map**: co-failure (β > 0) across two math benchmarks + execution-graded code, vanishing on multiple-choice.
4. **Content-controlled format flip**: the same GPQA-Diamond questions, multiple-choice vs free-response.

No build step; pure HTML/CSS/JS. Honest scope (LLM-judged open-ended panel; strict-but-not-official code grader) is stated in the page and the paper.
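
The realizability certificate (instrument 1) can be sketched offline in a few lines of Python. This is an illustrative sketch, not the page's JavaScript implementation: it assumes a one-sided Clopper–Pearson lower bound on β at level `alpha`, finds that bound by bisection on the exact binomial tail rather than by inverting the incomplete beta function, and treats the single-best accuracy as a known number. The function names (`cp_lower`, `certificate`), the default `alpha=0.05`, and the input values in the usage line are all hypothetical; the paper may use a different confidence level or a two-sided interval.

```python
import math


def binom_pmf(i, n, p):
    """P(X = i) for X ~ Binomial(n, p), computed in log space; needs 0 < p < 1."""
    log_pmf = (
        math.lgamma(n + 1) - math.lgamma(i + 1) - math.lgamma(n - i + 1)
        + i * math.log(p) + (n - i) * math.log1p(-p)
    )
    return math.exp(log_pmf)


def binom_tail_ge(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p), by direct summation of the pmf."""
    return sum(binom_pmf(i, n, p) for i in range(k, n + 1))


def cp_lower(k, n, alpha=0.05):
    """One-sided Clopper-Pearson lower bound: the p at which P(X >= k) = alpha."""
    if k == 0:
        return 0.0  # no all-wrong items observed: the exact lower bound is 0
    lo, hi = 0.0, 1.0
    for _ in range(100):  # the tail is increasing in p, so bisection applies
        mid = (lo + hi) / 2
        if binom_tail_ge(k, n, mid) < alpha:
            lo = mid  # tail still below alpha: the bound lies to the right
        else:
            hi = mid
    return (lo + hi) / 2


def certificate(k_all_wrong, n, best_acc, alpha=0.05):
    """Certified ceiling 1 - beta_lower and the largest gain over the best model.

    Assumption: no policy can answer an item that every model gets wrong,
    so any policy's accuracy is at most 1 - beta.
    """
    beta_lower = cp_lower(k_all_wrong, n, alpha)
    ceiling = 1.0 - beta_lower
    max_gain = max(0.0, ceiling - best_acc)
    return {"beta_lower": beta_lower, "ceiling": ceiling, "max_gain": max_gain}


# Made-up inputs for illustration only; these are not figures from the paper.
print(certificate(k_all_wrong=12, n=200, best_acc=0.85))
```

Bisection on the binomial tail is chosen here only to keep the sketch dependency-free; it solves the same equation that the inverse incomplete beta function solves in closed form. The summation costs O(n) per bisection step, which is adequate for benchmark-sized n.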