# La Leaderboard v2 — LLM Leaderboard for Ibero-American Languages

[![Paper](https://img.shields.io/badge/ACL%202025-Paper-blue)](https://aclanthology.org/2025.acl-long.1561/)
[![Space](https://img.shields.io/badge/🤗%20Space-La%20Leaderboard-orange)](https://huggingface.co/spaces/pauvanbr/la-leaderboard-v2)
[![License](https://img.shields.io/badge/License-Apache%202.0-green.svg)](https://opensource.org/licenses/Apache-2.0)

**La Leaderboard** is the first open-source leaderboard to evaluate generative LLMs in languages and language varieties of Spain and Latin America. This is the **v2** update with an upgraded evaluation framework, expanded language support, and an open roadmap for culturally-relevant datasets.

## 🚀 What's New in v2

- **Modern lm-evaluation-harness**: Upgraded from a legacy fork to `lm-eval>=0.4.11` with YAML-based task definitions and the modern CLI (`lm-eval run`)
- **Expanded language support**: Added Valencian (VA) and Portuguese (PT) language tabs
- **Updated dependencies**: Transformers 4.51, Gradio 5.25, Datasets 3.5, Python 3.11+
- **Extended precision support**: Added 8-bit, 4-bit, and GPTQ quantized precisions
- **Open dataset roadmap**: New README annex documenting potential culturally relevant datasets for Spain and LATAM

## 📊 Architecture

```
la-leaderboard-v2/            # Gradio Space (frontend)
la-leaderboard-v2-requests/   # HF Dataset (evaluation queue)
la-leaderboard-v2-results/    # HF Dataset (evaluation results)
```

### Frontend (this repo)
- `app.py` — Gradio UI with tabs: Summary, ES, CA, EU, GL, VA, PT, Time/CO2, Info, Tasks, Submit
- `src/` — Data processing, leaderboard rendering, submission handling
- `tasks/` — Task registry (CSV + generated JSON/YAML for harness)

### Backend (separate Space recommended)
- Polls `requests` dataset for PENDING evaluations
- Runs `lm-eval run` with `--include_path` pointing to custom task YAMLs
- Pushes results to `results` dataset
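
As a rough illustration of the backend loop, the sketch below filters PENDING entries and builds the `lm-eval run` argument list for each one. The queue entry fields (`model`, `revision`, `precision`, `status`) and the default paths are assumptions made for this example, not the actual schema of the `requests` dataset. Passing `precision` straight through as `dtype` is also a simplification, since quantized precisions would need different model arguments.

```python
import shlex


def pending(requests):
    """Keep only the queue entries that still need to be evaluated."""
    return [r for r in requests if r.get("status") == "PENDING"]


def build_eval_command(request, include_path="./tasks", output_path="./results"):
    """Turn one (hypothetical) queue entry into an `lm-eval run` argument list."""
    model_args = (
        f"pretrained={request['model']},"
        f"revision={request.get('revision', 'main')},"
        f"dtype={request.get('precision', 'auto')}"
    )
    return [
        "lm-eval", "run",
        "--model", "hf",
        "--model_args", model_args,
        "--tasks", "laleaderboard",
        "--num_fewshot", "5",
        "--batch_size", "auto",
        "--include_path", include_path,
        "--output_path", output_path,
    ]


# Hypothetical queue contents, for illustration only.
queue = [
    {"model": "org/model-a", "revision": "main", "precision": "bfloat16", "status": "PENDING"},
    {"model": "org/model-b", "status": "FINISHED"},
]

for req in pending(queue):
    # Print the command instead of executing it; a real backend could hand
    # the list to subprocess.run and then upload the output directory.
    print(shlex.join(build_eval_command(req)))
```

Building the command as a list rather than a single string avoids shell-quoting problems when model names or paths contain unusual characters.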

## 🛠️ Reproducibility

### Install modern lm-eval-harness

```bash
# Quote the requirement so the shell does not treat ">=" as a redirect
pip install "lm-eval>=0.4.11"
```

### Run full leaderboard evaluation

```bash
lm-eval run --model hf \
  --model_args "pretrained=<your_model>,revision=<rev>,dtype=<dtype>" \
  --tasks=laleaderboard \
  --num_fewshot=5 \
  --device="cuda:0" \
  --batch_size=auto \
  --output_path=<output_path>
```

### Run single-language evaluation

```bash
lm-eval run --model hf \
  --model_args "pretrained=<your_model>,revision=<rev>,dtype=<dtype>" \
  --tasks=laleaderboard_es \
  --num_fewshot=5 \
  --device="cuda:0" \
  --batch_size=auto
```

Supported language suffixes: `es`, `ca`, `eu`, `gl`, `va`, `pt`.
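
When scripting evaluations, a small helper can guard against typos in the suffix before a long run starts. This is a minimal sketch; the function name `task_group` is made up for this example, and it only encodes the naming pattern and suffix list given above.

```python
SUPPORTED_SUFFIXES = ("es", "ca", "eu", "gl", "va", "pt")


def task_group(language=None):
    """Return the value for --tasks: the full suite, or one language group."""
    if language is None:
        return "laleaderboard"
    code = language.strip().lower()
    if code not in SUPPORTED_SUFFIXES:
        raise ValueError(
            f"unsupported language suffix {language!r}; "
            f"expected one of {', '.join(SUPPORTED_SUFFIXES)}"
        )
    return f"laleaderboard_{code}"


print(task_group())      # full leaderboard group
print(task_group("gl"))  # Galician-only group
```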

### Validate custom tasks

```bash
lm-eval validate --tasks my_custom_task --include_path ./custom_tasks
```

## 📁 Task Registry

Tasks are defined in `tasks/tasks.csv`. Run `python tasks/generate.py` to regenerate:
- `tasks/backend.json` — Flat list of harness task names for the backend
- `tasks/dummy_results.json` — Template result JSON for new submissions
- `tasks/harness.yaml` — Harness benchmark group definition
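
The sketch below shows one way such a generator could derive the three outputs from a CSV. It is not the actual `tasks/generate.py`: the CSV columns (`task`, `language`), the sample task names other than `ebas_matematicas`, and the shape of the result template are assumptions for illustration. The group file follows the harness convention of a `group` name plus a `task` list.

```python
import csv
import io
import json

# Hypothetical registry contents; the real tasks.csv may use other columns.
SAMPLE_CSV = """task,language
ebas_matematicas,es
example_task_ca,ca
example_task_eu,eu
"""


def load_tasks(csv_text):
    """Read the registry and return the harness task names in file order."""
    return [row["task"] for row in csv.DictReader(io.StringIO(csv_text))]


def backend_json(tasks):
    """Flat list of task names, serialized for the backend."""
    return json.dumps(tasks, indent=2)


def dummy_results(tasks):
    """Template result object with an empty score slot per task (assumed shape)."""
    return {"results": {task: {"acc": None} for task in tasks}}


def harness_yaml(tasks, group="laleaderboard"):
    """Benchmark group definition listing every task under one group name."""
    lines = [f"group: {group}", "task:"]
    lines += [f"  - {task}" for task in tasks]
    return "\n".join(lines) + "\n"


tasks = load_tasks(SAMPLE_CSV)
print(backend_json(tasks))
print(json.dumps(dummy_results(tasks), indent=2))
print(harness_yaml(tasks))
```

Keeping the CSV as the single source of truth means the frontend, the backend, and the harness group cannot drift apart.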

## 🧩 Custom Task Format (YAML)

The modern harness uses YAML-based task configs. Example for a Spanish exam task:

```yaml
tag:
  - multiple_choice
task: ebas_matematicas
dataset_path: org/spanish-exam-dataset
dataset_name: null
output_type: multiple_choice
training_split: train
validation_split: validation
test_split: null
doc_to_text: "Pregunta de examen: {{question}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nRespuesta:"
doc_to_target: "{{answer}}"
doc_to_choice: ["A", "B", "C", "D"]
metric_list:
  - metric: acc
    aggregation: mean
    higher_is_better: true
metadata:
  version: 1.0
```

Place custom task YAMLs in a directory and pass `--include_path ./my_tasks` to `lm-eval run`.
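
Before running the harness, it can help to preview what a prompt looks like for a sample document. The sketch below mirrors the `doc_to_text` template from the YAML above in plain Python. The sample document is invented, and the assumption that `answer` may be stored either as a letter or as an index is ours, so check it against the real dataset.

```python
LETTERS = ["A", "B", "C", "D"]


def doc_to_text(doc):
    """Plain-Python equivalent of the Jinja template in the YAML example."""
    options = "\n".join(
        f"{letter}. {choice}" for letter, choice in zip(LETTERS, doc["choices"])
    )
    return f"Pregunta de examen: {doc['question']}\n{options}\nRespuesta:"


def gold_index(doc):
    """Map the stored answer (letter or integer index) to a choice index."""
    answer = doc["answer"]
    return answer if isinstance(answer, int) else LETTERS.index(answer)


# Invented sample document, for illustration only.
doc = {
    "question": "¿Cuánto es 2 + 2?",
    "choices": ["3", "4", "5", "6"],
    "answer": "B",
}

print(doc_to_text(doc))
print("gold choice:", LETTERS[gold_index(doc)])
```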

## 📎 Annex — Potential Datasets for Spanish Cultural & Linguistic Evaluation

> This annex documents datasets, benchmarks, and data sources that could **add value** to La Leaderboard by capturing the cultural, linguistic, and idiosyncratic richness of Spain and Latin America. Not all are publicly available or digitized yet; this is a living roadmap for community contributions.

---

### 🇪🇸 1. Spanish Government & Competitive Exams (Digitization Required)

Following the successful model of **LLMzSzŁ** (Polish national exams) and **KMMLU** (Korean CSAT/bar exams), Spain has a rich ecosystem of standardized competitive exams that are ideal for LLM evaluation:

| Exam | Description | Source / How to Obtain | Status | Estimated Size |
|------|-------------|----------------------|--------|----------------|
| **EBAU / Selectividad / PAU / EvAU / PCE** | University entrance exams by autonomous community. Subjects: Maths, Physics, Chemistry, History, Spanish Language, English, Philosophy. | Each CCAA publishes PDFs on their education portals (e.g., `educacion.jccm.es`, `ensenyament.gencat.cat`, `juntadeandalucia.es/educacion`). Also aggregated on `examenesdepau.com`, `muchosexamenes.com`. | ⚠️ Needs PDF digitization (PyPDF + regex pipeline) | ~5,000–10,000 questions/year across 17 CCAA × 10 years |
| **Oposiciones** (Civil Service) | Competitive exams for public administration positions: administrative, teaching, nursing, firefighter, police. Extremely diverse subject matter. | `boe.es` (convocatorias), `aprende.gob.es`, community portals. Some already digitized on preparation platforms. | ⚠️ Fragmented; needs community scraping + OCR for image-based PDFs | Highly variable; potentially 20,000+ questions |
| **MIR** (Medical Residency) | Already partially digitized by `casimedicos.com` and on HF as `HiTZ/casimedicos-exp`. | `casimedicos.com`, `mirial.es`, Ministry of Health. | ✅ Partially available; needs 2023–2025 updates | ~3,000–5,000 questions/year |
| **EIR** (Nursing Residency) | Similar structure to MIR, for nursing. | `casimedicos.com` (nursing section), Ministry of Health. | ⚠️ Needs digitization | ~2,000 questions/year |
| **QIR** (Chemistry), **FIR** (Pharmacy), **PIR** (Psychology) | Specialized health-sciences residency exams. | Same sources as MIR. | ⚠️ Needs digitization | ~1,000–2,000 each/year |
| **PEvAU / EBAU Andalucía** | Specific Andalusian university entrance exam with public PDF archive. | `juntadeandalucia.es/educacion` → "Exámenes de acceso a la universidad" | ✅ Public PDFs available | ~500 questions/year |
| **Acceso Mayor 25/45** | University access for adults over 25/45. Questions test general knowledge + specific subjects. | CCAA education portals. | ⚠️ Needs digitization | ~1,000 questions/year |
| **FP Grado Superior** | Vocational training access exams. Technical + general knowledge. | CCAA education portals, `todofp.es`. | ⚠️ Needs digitization | ~2,000 questions/year |

**Digitization pipeline** (based on LLMzSzŁ methodology):
1. Bulk-download PDFs by year/subject from official portals
2. Extract the text layer with `pypdf` (the successor to `PyPDF2`) or `pdfplumber`
3. Regex-parse: question number → question text → options A/B/C/D → correct answer
4. Handle image-based PDFs with OCR (`Tesseract` or `easyOCR`, using `pymupdf` to render pages to images first)
5. Manual review of answer keys (often in separate PDFs)
6. Output: `{"question": "...", "choices": [...], "answer": 0, "subject": "Historia", "year": 2023, "ccaa": "Andalucía"}`
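
Steps 3 and 6 can be sketched as follows. The layout assumed here (a numbered question on one line followed by `A)` to `D)` option lines, with the answer key supplied separately) is an assumption for this example. Real exam PDFs vary by region and year, so the pattern would need adapting per source.

```python
import re

# One question per match: "N. text" followed by four option lines.
QUESTION_RE = re.compile(
    r"^(?P<num>\d+)\.\s+(?P<question>.+?)\n"
    r"A\)\s*(?P<a>.+?)\n"
    r"B\)\s*(?P<b>.+?)\n"
    r"C\)\s*(?P<c>.+?)\n"
    r"D\)\s*(?P<d>.+?)$",
    re.MULTILINE,
)


def parse_questions(text, answer_key, subject, year, ccaa):
    """Parse extracted exam text into records matching the output format above."""
    records = []
    for match in QUESTION_RE.finditer(text):
        letter = answer_key.get(int(match["num"]))
        if letter is None:
            continue  # no key entry: leave the item for manual review
        records.append({
            "question": match["question"].strip(),
            "choices": [match[name].strip() for name in "abcd"],
            "answer": "ABCD".index(letter),
            "subject": subject,
            "year": year,
            "ccaa": ccaa,
        })
    return records


# Invented sample text standing in for a PDF text layer.
TEXT = """1. ¿En qué año se aprobó la Constitución española vigente?
A) 1975
B) 1978
C) 1981
D) 1986

2. ¿Cuál es la capital de La Rioja?
A) Logroño
B) Soria
C) Pamplona
D) Burgos
"""

for record in parse_questions(TEXT, {1: "B", 2: "A"}, "Historia", 2023, "Andalucía"):
    print(record)
```

Questions that wrap across several lines will not match this pattern, which is one reason the manual review in step 5 remains necessary.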

**Important**: To limit data contamination, hold back test splits and do NOT publish them until after models have been evaluated, following the private-test-set approach of the Open Ko-LLM Leaderboard.
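
One possible way to decide which questions stay private is a deterministic hash-based split, sketched below. The 20% fraction, the salt string, and the split names are all assumptions for illustration. The point is only that the assignment is reproducible without storing a separate list.

```python
import hashlib


def split_for(question, private_fraction=0.2, salt="la-leaderboard"):
    """Assign a question to the private or public split, deterministically."""
    digest = hashlib.sha256(f"{salt}:{question}".encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return "private_test" if bucket < private_fraction else "public"


questions = [
    "¿Cuál es la capital de La Rioja?",
    "¿Quién pintó 'Las Meninas'?",
    "¿Qué se arroja en la Tomatina?",
]
for q in questions:
    print(split_for(q), "|", q)
```

Keeping the salt private prevents third parties from reconstructing which published-looking questions belong to the held-back split.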

---

### 🏛️ 2. Legal Datasets (Spain-Specific)

| Dataset | Description | Source | Status | Format |
|---------|-------------|--------|--------|--------|
| **BOE Legal Corpus** | Full text of Spanish Official State Gazette. Massive legal text resource. | `boe.es` (daily bulletin), `Pepere45/spanish-boe-legal-corpus` on HF | ⚠️ Gated / needs request | Text / XML |
| **CENDOJ** (Centro de Documentación Judicial) | Spanish case law (jurisprudencia) from Supreme Court, National Court, etc. | `cej-mjusticia.es` — has API for XML download | ✅ API available | XML |
| **BOE-XSUM** | Extreme summarization of BOE documents into plain Spanish. Tests legal comprehension + simplification. | Paper: arXiv:2509.24908 | ⚠️ Needs re-creation or permission | Text pairs |
| **PlanTL-GOB-ES lm-legal-es** | Spanish legal-domain language model + corpus from BOE, CENDOJ. | GitHub: `PlanTL-GOB-ES/lm-legal-es` | ⚠️ Partially available | Text |
| **SpaLawEx** | Already in La Leaderboard! Spanish Law School Access Exams. | `LenguajeNaturalAI/examenes_abogacia` on HF | ✅ Available | MCQ |
| **European Court of Human Rights (Spanish)** | Case descriptions + violation/no-violation labels. | `icLR/echr_cases` or build from HUDOC | ⚠️ Needs filtering for Spanish | Text + label |
| **Spanish Constitutional Court rulings** | TC rulings with subject classification. | `tribunalconstitucional.es` | ⚠️ Needs scraping + annotation | Text |
| **EU Law (Spanish)** | EUR-Lex corpus in Spanish. EU directives, regulations, decisions. | `eur-lex.europa.eu` | ✅ Public; needs download pipeline | Text / XML |

**Suggested task formats**:
- **Legal MCQ**: Extract multiple-choice questions from bar exam prep books (e.g., *Civitas*, *Tecnos*). Copyright-restricted but publishers may collaborate.
- **Legal NLI**: Given a BOE article + a citizen query, determine if the article entails/contradicts/is neutral to the query.
- **Legal summarization**: Simplify a BOE resolution into layperson Spanish.
- **Legal entailment**: Given a CENDOJ ruling summary, determine which legal principle applies.

---

### 🏥 3. Medical & Health Datasets

| Dataset | Description | Source | Status |
|---------|-------------|--------|--------|
| **CasiMedicos / HEAD-QA v2** | Already in La Leaderboard! MIR medical exam with explanations. | `HiTZ/casimedicos-exp` on HF | ✅ Available, multilingual (es/en/fr/it) |
| **MedExpQA** (ES) | Medical explanatory QA with reasoning chains. | Mentioned in La Leaderboard paper Table 2; check with authors | ⚠️ Needs confirmation |
| **Spanish clinical notes (anonymized)** | Real clinical notes for NER/entity linking. | PlanTL-GOB-ES, BSC projects | ⚠️ Privacy-restricted |
| **Spanish medical Wikipedia** | Medical articles from Spanish Wikipedia. | `es.wikipedia.org` (CC BY-SA) | ✅ Public; needs QA generation |
| **Medicina en español (MEDLINE)** | Spanish-language biomedical abstracts. | PubMed/MEDLINE Spanish filter | ✅ Public |
| **Farmacovigilancia (AEMPS)** | Spanish drug safety reports. | `aemps.gob.es` | ⚠️ Needs request + anonymization |

**Suggested medical tasks**:
- **Differential diagnosis**: Given symptoms in Spanish, select most likely diagnosis (from MIR case descriptions)
- **Drug interaction**: Given two medications, determine if there's a known interaction (from AEMPS data)
- **Patient-doctor dialogue**: Evaluate if model can produce culturally appropriate Spanish medical explanations

---

### 🎭 4. Cultural Knowledge & Idiosyncrasy Datasets (High Priority Gap)

This is the **biggest gap** in current benchmarks. No existing leaderboard adequately tests LLM knowledge of Spanish culture, history, traditions, gastronomy, sports, and regional idiosyncrasies. The proposal below is inspired by **CulturalBench**, **BLEnD** (16 countries and regions), and **BertaQA** (Basque cultural trivia).

#### 4.1 Spain-Specific Cultural Knowledge Benchmark (Proposed: *IberiaCult*)

**Methodology** (based on CulturalBench / BLEnD):
1. Seed queries on Spanish cultural topics
2. Human-AI collaborative generation: native Spaniards write questions, LLMs generate distractors, humans validate
3. 5-annotator majority vote for correct answers
4. MinHash deduplication against major pretraining corpora
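
Steps 3 and 4 can be sketched with the standard library alone. The three-vote threshold (a strict majority of five annotators), the 3-word shingles, and the 64 hash functions are assumptions for this example. A production pipeline would more likely use a dedicated MinHash library and an index over the pretraining corpus.

```python
import hashlib
from collections import Counter


def majority_answer(votes, min_agreement=3):
    """Return the most voted answer, or None if too few annotators agree."""
    answer, count = Counter(votes).most_common(1)[0]
    return answer if count >= min_agreement else None


def shingles(text, n=3):
    """Set of overlapping n-word windows used as the document's features."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(max(1, len(words) - n + 1))}


def minhash_signature(text, num_hashes=64):
    """For each seeded hash function, keep the minimum hash over all shingles."""
    features = shingles(text)
    signature = []
    for seed in range(num_hashes):
        signature.append(min(
            int.from_bytes(
                hashlib.sha256(f"{seed}:{feature}".encode("utf-8")).digest()[:8], "big"
            )
            for feature in features
        ))
    return signature


def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching positions approximates the Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


print(majority_answer(["B", "B", "B", "A", "C"]))
question = "¿En qué ciudad se celebran las Fallas cada mes de marzo?"
near_copy = "¿En qué ciudad se celebran las Fallas cada mes de marzo en España?"
print(estimated_jaccard(minhash_signature(question), minhash_signature(near_copy)))
```

Questions whose estimated similarity to corpus text exceeds a chosen threshold would be dropped or rewritten before release.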

| Topic Domain | Example Questions | Source of Inspiration |
|--------------|-------------------|----------------------|
| **History** | "¿Quién fue el primer presidente de la democracia española?" / "¿En qué año se produjo el 23-F?" | Spanish history curriculum, Wikipedia |
| **Geography** | "¿Cuál es la capital de La Rioja?" / "¿Qué río separa España de Portugal?" | School geography, IGN maps |
| **Traditions & Fiestas** | "¿En qué ciudad se celebran las Fallas?" / "¿Qué se arroja en la Tomatina?" | Official fiesta calendars, UNESCO intangible heritage |
| **Gastronomy** | "¿Cuál es el ingrediente principal del gazpacho andaluz?" / "¿De qué zona es el vino albariño?" | Denominaciones de Origen (MAPA), Spanish cookbooks |
| **Sports** | "¿Cuántas Ligas ha ganado el Real Madrid?" / "¿Quién ganó la medalla de oro en 100m lisos en Barcelona 92?" | RFEF, COE, press archives |
| **Politics & Constitution** | "¿Cuántos artículos tiene la Constitución española de 1978?" / "¿Qué comunidades autónomas tienen cooficialidad lingüística?" | Congreso.es, Constitutional text |
| **Art & Literature** | "¿Quién pintó 'Las Meninas'?" / "¿En qué ciudad nació Federico García Lorca?" | Museo del Prado, RAE, school curriculum |
| **Music & Cinema** | "¿Qué grupo español cantó 'La ciudad de los gatos'?" / "¿Quién dirigió 'El espíritu de la colmena'?" | RTVE archives, SGAE |
| **Daily Life & Idioms** | "¿Qué significa la expresión 'tener morro'?" / "¿A qué hora se suele cenar en España?" | RAE DLE, colloquial corpora, Reddit r/Spain |
| **Regional Knowledge** | "¿En qué provincia se encuentra el Parque Nacional de Ordesa?" / "¿Qué idioma se habla en el Val d'Aran?" | CCAA tourism + education portals |

**Potential data sources**:
- **Trivial Pursuit España** editions (question cards — copyright)
- **Pasapalabra** (Spanish TV quiz show — transcripts available)
- **Saber y Ganar** (TVE cultural quiz — public broadcaster, may have archives)
- **Spanish school textbooks** (history, geography, philosophy — copyright)
- **Wikipedia es** (CC BY-SA — can generate QA with modern pipelines)
- **Reddit r/AskSpain, r/Spain** threads (user content subject to Reddit's terms of use, not CC-licensed; filter for factual questions)
- **RTVE archives** (public TV — cultural programs with transcripts)

---

### 🌎 5. Latin American Cultural & Linguistic Varieties

| Dataset / Source | Language/Variety | Description | How to Obtain |
|------------------|-----------------|-------------|---------------|
| **PAES Chile** | es-CL | Chilean university entrance exam (Prueba de Acceso a la Educación Superior). Similar to EBAU. | `demre.cl` publishes PDFs with answer keys. |
| **ENEM Brasil** (Portuguese) | pt-BR | Brazilian national high school exam. Portuguese baseline for LATAM. | `enem.inep.gov.br` — microdata available. |
| **Mexican EXANI** | es-MX | Mexican university entrance exam (CENEVAL). | `ceneval.edu.mx` — some materials public. |
| **Colombian ICFES / Saber 11** | es-CO | Colombian standardized tests. | `icfes.gov.co` — public datasets. |
| **Argentine CBC / Ingreso** | es-AR | Argentine university entrance exams. | Each university (UBA, UNT, etc.) publishes exams. |
| **Voces Originarias** | Indigenous | Indigenous voices and languages of LATAM. | Mentioned in La Leaderboard paper; check with authors. |
| **AmericasNLP** | Indigenous | NLI, MT for Aymara, Guaraní, Quechua, Náhuatl, etc. | `americasnlp.org` + HF Hub |
| **Meta4XNLI** | es / en | Parallel Spanish-English NLI corpus annotated for metaphor detection and interpretation. | Paper: check authors for data access |
| **ASALE / RAE DLE** | es | Royal Spanish Academy Dictionary — authority on Spanish usage. | `rae.es` — API for definitions (for linguistic acceptability tasks) |
| **CREA / Corpus de Referencia del Español Actual** | es | RAE reference corpus of modern Spanish usage. | `rae.es` — may require academic agreement |
| **CORLEC / Corpus Oral de Referencia de la Lengua Española Contemporánea** | es | Spoken reference corpus of contemporary Spanish. | UAM / UCM (academic access) |

---

### 🗣️ 6. Regional & Minority Languages of Spain (Beyond Co-Official)

The original La Leaderboard covers Spanish, Catalan, Basque, and Galician, and v2 adds Valencian and Portuguese tabs. Spain has additional recognized languages that lack LLM benchmarks:

| Language | Status | Existing Resources | What's Needed |
|----------|--------|-------------------|---------------|
| **Aragonese** | Recognized (Aragón) | FLORES+ (WMT 2024): `oldi.org/cards/flores/arg_Latn` | Reading comprehension, QA, NLI benchmark |
| **Asturian / Bable** | Recognized (Asturias) | FLORES+: `oldi.org/cards/flores/ast_Latn`, PILAR project | Reading comprehension, QA |
| **Leonese** | Recognized (Castilla y León) | Very limited | Needs corpus creation from oral history projects |
| **Extremaduran** | Recognized (Extremadura) | Very limited | Needs community collection |
| **Fala** | Recognized (Extremadura) | Very limited | Needs community collection |
| **Occitan / Aranese** | Co-official in Val d'Aran (Catalonia) | FLORES+, some Aranese school materials | Reading comprehension, basic QA |
| **Caló** | Historical Romani variety in Spain | Very limited | Ethical considerations; work with Roma communities |

**Suggested approach**: Partner with regional language academies (Academia de la Llingua Asturiana, Academia Aragonesa de la Lengua) to digitize school textbooks and create reading comprehension + QA benchmarks.

---

### 📚 7. Existing Spanish NLP Benchmarks & Shared Tasks

Many evaluation datasets have been produced by Spanish NLP shared tasks but are not yet integrated into LLM leaderboards:

| Benchmark / Shared Task | Organizer | Task | Years | Data Access |
|------------------------|-----------|------|-------|-------------|
| **TASS** (Sentiment Analysis in Spanish) | SEPLN | Sentiment, emotion, irony detection | 2012–present | `tass.sepln.org` — download page |
| **IberLEF** | SEPLN / Iberian NLP community | Multiple tasks: NER, sentiment, fake news, hate speech, etc. | 2019–present | `ceur-ws.org` proceedings + Zenodo |
| **Hackathon del ML4ALL** | Various | Spanish-specific ML challenges | Ongoing | GitHub repos |
| **PlanTL-GOB-ES** | Spanish government NLP initiative | STS-es, SQAC, EsExams, etc. | 2020–present | `huggingface.co/PlanTL-GOB-ES` (some gated) |
| **HiTZ evaluation suite** | HiTZ (Basque) | EusExams, EusProficiency, BertaQA, etc. | 2022–present | `huggingface.co/HiTZ` |
| **AINA / BSC** | BSC (Catalan) | CatalanQA, caBreu, CoQCat, etc. | 2023–present | `huggingface.co/projecte-aina` |
| **ILENIA / USC** | USC (Galician) | GalCoLA, summarization, paraphrases | 2023–present | `huggingface.co/proxectonos` |
| **MedExpQA** | Unknown (mentioned in paper) | Medical explanatory QA in Spanish | — | Contact authors |
| **Conan EUS** | HiTZ | Counter-narrative generation (Basque/Spanish) | — | Contact authors |
| **H4rmony Eval** | Unknown | Ethics evaluation in Spanish | — | Contact authors |
| **VeritasQA** | Unknown | Truthfulness QA in ES/CA/GL | — | Contact authors |

---

### 🏛️ 8. Institutional & Government Data Portals

These portals contain raw data that could be transformed into evaluation datasets:

| Portal | Institution | Content | How to Use |
|--------|-------------|---------|------------|
| **datos.gob.es** | Spanish open data portal | Datasets from all ministries | Filter for text-heavy datasets; generate QA |
| **transparencia.gob.es** | Government transparency | Reports, budgets, contracts | Summarization, NLI, document QA |
| **boe.es** | Boletín Oficial del Estado | All Spanish legislation | Legal comprehension, summarization |
| **senado.es / congreso.es** | Parliament | Debates, bills, questions | Political reasoning, summarization |
| **ine.es** | National Statistics Institute | Statistical reports, surveys | Data-to-text generation, QA |
| **aemet.es** | Meteorology agency | Weather forecasts, alerts | Domain-specific text generation |
| **eldiario.es / elpais.com / 20minutos.es** | Press (archives) | News articles (with paywalls) | Summarization, NLI, but copyright-restricted |
| **rtve.es / rtve.es/alacarta** | Public broadcaster | TV/radio transcripts, subtitles | Public service content; potential for speech-to-text eval |
| **bne.es** | National Library | Digitized books, newspapers | Historical Spanish text; OCR quality varies |

---

### 🎯 9. Recommended Priorities for Dataset Expansion

Based on the research above, we recommend the following priority order for expanding La Leaderboard's cultural and linguistic coverage:

| Priority | Dataset | Effort | Impact | Lead |
|----------|---------|--------|--------|------|
| 🔴 **P0** | **EBAU/Selectividad digitization** (all 17 CCAA, 2015–2025) | High (scraping + PDF parsing + manual review) | Very High — tests general knowledge + reasoning in Spanish | Community + NLP Spain |
| 🔴 **P0** | **Spain Cultural Knowledge Benchmark** (*IberiaCult*) | Medium (human annotation + AI generation) | Very High — unique cultural evaluation | Cultural institutions + NLP Spain |
| 🟡 **P1** | **Expand CasiMedicos** with 2023–2025 MIR/EIR/QIR/FIR/PIR | Low-Medium (update existing pipeline) | High — medical reasoning in Spanish | HiTZ / medical community |
| 🟡 **P1** | **Legal benchmark** from BOE + CENDOJ + bar exam prep | Medium | High — legal reasoning | Law faculties + NLP Spain |
| 🟡 **P1** | **PAES Chile + LATAM exams** | Medium | High — LATAM variety coverage | LATAM NLP communities |
| 🟢 **P2** | **Aragonese / Asturian / Occitan benchmarks** | Medium-High (corpus creation) | Medium — linguistic diversity | Regional language academies |
| 🟢 **P2** | **TASS / IberLEF task integration** | Low (data already exists) | Medium | SEPLN community |
| 🔵 **P3** | **Valencian-specific tasks** (distinct from Catalan) | Low-Medium | Medium — regional identity | AINA / Valencia NLP |
| 🔵 **P3** | **Portuguese (Brazil + Portugal) benchmarks** | Low-Medium | Medium — Lusophone coverage | Portugal / Brazil NLP |

---

### 🤝 How to Contribute a Dataset

1. **Check** if the dataset is already on the Hugging Face Hub or listed above
2. **Prepare** the data in one of these formats:
   - **MCQ**: `{"question": "...", "choices": ["A", "B", "C", "D"], "answer": 0, ...}`
   - **QA**: `{"question": "...", "context": "...", "answers": ["..."], ...}`
   - **NLI**: `{"premise": "...", "hypothesis": "...", "label": "entailment|neutral|contradiction"}`
   - **Summarization**: `{"article": "...", "summary": "..."}`
3. **Create** a Hugging Face Dataset repo with a Dataset Card
4. **Write** a YAML task definition for `lm-eval-harness` (see the example in "Custom Task Format (YAML)" above)
5. **Open** a discussion on [La Leaderboard v2 Community](https://huggingface.co/spaces/pauvanbr/la-leaderboard-v2/discussions) or submit a PR
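
Before uploading, contributors may want to sanity-check their records against the formats in step 2. The validator below is a minimal sketch: the format keys (`mcq`, `qa`, `nli`, `summarization`) and the function name are our own labels, and it checks only the fields shown in the list above.

```python
REQUIRED_FIELDS = {
    "mcq": {"question", "choices", "answer"},
    "qa": {"question", "context", "answers"},
    "nli": {"premise", "hypothesis", "label"},
    "summarization": {"article", "summary"},
}
NLI_LABELS = {"entailment", "neutral", "contradiction"}


def validate_record(record, fmt):
    """Return a list of problems found in one record (empty list means OK)."""
    missing = REQUIRED_FIELDS[fmt] - set(record)
    if missing:
        return [f"missing fields: {sorted(missing)}"]
    errors = []
    if fmt == "mcq":
        answer, choices = record["answer"], record["choices"]
        if not isinstance(answer, int) or not 0 <= answer < len(choices):
            errors.append("answer must be an integer index into choices")
    if fmt == "nli" and record["label"] not in NLI_LABELS:
        errors.append(f"label must be one of {sorted(NLI_LABELS)}")
    return errors


good = {"question": "¿Quién pintó 'Las Meninas'?",
        "choices": ["Goya", "Velázquez", "Picasso", "Dalí"],
        "answer": 1}
bad = {"premise": "p", "hypothesis": "h", "label": "maybe"}

print(validate_record(good, "mcq"))  # no problems expected
print(validate_record(bad, "nli"))   # reports the invalid label
```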

---

### 📖 References & Further Reading

| Paper / Resource | Relevance |
|-----------------|-----------|
| *La Leaderboard* (ACL 2025, arXiv:2507.00999) | Original leaderboard methodology |
| *IberBench* (arXiv:2504.16921) | Complementary Iberian benchmark with 101 datasets |
| *LLMzSzŁ* (arXiv:2501.02266) | Polish national exam digitization pipeline — model for EBAU |
| *KMMLU* / *Open Ko-LLM* (arXiv:2402.11548, 2410.12445) | Korean exam benchmarks + private test sets |
| *CulturalBench* (arXiv:2410.02677) | Methodology for cultural knowledge benchmarks |
| *BLEnD* (arXiv:2406.09948) | Multicultural QA across 16 countries |
| *BertaQA* (arXiv:2406.07302) | Basque cultural trivia — model for regional cultural QA |
| *EXAMS* (EMNLP 2020, arXiv:2011.03080) | Multilingual high school exams (includes 235 Spanish questions) |
| *CasiMedicos* (arXiv:2503.00025) | Spanish MIR evaluation |
| *PARIKSHA* (arXiv:2406.15053) | Human-LLM evaluator agreement on multicultural data |

---

*This README and annex are living documents. If you know of a dataset, exam, or benchmark source not listed here, please open a discussion or PR!* 💛