Upload README.md
README.md
CHANGED
@@ -1,13 +1,339 @@
# La Leaderboard v2: LLM Leaderboard for Ibero-American Languages

[Paper (ACL 2025)](https://aclanthology.org/2025.acl-long.1561/)
[Hugging Face Space](https://huggingface.co/spaces/pauvanbr/la-leaderboard-v2)
[License: Apache 2.0](https://opensource.org/licenses/Apache-2.0)

**La Leaderboard** is the first open-source leaderboard to evaluate generative LLMs in languages and language varieties of Spain and Latin America. This is the **v2** update, with an upgraded evaluation framework, expanded language support, and an open roadmap for culturally relevant datasets.

## 🚀 What's New in v2

- **Modern lm-evaluation-harness**: Upgraded from a legacy fork to `lm-eval>=0.4.11`, with YAML-based task definitions and the modern CLI (`lm-eval run`)
- **Expanded language support**: Added Valencian (VA) and Portuguese (PT) language tabs
- **Updated dependencies**: Transformers 4.51, Gradio 5.25, Datasets 3.5, Python 3.11+
- **Extended precision support**: 8-bit, 4-bit, and GPTQ quantization
- **Open dataset roadmap**: New README annex documenting potential culturally relevant datasets for Spain and LATAM

## 📊 Architecture

```
la-leaderboard-v2/            # Gradio Space (frontend)
la-leaderboard-v2-requests/   # HF Dataset (evaluation queue)
la-leaderboard-v2-results/    # HF Dataset (evaluation results)
```

### Frontend (this repo)
- `app.py`: Gradio UI with the tabs Summary, ES, CA, EU, GL, VA, PT, Time/CO2, Info, Tasks, Submit
- `src/`: Data processing, leaderboard rendering, submission handling
- `tasks/`: Task registry (CSV + generated JSON/YAML for harness)

### Backend (separate Space recommended)
- Polls `requests` dataset for PENDING evaluations
- Runs `lm-eval run` with `--include_path` pointing to custom task YAMLs
- Pushes results to `results` dataset

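The backend steps above could be sketched roughly as follows. This is a minimal illustration and not the project's actual backend: the request fields (`model`, `revision`, `precision`, `status`), the helper names, and the `./custom_tasks` and `./results` paths are assumptions made for the example. Only the CLI flags come from the commands shown in this README.

```python
from typing import Dict, List


def pending_requests(requests: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Keep only the queue entries whose status is PENDING."""
    return [r for r in requests if r.get("status") == "PENDING"]


def build_lm_eval_command(request: Dict[str, str],
                          tasks: str = "laleaderboard",
                          include_path: str = "./custom_tasks",
                          output_path: str = "./results") -> List[str]:
    """Build the argv list for one request (hypothetical request schema)."""
    model_args = (
        f"pretrained={request['model']},"
        f"revision={request.get('revision', 'main')},"
        f"dtype={request.get('precision', 'bfloat16')}"
    )
    return [
        "lm-eval", "run",
        "--model", "hf",
        "--model_args", model_args,
        "--tasks", tasks,
        "--num_fewshot", "5",
        "--batch_size", "auto",
        "--include_path", include_path,
        "--output_path", output_path,
    ]


if __name__ == "__main__":
    # Illustrative queue; a real backend would read it from the requests dataset.
    queue = [
        {"model": "org/model-a", "revision": "main",
         "precision": "bfloat16", "status": "PENDING"},
        {"model": "org/model-b", "status": "FINISHED"},
    ]
    for req in pending_requests(queue):
        # A real backend would execute the command (for example with
        # subprocess.run) and then upload the output to the results dataset.
        print(" ".join(build_lm_eval_command(req)))
```

Returning the command as a list rather than a single string keeps it safe to pass to `subprocess.run` without shell quoting.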
## 🛠️ Reproducibility

### Install modern lm-eval-harness

Quote the requirement so that the shell does not treat `>=` as an output redirection:

```bash
pip install "lm-eval>=0.4.11"
```

### Run full leaderboard evaluation

```bash
lm-eval run --model hf \
  --model_args "pretrained=<your_model>,revision=<rev>,dtype=<dtype>" \
  --tasks=laleaderboard \
  --num_fewshot=5 \
  --device="cuda:0" \
  --batch_size=auto \
  --output_path=<output_path>
```

### Run single-language evaluation

```bash
lm-eval run --model hf \
  --model_args "pretrained=<your_model>,revision=<rev>,dtype=<dtype>" \
  --tasks=laleaderboard_es \
  --num_fewshot=5 \
  --device="cuda:0" \
  --batch_size=auto
```

Supported language suffixes: `es`, `ca`, `eu`, `gl`, `va`, `pt`.

### Validate custom tasks

```bash
lm-eval validate --tasks my_custom_task --include_path ./custom_tasks
```

## 📁 Task Registry

Tasks are defined in `tasks/tasks.csv`. Run `python tasks/generate.py` to regenerate:
- `tasks/backend.json`: Flat list of harness task names for the backend
- `tasks/dummy_results.json`: Template result JSON for new submissions
- `tasks/harness.yaml`: Harness benchmark group definition

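To make the regeneration step concrete, here is a small sketch of how a CSV registry could be turned into a flat task list and a group definition. It is not the contents of `tasks/generate.py`: the column names (`task`, `language`), the sample task `catalanqa`, and the helper names are hypothetical, and the `group:` / `task:` layout is an assumption about the harness group format.

```python
import csv
import io
import json
from typing import Dict, List


def load_tasks(csv_text: str) -> List[Dict[str, str]]:
    """Parse the registry CSV into one dict per row."""
    return list(csv.DictReader(io.StringIO(csv_text)))


def backend_task_list(rows: List[Dict[str, str]]) -> List[str]:
    """Flat, de-duplicated, sorted list of harness task names."""
    return sorted({row["task"] for row in rows})


def harness_group_yaml(group: str, task_names: List[str]) -> str:
    """Render a minimal group definition as YAML text (no YAML library needed)."""
    lines = [f"group: {group}", "task:"]
    lines += [f"  - {name}" for name in task_names]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical registry content; the real columns may differ.
    sample_csv = "task,language\nebas_matematicas,es\ncatalanqa,ca\n"
    names = backend_task_list(load_tasks(sample_csv))
    print(json.dumps(names, indent=2))          # what backend.json could hold
    print(harness_group_yaml("laleaderboard", names))
```

Generating every derived file from one CSV keeps the frontend, the backend, and the harness group definition from drifting apart.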
## 🧩 Custom Task Format (YAML)

The modern harness uses YAML-based task configs. Example for a Spanish exam task:

```yaml
tag:
  - multiple_choice
task: ebas_matematicas
dataset_path: org/spanish-exam-dataset
dataset_name: null
output_type: multiple_choice
training_split: train
validation_split: validation
test_split: null
doc_to_text: "Pregunta de examen: {{question}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nRespuesta:"
doc_to_target: "{{answer}}"
doc_to_choice: ["A", "B", "C", "D"]
metric_list:
  - metric: acc
    aggregation: mean
    higher_is_better: true
metadata:
  version: 1.0
```

Place custom task YAMLs in a directory and pass `--include_path ./my_tasks` to `lm-eval run`.

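To see what the `doc_to_text` template in the example config produces, the following sketch renders the same prompt in plain Python. It only mimics the template for illustration and is not how the harness renders it. The sample document and the assumption that `answer` is a zero-based index into `choices` are hypothetical.

```python
from typing import Dict, List

LETTERS: List[str] = ["A", "B", "C", "D"]  # mirrors doc_to_choice


def render_prompt(doc: Dict) -> str:
    """Reproduce the layout of the doc_to_text template for one document."""
    lines = [f"Pregunta de examen: {doc['question']}"]
    lines += [f"{letter}. {choice}" for letter, choice in zip(LETTERS, doc["choices"])]
    lines.append("Respuesta:")
    return "\n".join(lines)


def gold_letter(doc: Dict) -> str:
    """Map the answer index to its option letter (assumes a zero-based index)."""
    return LETTERS[int(doc["answer"])]


if __name__ == "__main__":
    # Hypothetical document following the MCQ record layout used in this README.
    doc = {"question": "¿Cuánto es 7 × 8?",
           "choices": ["54", "56", "58", "64"],
           "answer": 1}
    print(render_prompt(doc))
    print("Gold:", gold_letter(doc))
```

Printing a few rendered prompts before a full run is a cheap way to catch a template that does not match the dataset's column names.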
## 📎 Annex: Potential Datasets for Spanish Cultural & Linguistic Evaluation

> This annex documents datasets, benchmarks, and data sources that could **add value** to La Leaderboard by capturing the cultural, linguistic, and idiosyncratic richness of Spain and Latin America. Not all are publicly available or digitized yet; this is a living roadmap for community contributions.

---

### 🇪🇸 1. Spanish Government & Competitive Exams (Digitization Required)

Following the successful model of **LLMzSzŁ** (Polish national exams) and **KMMLU** (Korean CSAT/bar exams), Spain has a rich ecosystem of standardized competitive exams that are ideal for LLM evaluation:

| Exam | Description | Source / How to Obtain | Status | Estimated Size |
|------|-------------|----------------------|--------|----------------|
| **EBAU / Selectividad / PAU / EvAU / PCE** | University entrance exams by autonomous community. Subjects: Maths, Physics, Chemistry, History, Spanish Language, English, Philosophy. | Each CCAA publishes PDFs on their education portals (e.g., `educacion.jccm.es`, `ensenyament.gencat.cat`, `juntadeandalucia.es/educacion`). Also aggregated on `examenesdepau.com`, `muchosexamenes.com`. | ⚠️ Needs PDF digitization (PyPDF + regex pipeline) | ~5,000–10,000 questions/year across 17 CCAA × 10 years |
| **Oposiciones** (Civil Service) | Competitive exams for public administration positions: administrative, teaching, nursing, firefighter, police. Extremely diverse subject matter. | `boe.es` (convocatorias), `aprende.gob.es`, community portals. Some already digitized on preparation platforms. | ⚠️ Fragmented; needs community scraping + OCR for image-based PDFs | Highly variable; potentially 20,000+ questions |
| **MIR** (Medical Residency) | Already partially digitized by `casimedicos.com` and on HF as `HiTZ/casimedicos-exp`. | `casimedicos.com`, `mirial.es`, Ministry of Health. | ✅ Partially available; needs 2023–2025 updates | ~3,000–5,000 questions/year |
| **EIR** (Nursing Residency) | Similar structure to MIR, for nursing. | `casimedicos.com` (nursing section), Ministry of Health. | ⚠️ Needs digitization | ~2,000 questions/year |
| **QIR** (Chemistry), **FIR** (Pharmacy), **PIR** (Psychology) | Specialized health-sciences residency exams. | Same sources as MIR. | ⚠️ Needs digitization | ~1,000–2,000 each/year |
| **PEvAU / EBAU Andalucía** | Specific Andalusian university entrance exam with public PDF archive. | `juntadeandalucia.es/educacion` → "Exámenes de acceso a la universidad" | ✅ Public PDFs available | ~500 questions/year |
| **Acceso Mayor 25/45** | University access for adults over 25/45. Questions test general knowledge + specific subjects. | CCAA education portals. | ⚠️ Needs digitization | ~1,000 questions/year |
| **FP Grado Superior** | Vocational training access exams. Technical + general knowledge. | CCAA education portals, `todofp.es`. | ⚠️ Needs digitization | ~2,000 questions/year |

**Digitization pipeline** (based on LLMzSzŁ methodology):
1. Bulk-download PDFs by year/subject from official portals
2. Extract the text layer with `pypdf` (the successor to `PyPDF2`) or `pdfplumber`
3. Regex-parse: question number → question text → options A/B/C/D → correct answer
4. Handle image-based PDFs with OCR (`Tesseract`, `easyOCR`, `pymupdf`)
5. Manual review of answer keys (often in separate PDFs)
6. Output: `{"question": "...", "choices": [...], "answer": 0, "subject": "Historia", "year": 2023, "ccaa": "Andalucía"}`

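Steps 3 and 6 of the pipeline above could look roughly like the sketch below. It assumes the text layer has already been extracted and that each question follows a simple `1.` / `A)` to `D)` layout, which real exam PDFs will often violate. The regex, the `parse_exam` helper, and the sample text are illustrative assumptions, not the LLMzSzŁ implementation.

```python
import re
from typing import Dict, List

# One question block: "12. text" followed by four options "A) ..." to "D) ...".
QUESTION_RE = re.compile(
    r"^\s*(?P<num>\d+)[.)]\s+(?P<question>.+?)\s*\n"
    r"\s*A[.)]\s+(?P<a>.+?)\s*\n"
    r"\s*B[.)]\s+(?P<b>.+?)\s*\n"
    r"\s*C[.)]\s+(?P<c>.+?)\s*\n"
    r"\s*D[.)]\s+(?P<d>.+?)\s*$",
    re.MULTILINE,
)


def parse_exam(text: str, answer_key: Dict[int, str],
               subject: str, year: int, ccaa: str) -> List[Dict]:
    """Turn extracted exam text into MCQ records in the step 6 format."""
    records = []
    for match in QUESTION_RE.finditer(text):
        letter = answer_key.get(int(match.group("num")))
        if letter is None:
            continue  # no key for this question: leave it for manual review
        records.append({
            "question": match.group("question"),
            "choices": [match.group(k) for k in ("a", "b", "c", "d")],
            "answer": "ABCD".index(letter),  # zero-based index of the key
            "subject": subject,
            "year": year,
            "ccaa": ccaa,
        })
    return records


if __name__ == "__main__":
    sample = (
        "1. ¿Cuál es la capital de La Rioja?\n"
        "A) Logroño\nB) Pamplona\nC) Vitoria\nD) Soria\n\n"
        "2. ¿Quién pintó 'Las Meninas'?\n"
        "A) Goya\nB) Velázquez\nC) Picasso\nD) Dalí\n"
    )
    for record in parse_exam(sample, {1: "A", 2: "B"}, "Cultura", 2023, "Andalucía"):
        print(record)
```

Questions without an entry in the answer key are skipped rather than guessed, which matches the manual review called for in step 5.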
**Important**: To avoid data contamination, test splits should be held back and NOT published until after models are evaluated, following the KMMLU private-test-set model.

---

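One simple way to implement the held-back test split recommended at the end of section 1 is a deterministic, hash-based assignment, sketched below. The 20% fraction, the split names, and the `assign_split` helper are assumptions for illustration. Hashing the question text means the same question always lands in the same split, even when the dataset is regenerated or extended.

```python
import hashlib


def assign_split(question: str, private_fraction: float = 0.2) -> str:
    """Assign a question to the private or public split from a hash of its text."""
    normalized = question.strip().lower().encode("utf-8")
    digest = hashlib.sha256(normalized).digest()
    # Map the first 8 bytes of the digest to a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return "private_test" if bucket < private_fraction else "public"


if __name__ == "__main__":
    for q in ["¿Cuál es la capital de La Rioja?", "¿Quién pintó 'Las Meninas'?"]:
        print(assign_split(q), "<-", q)
```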
### 🏛️ 2. Legal Datasets (Spain-Specific)

| Dataset | Description | Source | Status | Format |
|---------|-------------|--------|--------|--------|
| **BOE Legal Corpus** | Full text of Spanish Official State Gazette. Massive legal text resource. | `boe.es` (daily bulletin), `Pepere45/spanish-boe-legal-corpus` on HF | ⚠️ Gated / needs request | Text / XML |
| **CENDOJ** (Centro de Documentación Judicial) | Spanish case law (jurisprudencia) from Supreme Court, National Court, etc. | `cej-mjusticia.es` (has API for XML download) | ✅ API available | XML |
| **BOE-XSUM** | Extreme summarization of BOE documents into plain Spanish. Tests legal comprehension + simplification. | Paper: arXiv:2509.24908 | ⚠️ Needs re-creation or permission | Text pairs |
| **PlanTL-GOB-ES lm-legal-es** | Spanish legal-domain language model + corpus from BOE, CENDOJ. | GitHub: `PlanTL-GOB-ES/lm-legal-es` | ⚠️ Partially available | Text |
| **SpaLawEx** | Already in La Leaderboard! Spanish Law School Access Exams. | `LenguajeNaturalAI/examenes_abogacia` on HF | ✅ Available | MCQ |
| **European Court of Human Rights (Spanish)** | Case descriptions + violation/no-violation labels. | `icLR/echr_cases` or build from HUDOC | ⚠️ Needs filtering for Spanish | Text + label |
| **Spanish Constitutional Court rulings** | TC rulings with subject classification. | `tribunalconstitucional.es` | ⚠️ Needs scraping + annotation | Text |
| **EU Law (Spanish)** | EUR-Lex corpus in Spanish. EU directives, regulations, decisions. | `eur-lex.europa.eu` | ✅ Public; needs download pipeline | Text / XML |

**Suggested task formats**:
- **Legal MCQ**: Extract multiple-choice questions from bar exam prep books (e.g., *Civitas*, *Tecnos*). Copyright-restricted but publishers may collaborate.
- **Legal NLI**: Given a BOE article + a citizen query, determine if the article entails/contradicts/is neutral to the query.
- **Legal summarization**: Simplify a BOE resolution into layperson Spanish.
- **Legal entailment**: Given a CENDOJ ruling summary, determine which legal principle applies.

---

### 🏥 3. Medical & Health Datasets

| Dataset | Description | Source | Status |
|---------|-------------|--------|--------|
| **CasiMedicos / HEAD-QA v2** | Already in La Leaderboard! MIR medical exam with explanations. | `HiTZ/casimedicos-exp` on HF | ✅ Available, multilingual (es/en/fr/it) |
| **MedExpQA** (ES) | Medical explanatory QA with reasoning chains. | Mentioned in La Leaderboard paper Table 2; check with authors | ⚠️ Needs confirmation |
| **Spanish clinical notes (anonymized)** | Real clinical notes for NER/entity linking. | PlanTL-GOB-ES, BSC projects | ⚠️ Privacy-restricted |
| **Spanish medical Wikipedia** | Medical articles from Spanish Wikipedia. | `es.wikipedia.org` (CC BY-SA) | ✅ Public; needs QA generation |
| **Medicina en español (MEDLINE)** | Spanish-language biomedical abstracts. | PubMed/MEDLINE Spanish filter | ✅ Public |
| **Farmacovigilancia (AEMPS)** | Spanish drug safety reports. | `aemps.gob.es` | ⚠️ Needs request + anonymization |

**Suggested medical tasks**:
- **Differential diagnosis**: Given symptoms in Spanish, select the most likely diagnosis (from MIR case descriptions)
- **Drug interaction**: Given two medications, determine if there's a known interaction (from AEMPS data)
- **Patient-doctor dialogue**: Evaluate whether the model can produce culturally appropriate Spanish medical explanations

---

### 🎭 4. Cultural Knowledge & Idiosyncrasy Datasets (High Priority Gap)

This is the **biggest gap** in current benchmarks. No existing leaderboard adequately tests LLM knowledge of Spanish culture, history, traditions, gastronomy, sports, and regional idiosyncrasies. The proposal below is inspired by **CulturalBench** (16 countries, but Spain not included) and **BertaQA** (Basque cultural trivia).

#### 4.1 Spain-Specific Cultural Knowledge Benchmark (Proposed: *IberiaCult*)

**Methodology** (based on CulturalBench / BLEnD):
1. Seed queries on Spanish cultural topics
2. Human-AI collaborative generation: native Spaniards write questions, LLMs generate distractors, humans validate
3. 5-annotator majority vote for correct answers
4. MinHash deduplication against major pretraining corpora

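Steps 3 and 4 of the methodology above can be illustrated with a short standard-library sketch. The agreement threshold of 3 out of 5, the word-level 3-gram shingles, and the 64 hash functions are assumptions chosen for the example. A production pipeline would more likely use a dedicated MinHash/LSH library to compare against large corpora.

```python
import hashlib
from collections import Counter
from typing import List, Optional, Set


def majority_answer(votes: List[str], min_agreement: int = 3) -> Optional[str]:
    """Return the most voted answer, or None when agreement is too low."""
    if not votes:
        return None
    answer, count = Counter(votes).most_common(1)[0]
    return answer if count >= min_agreement else None


def shingles(text: str, n: int = 3) -> Set[str]:
    """Word-level n-grams of a lowercased text."""
    words = text.lower().split()
    if len(words) < n:
        return {" ".join(words)}
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def minhash_signature(text: str, num_hashes: int = 64) -> List[int]:
    """One minimum hash value per seeded hash function."""
    grams = shingles(text)
    signature = []
    for seed in range(num_hashes):
        signature.append(min(
            int.from_bytes(
                hashlib.sha1(f"{seed}:{gram}".encode("utf-8")).digest()[:8], "big")
            for gram in grams
        ))
    return signature


def estimated_jaccard(sig_a: List[int], sig_b: List[int]) -> float:
    """Fraction of matching positions, an estimate of the Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


if __name__ == "__main__":
    print(majority_answer(["A", "A", "A", "B", "C"]))
    question = "¿En qué ciudad se celebran las Fallas cada mes de marzo?"
    near_copy = "¿En qué ciudad se celebran las Fallas cada mes de marzo en España?"
    print(estimated_jaccard(minhash_signature(question), minhash_signature(near_copy)))
```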
| Topic Domain | Example Questions | Source of Inspiration |
|--------------|-------------------|----------------------|
| **History** | "¿Quién fue el primer presidente de la democracia española?" / "¿En qué año se produjo el 23-F?" | Spanish history curriculum, Wikipedia |
| **Geography** | "¿Cuál es la capital de La Rioja?" / "¿Qué río separa España de Portugal?" | School geography, IGN maps |
| **Traditions & Fiestas** | "¿En qué ciudad se celebran las Fallas?" / "¿Qué se arroja en la Tomatina?" | Official fiesta calendars, UNESCO intangible heritage |
| **Gastronomy** | "¿Cuál es el ingrediente principal del gazpacho andaluz?" / "¿De qué zona es el vino albariño?" | Denominaciones de Origen (MAPA), Spanish cookbooks |
| **Sports** | "¿Cuántas Ligas ha ganado el Real Madrid?" / "¿Quién ganó la medalla de oro en 100m lisos en Barcelona 92?" | RFEF, COE, press archives |
| **Politics & Constitution** | "¿Cuántos artículos tiene la Constitución española de 1978?" / "¿Qué comunidades autónomas tienen cooficialidad lingüística?" | Congreso.es, Constitutional text |
| **Art & Literature** | "¿Quién pintó 'Las Meninas'?" / "¿En qué ciudad nació Federico García Lorca?" | Museo del Prado, RAE, school curriculum |
| **Music & Cinema** | "¿Qué grupo español cantó 'La ciudad de los gatos'?" / "¿Quién dirigió 'El espíritu de la colmena'?" | RTVE archives, SGAE |
| **Daily Life & Idioms** | "¿Qué significa la expresión 'tener morro'?" / "¿A qué hora se suele cenar en España?" | RAE DLE, colloquial corpora, Reddit r/Spain |
| **Regional Knowledge** | "¿En qué provincia se encuentra el Parque Nacional de Ordesa?" / "¿Qué idioma se habla en el Val d'Aran?" | CCAA tourism + education portals |

**Potential data sources**:
- **Trivial Pursuit España** editions (question cards; copyright-restricted)
- **Pasapalabra** (Spanish TV quiz show; transcripts available)
- **Saber y Ganar** (TVE cultural quiz; public broadcaster, may have archives)
- **Spanish school textbooks** (history, geography, philosophy; copyright-restricted)
- **Wikipedia es** (CC BY-SA; can generate QA with modern pipelines)
- **Reddit r/AskSpain, r/Spain** threads (check Reddit's terms of use; filter for factual questions)
- **RTVE archives** (public TV; cultural programs with transcripts)

---

### 🌎 5. Latin American Cultural & Linguistic Varieties

| Dataset / Source | Language/Variety | Description | How to Obtain |
|------------------|-----------------|-------------|---------------|
| **PAES Chile** | es-CL | Chilean university entrance exam (Prueba de Acceso a la Educación Superior). Similar to EBAU. | `demre.cl` publishes PDFs with answer keys. |
| **ENEM Brasil** (Portuguese) | pt-BR | Brazilian national high school exam. Portuguese baseline for LATAM. | `enem.inep.gov.br`; microdata available. |
| **Mexican EXANI** | es-MX | Mexican university entrance exam (CENEVAL). | `ceneval.edu.mx`; some materials public. |
| **Colombian ICFES / Saber 11** | es-CO | Colombian standardized tests. | `icfes.gov.co`; public datasets. |
| **Argentine CBC / Ingreso** | es-AR | Argentine university entrance exams. | Each university (UBA, UNT, etc.) publishes exams. |
| **Voces Originarias** | Indigenous | Indigenous voices and languages of LATAM. | Mentioned in La Leaderboard paper; check with authors. |
| **AmericasNLP** | Indigenous | NLI, MT for Aymara, Guaraní, Quechua, Náhuatl, etc. | `americasnlp.org` + HF Hub |
| **Meta4XNLI** | es / en | Parallel Spanish-English NLI corpus annotated for metaphor detection and interpretation. | Paper: check authors for data access |
| **ASALE / RAE DLE** | es | Royal Spanish Academy Dictionary, the authority on Spanish usage. | `rae.es`; API for definitions (for linguistic acceptability tasks) |
| **CREA / Corpus de Referencia del Español Actual** | es | RAE reference corpus of modern Spanish usage. | `rae.es`; may require academic agreement |
| **CORLEC / Corpus Oral de Referencia de la Lengua Española Contemporánea** | es | Reference corpus of spoken contemporary Spanish. | UAM / UCM; academic access |

---

### 🗣️ 6. Regional & Minority Languages of Spain (Beyond Co-Official)

The current La Leaderboard covers Spanish, Catalan, Basque, and Galician. Spain has additional recognized languages that lack LLM benchmarks:

| Language | Status | Existing Resources | What's Needed |
|----------|--------|-------------------|---------------|
| **Aragonese** | Recognized (Aragón) | FLORES+ (WMT 2024): `oldi.org/cards/flores/arg_Latn` | Reading comprehension, QA, NLI benchmark |
| **Asturian / Bable** | Recognized (Asturias) | FLORES+: `oldi.org/cards/flores/ast_Latn`, PILAR project | Reading comprehension, QA |
| **Leonese** | Recognized (Castilla y León) | Very limited | Needs corpus creation from oral history projects |
| **Extremaduran** | Recognized (Extremadura) | Very limited | Needs community collection |
| **Fala** | Recognized (Extremadura) | Very limited | Needs community collection |
| **Occitan / Aranese** | Co-official in Val d'Aran (Catalonia) | FLORES+, some Aranese school materials | Reading comprehension, basic QA |
| **Caló** | Historical Romani variety in Spain | Very limited | Ethical considerations; work with Roma communities |

**Suggested approach**: Partner with regional language academies (Academia de la Llingua Asturiana, Academia Aragonesa de la Lengua) to digitize school textbooks and create reading comprehension + QA benchmarks.

---

### 📚 7. Existing Spanish NLP Benchmarks & Shared Tasks

Many evaluation datasets have been produced by Spanish NLP shared tasks but are not yet integrated into LLM leaderboards:

| Benchmark / Shared Task | Organizer | Task | Years | Data Access |
|------------------------|-----------|------|-------|-------------|
| **TASS** (Sentiment Analysis in Spanish) | SEPLN | Sentiment, emotion, irony detection | 2012–present | `tass.sepln.org` (download page) |
| **IberLEF** | SEPLN / Iberian NLP community | Multiple tasks: NER, sentiment, fake news, hate speech, etc. | 2019–present | `ceur-ws.org` proceedings + Zenodo |
| **Hackathon del ML4ALL** | Various | Spanish-specific ML challenges | Ongoing | GitHub repos |
| **PlanTL-GOB-ES** | Spanish government NLP initiative | STS-es, SQAC, EsExams, etc. | 2020–present | `huggingface.co/PlanTL-GOB-ES` (some gated) |
| **HiTZ evaluation suite** | HiTZ (Basque) | EusExams, EusProficiency, BertaQA, etc. | 2022–present | `huggingface.co/HiTZ` |
| **AINA / BSC** | BSC (Catalan) | CatalanQA, caBreu, CoQCat, etc. | 2023–present | `huggingface.co/projecte-aina` |
| **ILENIA / USC** | USC (Galician) | GalCoLA, summarization, paraphrases | 2023–present | `huggingface.co/proxectonos` |
| **MedExpQA** | HiTZ | Medical explanatory QA in Spanish | — | Contact authors |
| **Conan EUS** | HiTZ | Counter-narrative generation (Basque/Spanish) | — | Contact authors |
| **H4rmony Eval** | Unknown | Ethics evaluation in Spanish | — | Contact authors |
| **VeritasQA** | Unknown | Truthfulness QA in ES/CA/GL | — | Contact authors |

---

### 🏛️ 8. Institutional & Government Data Portals

These portals contain raw data that could be transformed into evaluation datasets:

| Portal | Institution | Content | How to Use |
|--------|-------------|---------|------------|
| **datos.gob.es** | Spanish open data portal | Datasets from all ministries | Filter for text-heavy datasets; generate QA |
| **transparencia.gob.es** | Government transparency | Reports, budgets, contracts | Summarization, NLI, document QA |
| **boe.es** | Boletín Oficial del Estado | All Spanish legislation | Legal comprehension, summarization |
| **senado.es / congreso.es** | Parliament | Debates, bills, questions | Political reasoning, summarization |
| **ine.es** | National Statistics Institute | Statistical reports, surveys | Data-to-text generation, QA |
| **aemet.es** | Meteorology agency | Weather forecasts, alerts | Domain-specific text generation |
| **eldiario.es / elpais.com / 20minutos.es** | Press (archives) | News articles (with paywalls) | Summarization, NLI, but copyright-restricted |
| **rtve.es / rtve.es/alacarta** | Public broadcaster | TV/radio transcripts, subtitles | Public service content; potential for speech-to-text eval |
| **bne.es** | National Library | Digitized books, newspapers | Historical Spanish text; OCR quality varies |

---

### 🎯 9. Recommended Priorities for Dataset Expansion

Based on the research above, we recommend the following priority order for expanding La Leaderboard's cultural and linguistic coverage:

| Priority | Dataset | Effort | Impact | Lead |
|----------|---------|--------|--------|------|
| 🔴 **P0** | **EBAU/Selectividad digitization** (all 17 CCAA, 2015–2025) | High (scraping + PDF parsing + manual review) | Very High: tests general knowledge + reasoning in Spanish | Community + NLP Spain |
| 🔴 **P0** | **Spain Cultural Knowledge Benchmark** (*IberiaCult*) | Medium (human annotation + AI generation) | Very High: unique cultural evaluation | Cultural institutions + NLP Spain |
| 🟡 **P1** | **Expand CasiMedicos** with 2023–2025 MIR/EIR/QIR/FIR/PIR | Low-Medium (update existing pipeline) | High: medical reasoning in Spanish | HiTZ / medical community |
| 🟡 **P1** | **Legal benchmark** from BOE + CENDOJ + bar exam prep | Medium | High: legal reasoning | Law faculties + NLP Spain |
| 🟡 **P1** | **PAES Chile + LATAM exams** | Medium | High: LATAM variety coverage | LATAM NLP communities |
| 🟢 **P2** | **Aragonese / Asturian / Occitan benchmarks** | Medium-High (corpus creation) | Medium: linguistic diversity | Regional language academies |
| 🟢 **P2** | **TASS / IberLEF task integration** | Low (data already exists) | Medium | SEPLN community |
| 🔵 **P3** | **Valencian-specific tasks** (distinct from Catalan) | Low-Medium | Medium: regional identity | AINA / Valencia NLP |
| 🔵 **P3** | **Portuguese (Brazil + Portugal) benchmarks** | Low-Medium | Medium: Lusophone coverage | Portugal / Brazil NLP |

---

### 🤝 How to Contribute a Dataset

1. **Check** if the dataset is already on the Hugging Face Hub or listed above
2. **Prepare** the data in one of these formats:
   - **MCQ**: `{"question": "...", "choices": ["A", "B", "C", "D"], "answer": 0, ...}`
   - **QA**: `{"question": "...", "context": "...", "answers": ["..."], ...}`
   - **NLI**: `{"premise": "...", "hypothesis": "...", "label": "entailment|neutral|contradiction"}`
   - **Summarization**: `{"article": "...", "summary": "..."}`
3. **Create** a Hugging Face Dataset repo with a Dataset Card
4. **Write** a YAML task definition for `lm-eval-harness` (see examples above)
5. **Open** a discussion on [La Leaderboard v2 Community](https://huggingface.co/spaces/pauvanbr/la-leaderboard-v2/discussions) or submit a PR

---

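Before opening a discussion, contributors may want to check their records against the four formats listed under "How to Contribute a Dataset". The validator below is a hypothetical helper, not part of this repository. The field names follow the formats listed there, while the task-type keys (`mcq`, `qa`, `nli`, `summarization`) and the specific checks are assumptions.

```python
from typing import Dict, List

REQUIRED_FIELDS = {
    "mcq": {"question", "choices", "answer"},
    "qa": {"question", "context", "answers"},
    "nli": {"premise", "hypothesis", "label"},
    "summarization": {"article", "summary"},
}
NLI_LABELS = {"entailment", "neutral", "contradiction"}


def validate_record(record: Dict, task_type: str) -> List[str]:
    """Return a list of problems found in one record (empty list means valid)."""
    missing = REQUIRED_FIELDS[task_type] - set(record)
    if missing:
        return [f"missing fields: {sorted(missing)}"]
    problems = []
    if task_type == "mcq":
        answer, choices = record["answer"], record["choices"]
        if not isinstance(answer, int) or not 0 <= answer < len(choices):
            problems.append("answer must be a zero-based index into choices")
    if task_type == "nli" and record["label"] not in NLI_LABELS:
        problems.append(f"label must be one of {sorted(NLI_LABELS)}")
    if task_type == "qa" and not record["answers"]:
        problems.append("answers must contain at least one reference answer")
    return problems


if __name__ == "__main__":
    good = {"question": "¿Quién pintó 'Las Meninas'?",
            "choices": ["Goya", "Velázquez", "Picasso", "Dalí"], "answer": 1}
    bad = {"premise": "p", "hypothesis": "h", "label": "maybe"}
    print(validate_record(good, "mcq"))
    print(validate_record(bad, "nli"))
```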
### 📖 References & Further Reading

| Paper / Resource | Relevance |
|-----------------|-----------|
| *La Leaderboard* (ACL 2025, arXiv:2507.00999) | Original leaderboard methodology |
| *IberBench* (arXiv:2504.16921) | Complementary Iberian benchmark with 101 datasets |
| *LLMzSzŁ* (arXiv:2501.02266) | Polish national exam digitization pipeline; model for EBAU |
| *KMMLU* / *Open Ko-LLM* (arXiv:2402.11548, 2410.12445) | Korean exam benchmarks + private test sets |
| *CulturalBench* (arXiv:2410.02677) | Methodology for cultural knowledge benchmarks |
| *BLEnD* (arXiv:2406.09948) | Multicultural QA across 16 countries |
| *BertaQA* (arXiv:2406.07302) | Basque cultural trivia; model for regional cultural QA |
| *EXAMS* (EMNLP 2020, arXiv:2011.03080) | Multilingual high school exams (includes 235 Spanish questions) |
| *CasiMedicos* (arXiv:2503.00025) | Spanish MIR evaluation |
| *PARIKSHA* (arXiv:2406.15053) | Human-LLM evaluator agreement on multicultural data |

---

*This README and annex are living documents. If you know of a dataset, exam, or benchmark source not listed here, please open a discussion or PR!* 💛