Commit 80603f8 by pauvanbr · verified · 1 Parent(s): 4b8d216

Upload README.md

  1. README.md +336 -10
README.md CHANGED
@@ -1,13 +1,339 @@
  ---
- title: La Leaderboard V2
- emoji: 😻
- colorFrom: indigo
- colorTo: red
- sdk: gradio
- sdk_version: 6.14.0
- python_version: '3.13'
- app_file: app.py
- pinned: false
  ---
 
- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
 
+ # La Leaderboard v2 — LLM Leaderboard for Ibero-American Languages
+
+ [![Paper](https://img.shields.io/badge/ACL%202025-Paper-blue)](https://aclanthology.org/2025.acl-long.1561/)
+ [![Space](https://img.shields.io/badge/🤗%20Space-La%20Leaderboard-orange)](https://huggingface.co/spaces/pauvanbr/la-leaderboard-v2)
+ [![License](https://img.shields.io/badge/License-Apache%202.0-green.svg)](https://opensource.org/licenses/Apache-2.0)
+
+ **La Leaderboard** is the first open-source leaderboard to evaluate generative LLMs in languages and language varieties of Spain and Latin America. This is the **v2** update, with an upgraded evaluation framework, expanded language support, and an open roadmap for culturally relevant datasets.
+
+ ## 🚀 What's New in v2
+
+ - **Modern lm-evaluation-harness**: Upgraded from a legacy fork to `lm-eval>=0.4.11` with YAML-based task definitions and the modern CLI (`lm-eval run`)
+ - **Expanded language support**: Added Valencian (VA) and Portuguese (PT) language tabs
+ - **Updated dependencies**: Transformers 4.51, Gradio 5.25, Datasets 3.5, Python 3.11+
+ - **Extended precision support**: 8bit, 4bit, GPTQ
+ - **Open dataset roadmap**: New README annex documenting potential culturally relevant datasets for Spain and LATAM
+
+ ## 📊 Architecture
+
+ ```
+ la-leaderboard-v2/           # Gradio Space (frontend)
+ la-leaderboard-v2-requests/  # HF Dataset (evaluation queue)
+ la-leaderboard-v2-results/   # HF Dataset (evaluation results)
+ ```
+
+ ### Frontend (this repo)
+ - `app.py` — Gradio UI with tabs: Summary, ES, CA, EU, GL, VA, PT, Time/CO2, Info, Tasks, Submit
+ - `src/` — Data processing, leaderboard rendering, submission handling
+ - `tasks/` — Task registry (CSV + generated JSON/YAML for harness)
+
+ ### Backend (separate Space recommended)
+ - Polls `requests` dataset for PENDING evaluations
+ - Runs `lm-eval run` with `--include_path` pointing to custom task YAMLs
+ - Pushes results to `results` dataset
+
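The backend steps above can be sketched roughly as follows. This is a minimal illustration, assuming each queue entry is a dict with hypothetical `model`, `revision`, `precision`, and `status` fields (the actual schema of the `requests` dataset is not documented here). It only filters PENDING entries and builds the `lm-eval run` argument list; it does not download the queue, launch anything, or push results.

```python
from typing import Dict, List


def pending_requests(requests: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Return only the requests still waiting to be evaluated."""
    # The "status" field name is an assumption; "PENDING" mirrors the README wording.
    return [req for req in requests if req.get("status") == "PENDING"]


def build_command(request: Dict[str, str], include_path: str, output_path: str,
                  tasks: str = "laleaderboard") -> List[str]:
    """Build the argument list for one `lm-eval run` call (nothing is executed)."""
    model_args = "pretrained={model},revision={revision},dtype={precision}".format(**request)
    return [
        "lm-eval", "run",
        "--model", "hf",
        "--model_args", model_args,
        "--tasks", tasks,
        "--include_path", include_path,
        "--output_path", output_path,
    ]


# Hypothetical queue contents, for illustration only.
queue = [
    {"model": "org/model-a", "revision": "main", "precision": "bfloat16", "status": "PENDING"},
    {"model": "org/model-b", "revision": "main", "precision": "float16", "status": "FINISHED"},
]
commands = [build_command(req, "./tasks", "./results") for req in pending_requests(queue)]
print(commands[0])
```

A real backend would hand each list to `subprocess.run` and then update the request status; both steps are left out of this sketch.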
+ ## 🛠️ Reproducibility
+
+ ### Install modern lm-eval-harness
+
+ ```bash
+ # Quote the requirement so the shell does not treat ">=" as an output redirection
+ pip install "lm-eval>=0.4.11"
+ ```
+
+ ### Run full leaderboard evaluation
+
+ ```bash
+ lm-eval run --model hf \
+   --model_args "pretrained=<your_model>,revision=<rev>,dtype=<dtype>" \
+   --tasks=laleaderboard \
+   --num_fewshot=5 \
+   --device="cuda:0" \
+   --batch_size=auto \
+   --output_path=<output_path>
+ ```
+
+ ### Run single-language evaluation
+
+ ```bash
+ lm-eval run --model hf \
+   --model_args "pretrained=<your_model>,revision=<rev>,dtype=<dtype>" \
+   --tasks=laleaderboard_es \
+   --num_fewshot=5 \
+   --device="cuda:0" \
+   --batch_size=auto
+ ```
+
+ Supported language suffixes: `es`, `ca`, `eu`, `gl`, `va`, `pt`.
+
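As a small sketch of how the per-language task names could be derived, the snippet below assumes every suffix follows the same `laleaderboard_<suffix>` pattern as the `laleaderboard_es` example above. Only the `es` name is shown in this README, so the other five names are an assumption.

```python
LANGUAGE_SUFFIXES = ["es", "ca", "eu", "gl", "va", "pt"]


def language_task(suffix: str) -> str:
    """Map a language suffix to its (assumed) harness group name."""
    if suffix not in LANGUAGE_SUFFIXES:
        raise ValueError(f"unsupported language suffix: {suffix}")
    return f"laleaderboard_{suffix}"


task_names = [language_task(suffix) for suffix in LANGUAGE_SUFFIXES]
print(",".join(task_names))
```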
+ ### Validate custom tasks
+
+ ```bash
+ lm-eval validate --tasks my_custom_task --include_path ./custom_tasks
+ ```
+
+ ## 📁 Task Registry
+
+ Tasks are defined in `tasks/tasks.csv`. Run `python tasks/generate.py` to regenerate:
+ - `tasks/backend.json` — Flat list of harness task names for the backend
+ - `tasks/dummy_results.json` — Template result JSON for new submissions
+ - `tasks/harness.yaml` — Harness benchmark group definition
+
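The real `tasks/generate.py` is not reproduced in this README. The sketch below only illustrates the idea of deriving the three files from one CSV. The column names (`harness_name`, `language`), the sample rows, and the exact shape of each output are assumptions made for the example.

```python
import csv
import io
import json

# Hypothetical registry contents; the real tasks.csv columns may differ.
SAMPLE_CSV = """harness_name,language
task_a_es,es
task_b_ca,ca
"""


def load_tasks(csv_text: str) -> list:
    """Parse the registry CSV into a list of row dicts."""
    return list(csv.DictReader(io.StringIO(csv_text)))


def backend_list(rows: list) -> list:
    """Flat list of harness task names (content of backend.json)."""
    return [row["harness_name"] for row in rows]


def dummy_results(rows: list) -> dict:
    """Empty result template with one entry per task (dummy_results.json)."""
    return {"results": {row["harness_name"]: {} for row in rows}}


def harness_group_yaml(rows: list, group: str = "laleaderboard") -> str:
    """Render a simple group definition listing every task (harness.yaml)."""
    lines = [f"group: {group}", "task:"]
    lines += [f"  - {row['harness_name']}" for row in rows]
    return "\n".join(lines) + "\n"


rows = load_tasks(SAMPLE_CSV)
backend_json = json.dumps(backend_list(rows), indent=2)
results_json = json.dumps(dummy_results(rows), indent=2)
group_yaml = harness_group_yaml(rows)
print(group_yaml)
```

Keeping the CSV as the single source of truth means a new task only has to be added in one place before regenerating the derived files.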
+ ## 🧩 Custom Task Format (YAML)
+
+ The modern harness uses YAML-based task configs. Example for a Spanish exam task:
+
+ ```yaml
+ tag:
+   - multiple_choice
+ task: ebas_matematicas
+ dataset_path: org/spanish-exam-dataset
+ dataset_name: null
+ output_type: multiple_choice
+ training_split: train
+ validation_split: validation
+ test_split: null
+ doc_to_text: "Pregunta de examen: {{question}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nRespuesta:"
+ doc_to_target: "{{answer}}"
+ doc_to_choice: ["A", "B", "C", "D"]
+ metric_list:
+   - metric: acc
+     aggregation: mean
+     higher_is_better: true
+ metadata:
+   version: 1.0
+ ```
+
+ Place custom task YAMLs in a directory and pass `--include_path ./my_tasks` to `lm-eval run`.
+
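To preview the prompt that the `doc_to_text` template in the example config would produce, the sketch below mimics it in plain Python. The harness itself renders these fields with Jinja2, so this is only an approximation. The sample document is invented, and storing `answer` as a letter is an assumption about the dataset.

```python
LETTERS = ["A", "B", "C", "D"]

# Invented sample document with the fields the example config expects.
doc = {
    "question": "¿Cuánto es 2 + 2?",
    "choices": ["3", "4", "5", "6"],
    "answer": "B",
}


def render_prompt(doc: dict) -> str:
    """Plain-Python equivalent of the doc_to_text template."""
    lines = [f"Pregunta de examen: {doc['question']}"]
    lines += [f"{letter}. {choice}" for letter, choice in zip(LETTERS, doc["choices"])]
    lines.append("Respuesta:")
    return "\n".join(lines)


def gold_index(doc: dict) -> int:
    """Position of the gold answer within the doc_to_choice list."""
    return LETTERS.index(doc["answer"])


print(render_prompt(doc))
print("gold choice index:", gold_index(doc))
```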
+ ## 📎 Annex — Potential Datasets for Spanish Cultural & Linguistic Evaluation
+
+ > This annex documents datasets, benchmarks, and data sources that could **add value** to La Leaderboard by capturing the cultural, linguistic, and idiosyncratic richness of Spain and Latin America. Not all are publicly available or digitized yet; this is a living roadmap for community contributions.
+
+ ---
+
+ ### 🇪🇸 1. Spanish Government & Competitive Exams (Digitization Required)
+
+ Following the successful model of **LLMzSzŁ** (Polish national exams) and **KMMLU** (Korean CSAT/bar exams), Spain has a rich ecosystem of standardized competitive exams that are ideal for LLM evaluation:
+
+ | Exam | Description | Source / How to Obtain | Status | Estimated Size |
+ |------|-------------|----------------------|--------|----------------|
+ | **EBAU / Selectividad / PAU / EvAU / PCE** | University entrance exams by autonomous community. Subjects: Maths, Physics, Chemistry, History, Spanish Language, English, Philosophy. | Each CCAA publishes PDFs on their education portals (e.g., `educacion.jccm.es`, `ensenyament.gencat.cat`, `juntadeandalucia.es/educacion`). Also aggregated on `examenesdepau.com`, `muchosexamenes.com`. | ⚠️ Needs PDF digitization (PyPDF + regex pipeline) | ~5,000–10,000 questions/year across 17 CCAA × 10 years |
+ | **Oposiciones** (Civil Service) | Competitive exams for public administration positions: administrative, teaching, nursing, firefighter, police. Extremely diverse subject matter. | `boe.es` (convocatorias), `aprende.gob.es`, community portals. Some already digitized on preparation platforms. | ⚠️ Fragmented; needs community scraping + OCR for image-based PDFs | Highly variable; potentially 20,000+ questions |
+ | **MIR** (Medical Residency) | Already partially digitized by `casimedicos.com` and on HF as `HiTZ/casimedicos-exp`. | `casimedicos.com`, `mirial.es`, Ministry of Health. | ✅ Partially available; needs 2023–2025 updates | ~3,000–5,000 questions/year |
+ | **EIR** (Nursing Residency) | Similar structure to MIR, for nursing. | `casimedicos.com` (nursing section), Ministry of Health. | ⚠️ Needs digitization | ~2,000 questions/year |
+ | **QIR** (Chemistry), **FIR** (Pharmacy), **PIR** (Psychology) | Specialized health-sciences residency exams (Formación Sanitaria Especializada). | Same sources as MIR. | ⚠️ Needs digitization | ~1,000–2,000 each/year |
+ | **PEvAU / EBAU Andalucía** | Specific Andalusian university entrance exam with public PDF archive. | `juntadeandalucia.es/educacion` → "Exámenes de acceso a la universidad" | ✅ Public PDFs available | ~500 questions/year |
+ | **Acceso Mayor 25/45** | University access for adults over 25/45. Questions test general knowledge + specific subjects. | CCAA education portals. | ⚠️ Needs digitization | ~1,000 questions/year |
+ | **FP Grado Superior** | Vocational training access exams. Technical + general knowledge. | CCAA education portals, `todofp.es`. | ⚠️ Needs digitization | ~2,000 questions/year |
+
+ **Digitization pipeline** (based on LLMzSzŁ methodology):
+ 1. Bulk-download PDFs by year/subject from official portals
+ 2. Extract the text layer with `pypdf` (the maintained successor to `PyPDF2`) or `pdfplumber`
+ 3. Regex-parse: question number → question text → options A/B/C/D → correct answer
+ 4. Handle image-based PDFs with OCR (`Tesseract` or `EasyOCR`), using `pymupdf` to render pages to images first
+ 5. Manual review of answer keys (often in separate PDFs)
+ 6. Output: `{"question": "...", "choices": [...], "answer": 0, "subject": "Historia", "year": 2023, "ccaa": "Andalucía"}`
+
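Step 3 of the pipeline above (regex parsing) might look roughly like the sketch below. It assumes the extracted text already has one question per block in a `1. ... A) ... B) ... C) ... D) ...` layout and that the answer key was transcribed separately into a dict. Real exam PDFs vary widely by region and year, so both the pattern and the sample text are illustrative only.

```python
import json
import re

# One numbered question followed by four lettered options, each on its own line.
QUESTION_RE = re.compile(
    r"^(?P<number>\d+)\.\s+(?P<question>.+?)\n"
    r"A\)\s*(?P<a>.+?)\n"
    r"B\)\s*(?P<b>.+?)\n"
    r"C\)\s*(?P<c>.+?)\n"
    r"D\)\s*(?P<d>.+?)\s*$",
    re.MULTILINE,
)

# Invented sample of already-extracted text.
SAMPLE_TEXT = """1. ¿En qué año se aprobó la Constitución española?
A) 1975
B) 1978
C) 1981
D) 1986
"""


def parse_questions(text: str, answer_key: dict, subject: str, year: int, ccaa: str) -> list:
    """Turn extracted exam text plus an answer key into MCQ records."""
    records = []
    for match in QUESTION_RE.finditer(text):
        number = int(match.group("number"))
        if number not in answer_key:
            continue  # no key for this question: leave it for manual review
        records.append({
            "question": match.group("question").strip(),
            "choices": [match.group(key).strip() for key in "abcd"],
            "answer": "ABCD".index(answer_key[number]),
            "subject": subject,
            "year": year,
            "ccaa": ccaa,
        })
    return records


records = parse_questions(SAMPLE_TEXT, {1: "B"}, "Historia", 2023, "Andalucía")
print(json.dumps(records[0], ensure_ascii=False))
```

Questions without an entry in the answer key are skipped rather than guessed, which matches the manual-review step in the list.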
+ **Important**: To avoid data contamination, test splits should be held back and NOT published until after models are evaluated, following the KMMLU private-test-set model.
+
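One possible way to carve out a held-back split is a deterministic hash of the question text, sketched below. This is an illustrative choice, not the procedure used by KMMLU or any other benchmark named here, and the 20% fraction is an arbitrary assumption.

```python
import hashlib


def is_private(question: str, private_fraction: float = 0.2) -> bool:
    """Assign a question to the private split based on a stable hash of its text."""
    digest = hashlib.sha256(question.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # value in [0, 1)
    return bucket < private_fraction


def split_records(records: list, private_fraction: float = 0.2) -> tuple:
    """Return (public, private) lists; the same input always yields the same split."""
    public, private = [], []
    for record in records:
        target = private if is_private(record["question"], private_fraction) else public
        target.append(record)
    return public, private


# Invented sample records.
sample = [{"question": f"Pregunta de ejemplo {i}"} for i in range(10)]
public_split, private_split = split_records(sample)
print(len(public_split), "public /", len(private_split), "private")
```

Because the assignment depends only on the question text, re-running the pipeline on a new year of exams never moves an existing question from the private split into the public one.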
+ ---
+
+ ### 🏛️ 2. Legal Datasets (Spain-Specific)
+
+ | Dataset | Description | Source | Status | Format |
+ |---------|-------------|--------|--------|--------|
+ | **BOE Legal Corpus** | Full text of Spanish Official State Gazette. Massive legal text resource. | `boe.es` (daily bulletin), `Pepere45/spanish-boe-legal-corpus` on HF | ⚠️ Gated / needs request | Text / XML |
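+ | **CENDOJ** (Centro de Documentación Judicial) | Spanish case law (jurisprudencia) from Supreme Court, National Court, etc. | `poderjudicial.es/search` (CENDOJ is run by the CGPJ); confirm bulk or programmatic access terms with the CGPJ | ⚠️ Public search portal; bulk access needs confirmation | XML |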
+ | **BOE-XSUM** | Extreme summarization of BOE documents into plain Spanish. Tests legal comprehension + simplification. | Paper: arXiv:2509.24908 | ⚠️ Needs re-creation or permission | Text pairs |
+ | **PlanTL-GOB-ES lm-legal-es** | Spanish legal-domain language model + corpus from BOE, CENDOJ. | GitHub: `PlanTL-GOB-ES/lm-legal-es` | ⚠️ Partially available | Text |
+ | **SpaLawEx** | Already in La Leaderboard! Spanish Law School Access Exams. | `LenguajeNaturalAI/examenes_abogacia` on HF | ✅ Available | MCQ |
+ | **European Court of Human Rights (Spanish)** | Case descriptions + violation/no-violation labels. | `icLR/echr_cases` or build from HUDOC | ⚠️ Needs filtering for Spanish | Text + label |
+ | **Spanish Constitutional Court rulings** | TC rulings with subject classification. | `tribunalconstitucional.es` | ⚠️ Needs scraping + annotation | Text |
+ | **EU Law (Spanish)** | EUR-Lex corpus in Spanish. EU directives, regulations, decisions. | `eur-lex.europa.eu` | ✅ Public; needs download pipeline | Text / XML |
+
+ **Suggested task formats**:
+ - **Legal MCQ**: Extract multiple-choice questions from bar exam prep books (e.g., *Civitas*, *Tecnos*). Copyright-restricted but publishers may collaborate.
+ - **Legal NLI**: Given a BOE article + a citizen query, determine if the article entails/contradicts/is neutral to the query.
+ - **Legal summarization**: Simplify a BOE resolution into layperson Spanish.
+ - **Legal entailment**: Given a CENDOJ ruling summary, determine which legal principle applies.
+
+ ---
+
+ ### 🏥 3. Medical & Health Datasets
+
+ | Dataset | Description | Source | Status |
+ |---------|-------------|--------|--------|
+ | **CasiMedicos / HEAD-QA v2** | Already in La Leaderboard! MIR medical exam with explanations. | `HiTZ/casimedicos-exp` on HF | ✅ Available, multilingual (es/en/fr/it) |
+ | **MedExpQA** (ES) | Medical explanatory QA with reasoning chains. | Mentioned in La Leaderboard paper Table 2; check with authors | ⚠️ Needs confirmation |
+ | **Spanish clinical notes (anonymized)** | Real clinical notes for NER/entity linking. | PlanTL-GOB-ES, BSC projects | ⚠️ Privacy-restricted |
+ | **Spanish medical Wikipedia** | Medical articles from Spanish Wikipedia. | `es.wikipedia.org` (CC BY-SA) | ✅ Public; needs QA generation |
+ | **Medicina en español (MEDLINE)** | Spanish-language biomedical abstracts. | PubMed/MEDLINE Spanish filter | ✅ Public |
+ | **Farmacovigilancia (AEMPS)** | Spanish drug safety reports. | `aemps.gob.es` | ⚠️ Needs request + anonymization |
+
+ **Suggested medical tasks**:
+ - **Differential diagnosis**: Given symptoms in Spanish, select most likely diagnosis (from MIR case descriptions)
+ - **Drug interaction**: Given two medications, determine if there's a known interaction (from AEMPS data)
+ - **Patient-doctor dialogue**: Evaluate if model can produce culturally appropriate Spanish medical explanations
+
+ ---
+
+ ### 🎭 4. Cultural Knowledge & Idiosyncrasy Datasets (High Priority Gap)
+
+ This is the **biggest gap** in current benchmarks. No existing leaderboard adequately tests LLM knowledge of Spanish culture, history, traditions, gastronomy, sports, and regional idiosyncrasies. Inspired by **CulturalBench** (16 countries, but Spain not included) and **BertaQA** (Basque cultural trivia):
+
+ #### 4.1 Spain-Specific Cultural Knowledge Benchmark (Proposed: *IberiaCult*)
+
+ **Methodology** (based on CulturalBench / BLEnD):
+ 1. Seed queries on Spanish cultural topics
+ 2. Human-AI collaborative generation: native Spaniards write questions, LLMs generate distractors, humans validate
+ 3. 5-annotator majority vote for correct answers
+ 4. MinHash deduplication against major pretraining corpora
+
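Steps 3 and 4 of the methodology above can be illustrated with a small standard-library sketch. The agreement threshold of 3 out of 5, the word 3-gram shingles, and the 64 hash functions are all assumptions chosen for the example. A production pipeline would more likely rely on a dedicated MinHash/LSH library and an index over the pretraining corpus.

```python
import hashlib
from collections import Counter


def majority_answer(votes: list, min_agreement: int = 3):
    """Return the most common vote, or None if too few annotators agree."""
    answer, count = Counter(votes).most_common(1)[0]
    return answer if count >= min_agreement else None


def shingles(text: str, n: int = 3) -> set:
    """Set of lowercase word n-grams; short texts yield a single shingle."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(max(1, len(words) - n + 1))}


def minhash_signature(text: str, num_hashes: int = 64) -> list:
    """For each seeded hash function, keep the minimum hash over all shingles."""
    grams = shingles(text)
    signature = []
    for seed in range(num_hashes):
        signature.append(min(
            int.from_bytes(hashlib.sha1(f"{seed}:{gram}".encode("utf-8")).digest()[:8], "big")
            for gram in grams
        ))
    return signature


def estimated_jaccard(sig_a: list, sig_b: list) -> float:
    """Fraction of matching signature positions approximates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


votes = ["B", "B", "B", "A", "C"]
question = "¿En qué ciudad se celebran las Fallas cada mes de marzo?"
unrelated = "El río Miño marca parte de la frontera entre Galicia y Portugal"
print(majority_answer(votes))
print(estimated_jaccard(minhash_signature(question), minhash_signature(unrelated)))
```

Questions whose estimated similarity to a corpus passage exceeds a chosen threshold would then be dropped or rewritten before release.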
+ | Topic Domain | Example Questions | Source of Inspiration |
+ |--------------|-------------------|----------------------|
+ | **History** | "¿Quién fue el primer presidente de la democracia española?" / "¿En qué año se produjo el 23-F?" | Spanish history curriculum, Wikipedia |
+ | **Geography** | "¿Cuál es la capital de La Rioja?" / "¿Qué río separa España de Portugal?" | School geography, IGN maps |
+ | **Traditions & Fiestas** | "¿En qué ciudad se celebran las Fallas?" / "¿Qué se arroja en la Tomatina?" | Official fiesta calendars, UNESCO intangible heritage |
+ | **Gastronomy** | "¿Cuál es el ingrediente principal del gazpacho andaluz?" / "¿De qué zona es el vino albariño?" | Denominaciones de Origen (MAPA), Spanish cookbooks |
+ | **Sports** | "¿Cuántas Ligas ha ganado el Real Madrid?" / "¿Quién ganó la medalla de oro en 100m lisos en Barcelona 92?" | RFEF, COE, press archives |
+ | **Politics & Constitution** | "¿Cuántos artículos tiene la Constitución española de 1978?" / "¿Qué comunidades autónomas tienen cooficialidad lingüística?" | Congreso.es, Constitutional text |
+ | **Art & Literature** | "¿Quién pintó 'Las Meninas'?" / "¿En qué ciudad nació Federico García Lorca?" | Museo del Prado, RAE, school curriculum |
+ | **Music & Cinema** | "¿Qué grupo español cantó 'La ciudad de los gatos'?" / "¿Quién dirigió 'El espíritu de la colmena'?" | RTVE archives, SGAE |
+ | **Daily Life & Idioms** | "¿Qué significa la expresión 'tener morro'?" / "¿A qué hora se suele cenar en España?" | RAE DLE, colloquial corpora, Reddit r/Spain |
+ | **Regional Knowledge** | "¿En qué provincia se encuentra el Parque Nacional de Ordesa?" / "¿Qué idioma se habla en el Val d'Aran?" | CCAA tourism + education portals |
+
+ **Potential data sources**:
+ - **Trivial Pursuit España** editions (question cards — copyright)
+ - **Pasapalabra** (Spanish TV quiz show — transcripts available)
+ - **Saber y Ganar** (TVE cultural quiz — public broadcaster, may have archives)
+ - **Spanish school textbooks** (history, geography, philosophy — copyright)
+ - **Wikipedia es** (CC BY-SA — can generate QA with modern pipelines)
+ - **Reddit r/AskSpain, r/Spain** threads (CC — filter for factual questions)
+ - **RTVE archives** (public TV — cultural programs with transcripts)
+
+ ---
+
+ ### 🌎 5. Latin American Cultural & Linguistic Varieties
+
+ | Dataset / Source | Language/Variety | Description | How to Obtain |
+ |------------------|-----------------|-------------|---------------|
+ | **PAES Chile** | es-CL | Chilean university entrance exam (Prueba de Acceso a la Educación Superior). Similar to EBAU. | `demre.cl` publishes PDFs with answer keys. |
+ | **ENEM Brasil** (Portuguese) | pt-BR | Brazilian national high school exam. Portuguese baseline for LATAM. | `enem.inep.gov.br` — microdata available. |
+ | **Mexican EXANI** | es-MX | Mexican university entrance exam (CENEVAL). | `ceneval.edu.mx` — some materials public. |
+ | **Colombian ICFES / Saber 11** | es-CO | Colombian standardized tests. | `icfes.gov.co` — public datasets. |
+ | **Argentine CBC / Ingreso** | es-AR | Argentine university entrance exams. | Each university (UBA, UNT, etc.) publishes exams. |
+ | **Voces Originarias** | Indigenous | Indigenous voices and languages of LATAM. | Mentioned in La Leaderboard paper; check with authors. |
+ | **AmericasNLP** | Indigenous | NLI, MT for Aymara, Guaraní, Quechua, Náhuatl, etc. | `americasnlp.org` + HF Hub |
+ | **Meta4XNLI** | es / en | Parallel Spanish-English NLI corpus annotated for metaphor detection and interpretation (not an indigenous-language resource). | Paper: check authors for data access |
+ | **ASALE / RAE DLE** | es | Royal Spanish Academy Dictionary — authority on Spanish usage. | `rae.es` — API for definitions (for linguistic acceptability tasks) |
+ | **CREA / Corpus de Referencia del Español Actual** | es | RAE reference corpus of modern Spanish usage. | `rae.es` — may require academic agreement |
+ | **CORLEC / Corpus Oral de Referencia de la Lengua Española Contemporánea** | es | Reference corpus of contemporary spoken Spanish. | UAM / UCM (academic access) |
+
+ ---
+
+ ### 🗣️ 6. Regional & Minority Languages of Spain (Beyond Co-Official)
+
+ The current La Leaderboard covers Spanish, Catalan, Basque, and Galician. Spain has additional recognized languages that lack LLM benchmarks:
+
+ | Language | Status | Existing Resources | What's Needed |
+ |----------|--------|-------------------|---------------|
+ | **Aragonese** | Recognized (Aragón) | FLORES+ (WMT 2024): `oldi.org/cards/flores/arg_Latn` | Reading comprehension, QA, NLI benchmark |
+ | **Asturian / Bable** | Recognized (Asturias) | FLORES+: `oldi.org/cards/flores/ast_Latn`, PILAR project | Reading comprehension, QA |
+ | **Leonese** | Recognized (Castilla y León) | Very limited | Needs corpus creation from oral history projects |
+ | **Extremaduran** | Recognized (Extremadura) | Very limited | Needs community collection |
+ | **Fala** | Recognized (Extremadura) | Very limited | Needs community collection |
+ | **Occitan / Aranese** | Co-official in Val d'Aran (Catalonia) | FLORES+, some Aranese school materials | Reading comprehension, basic QA |
+ | **Caló** | Historical Romani variety in Spain | Very limited | Ethical considerations; work with Roma communities |
+
+ **Suggested approach**: Partner with regional language academies (Academia de la Llingua Asturiana, Academia Aragonesa de la Lengua) to digitize school textbooks and create reading comprehension + QA benchmarks.
+
+ ---
+
+ ### 📚 7. Existing Spanish NLP Benchmarks & Shared Tasks
+
+ Many evaluation datasets have been produced by Spanish NLP shared tasks but are not yet integrated into LLM leaderboards:
+
+ | Benchmark / Shared Task | Organizer | Task | Years | Data Access |
+ |------------------------|-----------|------|-------|-------------|
+ | **TASS** (Sentiment Analysis in Spanish) | SEPLN | Sentiment, emotion, irony detection | 2012–present | `tass.sepln.org` — download page |
+ | **IberLEF** | SEPLN / Iberian NLP community | Multiple tasks: NER, sentiment, fake news, hate speech, etc. | 2019–present | `ceur-ws.org` proceedings + Zenodo |
+ | **Hackathon del ML4ALL** | Various | Spanish-specific ML challenges | Ongoing | GitHub repos |
+ | **PlanTL-GOB-ES** | Spanish government NLP initiative | STS-es, SQAC, EsExams, etc. | 2020–present | `huggingface.co/PlanTL-GOB-ES` (some gated) |
+ | **HiTZ evaluation suite** | HiTZ (Basque) | EusExams, EusProficiency, BertaQA, etc. | 2022–present | `huggingface.co/HiTZ` |
+ | **AINA / BSC** | BSC (Catalan) | CatalanQA, caBreu, CoQCat, etc. | 2023–present | `huggingface.co/projecte-aina` |
+ | **ILENIA / USC** | USC (Galician) | GalCoLA, summarization, paraphrasing | 2023–present | `huggingface.co/proxectonos` |
+ | **MedExpQA** | HiTZ (mentioned in paper) | Medical explanatory QA in Spanish | — | Contact authors |
+ | **Conan EUS** | HiTZ | Counter-narrative generation (Basque/Spanish) | — | Contact authors |
+ | **H4rmony Eval** | Unknown | Ethics evaluation in Spanish | — | Contact authors |
+ | **VeritasQA** | Unknown | Truthfulness QA in ES/CA/GL | — | Contact authors |
+
  ---
+
+ ### 🏛️ 8. Institutional & Government Data Portals
+
+ These portals contain raw data that could be transformed into evaluation datasets:
+
+ | Portal | Institution | Content | How to Use |
+ |--------|-------------|---------|------------|
+ | **datos.gob.es** | Spanish open data portal | Datasets from all ministries | Filter for text-heavy datasets; generate QA |
+ | **transparencia.gob.es** | Government transparency | Reports, budgets, contracts | Summarization, NLI, document QA |
+ | **boe.es** | Boletín Oficial del Estado | All Spanish legislation | Legal comprehension, summarization |
+ | **senado.es / congreso.es** | Parliament | Debates, bills, questions | Political reasoning, summarization |
+ | **ine.es** | National Statistics Institute | Statistical reports, surveys | Data-to-text generation, QA |
+ | **aemet.es** | Meteorology agency | Weather forecasts, alerts | Domain-specific text generation |
+ | **eldiario.es / elpais.com / 20minutos.es** | Press (archives) | News articles (with paywalls) | Summarization, NLI, but copyright-restricted |
+ | **rtve.es / rtve.es/alacarta** | Public broadcaster | TV/radio transcripts, subtitles | Public service content; potential for speech-to-text eval |
+ | **bne.es** | National Library | Digitized books, newspapers | Historical Spanish text; OCR quality varies |
+
+ ---
+
+ ### 🎯 9. Recommended Priorities for Dataset Expansion
+
+ Based on the research above, we recommend the following priority order for expanding La Leaderboard's cultural and linguistic coverage:
+
+ | Priority | Dataset | Effort | Impact | Lead |
+ |----------|---------|--------|--------|------|
+ | 🔴 **P0** | **EBAU/Selectividad digitization** (all 17 CCAA, 2015–2025) | High (scraping + PDF parsing + manual review) | Very High — tests general knowledge + reasoning in Spanish | Community + NLP Spain |
+ | 🔴 **P0** | **Spain Cultural Knowledge Benchmark** (*IberiaCult*) | Medium (human annotation + AI generation) | Very High — unique cultural evaluation | Cultural institutions + NLP Spain |
+ | 🟡 **P1** | **Expand CasiMedicos** with 2023–2025 MIR/EIR/QIR/FIR/PIR | Low-Medium (update existing pipeline) | High — medical reasoning in Spanish | HiTZ / medical community |
+ | 🟡 **P1** | **Legal benchmark** from BOE + CENDOJ + bar exam prep | Medium | High — legal reasoning | Law faculties + NLP Spain |
+ | 🟡 **P1** | **PAES Chile + LATAM exams** | Medium | High — LATAM variety coverage | LATAM NLP communities |
+ | 🟢 **P2** | **Aragonese / Asturian / Occitan benchmarks** | Medium-High (corpus creation) | Medium — linguistic diversity | Regional language academies |
+ | 🟢 **P2** | **TASS / IberLEF task integration** | Low (data already exists) | Medium | SEPLN community |
+ | 🔵 **P3** | **Valencian-specific tasks** (distinct from Catalan) | Low-Medium | Medium — regional identity | AINA / Valencia NLP |
+ | 🔵 **P3** | **Portuguese (Brazil + Portugal) benchmarks** | Low-Medium | Medium — Lusophone coverage | Portugal / Brazil NLP |
+
+ ---
+
+ ### 🤝 How to Contribute a Dataset
+
+ 1. **Check** if the dataset is already on HuggingFace Hub or listed above
+ 2. **Prepare** the data in one of these formats:
+ - **MCQ**: `{"question": "...", "choices": ["A", "B", "C", "D"], "answer": 0, ...}`
+ - **QA**: `{"question": "...", "context": "...", "answers": ["..."], ...}`
+ - **NLI**: `{"premise": "...", "hypothesis": "...", "label": "entailment|neutral|contradiction"}`
+ - **Summarization**: `{"article": "...", "summary": "..."}`
+ 3. **Create** a HuggingFace Dataset repo with a Dataset Card
+ 4. **Write** a YAML task definition for `lm-eval-harness` (see examples above)
+ 5. **Open** a discussion on [La Leaderboard v2 Community](https://huggingface.co/spaces/pauvanbr/la-leaderboard-v2/discussions) or submit a PR
+
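Before uploading, contributors may want to check their records against the four formats listed in step 2. The validator below is a minimal sketch built only from the field names shown in that list. The function name, the error messages, and the strictness of the checks are assumptions rather than an official schema.

```python
# Required fields and their expected Python types, per format.
REQUIRED_FIELDS = {
    "mcq": {"question": str, "choices": list, "answer": int},
    "qa": {"question": str, "context": str, "answers": list},
    "nli": {"premise": str, "hypothesis": str, "label": str},
    "summarization": {"article": str, "summary": str},
}
NLI_LABELS = {"entailment", "neutral", "contradiction"}


def validate_record(record: dict, fmt: str) -> list:
    """Return a list of problems found in one record (empty list means valid)."""
    errors = []
    for field, expected_type in REQUIRED_FIELDS[fmt].items():
        if field not in record:
            errors.append(f"missing field: {field}")
        elif not isinstance(record[field], expected_type):
            errors.append(f"wrong type for field: {field}")
    if errors:
        return errors
    # Format-specific checks run only once the basic shape is correct.
    if fmt == "mcq" and not 0 <= record["answer"] < len(record["choices"]):
        errors.append("answer index out of range")
    if fmt == "nli" and record["label"] not in NLI_LABELS:
        errors.append("unknown NLI label")
    return errors


good_mcq = {"question": "¿Quién pintó 'Las Meninas'?",
            "choices": ["Goya", "Velázquez", "El Greco", "Picasso"],
            "answer": 1}
print(validate_record(good_mcq, "mcq"))
print(validate_record({"premise": "p", "hypothesis": "h", "label": "maybe"}, "nli"))
```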
+ ---
+
+ ### 📖 References & Further Reading
+
+ | Paper / Resource | Relevance |
+ |-----------------|-----------|
+ | *La Leaderboard* (ACL 2025, arXiv:2507.00999) | Original leaderboard methodology |
+ | *IberBench* (arXiv:2504.16921) | Complementary Iberian benchmark with 101 datasets |
+ | *LLMzSzŁ* (arXiv:2501.02266) | Polish national exam digitization pipeline — model for EBAU |
+ | *KMMLU* / *Open Ko-LLM* (arXiv:2402.11548, 2410.12445) | Korean exam benchmarks + private test sets |
+ | *CulturalBench* (arXiv:2410.02677) | Methodology for cultural knowledge benchmarks |
+ | *BLEnD* (arXiv:2406.09948) | Multicultural QA across 16 countries |
+ | *BertaQA* (arXiv:2406.07302) | Basque cultural trivia — model for regional cultural QA |
+ | *EXAMS* (EMNLP 2020, arXiv:2011.03080) | Multilingual high school exams (includes 235 Spanish questions) |
+ | *CasiMedicos* (arXiv:2503.00025) | Spanish MIR evaluation |
+ | *PARIKSHA* (arXiv:2406.15053) | Human-LLM evaluator agreement on multicultural data |
+
  ---
 
+ *This README and annex are living documents. If you know of a dataset, exam, or benchmark source not listed here, please open a discussion or PR!* 💛