# La Leaderboard v2 — LLM Leaderboard for Ibero-American Languages

[![Paper](https://img.shields.io/badge/ACL%202025-Paper-blue)](https://aclanthology.org/2025.acl-long.1561/) [![Space](https://img.shields.io/badge/🤗%20Space-La%20Leaderboard-orange)](https://huggingface.co/spaces/pauvanbr/la-leaderboard-v2) [![License](https://img.shields.io/badge/License-Apache%202.0-green.svg)](https://opensource.org/licenses/Apache-2.0)

**La Leaderboard** is the first open-source leaderboard to evaluate generative LLMs in languages and language varieties of Spain and Latin America. This is the **v2** update with an upgraded evaluation framework, expanded language support, and an open roadmap for culturally relevant datasets.

## 🚀 What's New in v2

- **Modern lm-evaluation-harness**: Upgraded from a legacy fork to `lm-eval>=0.4.11` with YAML-based task definitions and the modern CLI (`lm-eval run`)
- **Expanded language support**: Added Valencian (VA) and Portuguese (PT) language tabs
- **Updated dependencies**: Transformers 4.51, Gradio 5.25, Datasets 3.5, Python 3.11+
- **Extended precision support**: 8-bit, 4-bit, GPTQ
- **Open dataset roadmap**: New README annex documenting potential culturally relevant datasets for Spain and LATAM

## 📊 Architecture

```
la-leaderboard-v2/            # Gradio Space (frontend)
la-leaderboard-v2-requests/   # HF Dataset (evaluation queue)
la-leaderboard-v2-results/    # HF Dataset (evaluation results)
```

### Frontend (this repo)

- `app.py` — Gradio UI with tabs: Summary, ES, CA, EU, GL, VA, PT, Time/CO2, Info, Tasks, Submit
- `src/` — Data processing, leaderboard rendering, submission handling
- `tasks/` — Task registry (CSV + generated JSON/YAML for harness)

### Backend (separate Space recommended)

- Polls `requests` dataset for PENDING evaluations
- Runs `lm-eval run` with `--include_path` pointing to custom task YAMLs
- Pushes results to `results` dataset

## 🛠️ Reproducibility

### Install modern lm-eval-harness

```bash
# Quote the requirement so the shell does not treat ">=" as an output redirect
pip install "lm-eval>=0.4.11"
```

### Run full leaderboard evaluation

```bash
lm-eval run --model hf \
    --model_args "pretrained=,revision=,dtype=" \
    --tasks=laleaderboard \
    --num_fewshot=5 \
    --device="cuda:0" \
    --batch_size=auto \
    --output_path=
```

### Run single-language evaluation

```bash
lm-eval run --model hf \
    --model_args "pretrained=,revision=,dtype=" \
    --tasks=laleaderboard_es \
    --num_fewshot=5 \
    --device="cuda:0" \
    --batch_size=auto
```

Supported language suffixes: `es`, `ca`, `eu`, `gl`, `va`, `pt`.

### Validate custom tasks

```bash
lm-eval validate --tasks my_custom_task --include_path ./custom_tasks
```

## 📁 Task Registry

Tasks are defined in `tasks/tasks.csv`. Run `python tasks/generate.py` to regenerate:

- `tasks/backend.json` — Flat list of harness task names for the backend
- `tasks/dummy_results.json` — Template result JSON for new submissions
- `tasks/harness.yaml` — Harness benchmark group definition

## 🧩 Custom Task Format (YAML)

The modern harness uses YAML-based task configs. Example for a Spanish exam task:

```yaml
tag:
  - multiple_choice
task: ebas_matematicas
dataset_path: org/spanish-exam-dataset
dataset_name: null
output_type: multiple_choice
training_split: train
validation_split: validation
test_split: null
doc_to_text: "Pregunta de examen: {{question}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nRespuesta:"
doc_to_target: "{{answer}}"
doc_to_choice: ["A", "B", "C", "D"]
metric_list:
  - metric: acc
    aggregation: mean
    higher_is_better: true
metadata:
  version: 1.0
```

Place custom task YAMLs in a directory and pass `--include_path ./my_tasks` to `lm-eval run`.

## 📎 Annex — Potential Datasets for Spanish Cultural & Linguistic Evaluation

> This annex documents datasets, benchmarks, and data sources that could **add value** to La Leaderboard by capturing the cultural, linguistic, and idiosyncratic richness of Spain and Latin America. Not all are publicly available or digitized yet; this is a living roadmap for community contributions.

---

### 🇪🇸 1. Spanish Government & Competitive Exams (Digitization Required)

Following the successful model of **LLMzSzŁ** (Polish national exams) and **KMMLU** (Korean CSAT/bar exams), Spain has a rich ecosystem of standardized competitive exams that are ideal for LLM evaluation:

| Exam | Description | Source / How to Obtain | Status | Estimated Size |
|------|-------------|----------------------|--------|----------------|
| **EBAU / Selectividad / PAU / EvAU / PCE** | University entrance exams by autonomous community. Subjects: Maths, Physics, Chemistry, History, Spanish Language, English, Philosophy. | Each CCAA publishes PDFs on their education portals (e.g., `educacion.jccm.es`, `ensenyament.gencat.cat`, `juntadeandalucia.es/educacion`). Also aggregated on `examenesdepau.com`, `muchosexamenes.com`. | ⚠️ Needs PDF digitization (PyPDF + regex pipeline) | ~5,000–10,000 questions/year across 17 CCAA × 10 years |
| **Oposiciones** (Civil Service) | Competitive exams for public administration positions: administrative, teaching, nursing, firefighter, police. Extremely diverse subject matter. | `boe.es` (convocatorias), `aprende.gob.es`, community portals. Some already digitized on preparation platforms. | ⚠️ Fragmented; needs community scraping + OCR for image-based PDFs | Highly variable; potentially 20,000+ questions |
| **MIR** (Medical Residency) | Already partially digitized by `casimedicos.com` and on HF as `HiTZ/casimedicos-exp`. | `casimedicos.com`, `mirial.es`, Ministry of Health. | ✅ Partially available; needs 2023–2025 updates | ~3,000–5,000 questions/year |
| **EIR** (Nursing Residency) | Similar structure to MIR, for nursing. | `casimedicos.com` (nursing section), Ministry of Health. | ⚠️ Needs digitization | ~2,000 questions/year |
| **QIR** (Chemistry), **FIR** (Pharmacy), **PIR** (Psychology) | Specialized health-sciences residency exams. | Same sources as MIR. | ⚠️ Needs digitization | ~1,000–2,000 each/year |
| **PEvAU / EBAU Andalucía** | Specific Andalusian university entrance exam with public PDF archive. | `juntadeandalucia.es/educacion` → "Exámenes de acceso a la universidad" | ✅ Public PDFs available | ~500 questions/year |
| **Acceso Mayor 25/45** | University access for adults over 25/45. Questions test general knowledge + specific subjects. | CCAA education portals. | ⚠️ Needs digitization | ~1,000 questions/year |
| **FP Grado Superior** | Vocational training access exams. Technical + general knowledge. | CCAA education portals, `todofp.es`. | ⚠️ Needs digitization | ~2,000 questions/year |

**Digitization pipeline** (based on LLMzSzŁ methodology):

1. Bulk-download PDFs by year/subject from official portals
2. Extract text layer with `PyPDF2`/`pdfplumber`
3. Regex-parse: question number → question text → options A/B/C/D → correct answer
4. Handle image-based PDFs with OCR (`Tesseract`, `easyOCR`, `pymupdf`)
5. Manual review of answer keys (often in separate PDFs)
6. Output: `{"question": "...", "choices": [...], "answer": 0, "subject": "Historia", "year": 2023, "ccaa": "Andalucía"}`

**Important**: To avoid data contamination, test splits should be held back and NOT published until after models are evaluated, following the KMMLU private-test-set model.

---

### 🏛️ 2. Legal Datasets (Spain-Specific)

| Dataset | Description | Source | Status | Format |
|---------|-------------|--------|--------|--------|
| **BOE Legal Corpus** | Full text of Spanish Official State Gazette. Massive legal text resource. | `boe.es` (daily bulletin), `Pepere45/spanish-boe-legal-corpus` on HF | ⚠️ Gated / needs request | Text / XML |
| **CENDOJ** (Centro de Documentación Judicial) | Spanish case law (jurisprudencia) from Supreme Court, National Court, etc. | `cej-mjusticia.es` — has API for XML download | ✅ API available | XML |
| **BOE-XSUM** | Extreme summarization of BOE documents into plain Spanish. Tests legal comprehension + simplification. | Paper: arXiv:2509.24908 | ⚠️ Needs re-creation or permission | Text pairs |
| **PlanTL-GOB-ES lm-legal-es** | Spanish legal-domain language model + corpus from BOE, CENDOJ. | GitHub: `PlanTL-GOB-ES/lm-legal-es` | ⚠️ Partially available | Text |
| **SpaLawEx** | Already in La Leaderboard! Spanish Law School Access Exams. | `LenguajeNaturalAI/examenes_abogacia` on HF | ✅ Available | MCQ |
| **European Court of Human Rights (Spanish)** | Case descriptions + violation/no-violation labels. | `icLR/echr_cases` or build from HUDOC | ⚠️ Needs filtering for Spanish | Text + label |
| **Spanish Constitutional Court rulings** | TC rulings with subject classification. | `tribunalconstitucional.es` | ⚠️ Needs scraping + annotation | Text |
| **EU Law (Spanish)** | EUR-Lex corpus in Spanish. EU directives, regulations, decisions. | `eur-lex.europa.eu` | ✅ Public; needs download pipeline | Text / XML |

**Suggested task formats**:

- **Legal MCQ**: Extract multiple-choice questions from bar exam prep books (e.g., *Civitas*, *Tecnos*). Copyright-restricted but publishers may collaborate.
- **Legal NLI**: Given a BOE article + a citizen query, determine if the article entails/contradicts/is neutral to the query.
- **Legal summarization**: Simplify a BOE resolution into layperson Spanish.
- **Legal entailment**: Given a CENDOJ ruling summary, determine which legal principle applies.

---

### 🏥 3. Medical & Health Datasets

| Dataset | Description | Source | Status |
|---------|-------------|--------|--------|
| **CasiMedicos / HEAD-QA v2** | Already in La Leaderboard! MIR medical exam with explanations. | `HiTZ/casimedicos-exp` on HF | ✅ Available, multilingual (es/en/fr/it) |
| **MedExpQA** (ES) | Medical explanatory QA with reasoning chains. | Mentioned in La Leaderboard paper Table 2; check with authors | ⚠️ Needs confirmation |
| **Spanish clinical notes (anonymized)** | Real clinical notes for NER/entity linking. | PlanTL-GOB-ES, BSC projects | ⚠️ Privacy-restricted |
| **Spanish medical Wikipedia** | Medical articles from Spanish Wikipedia. | `es.wikipedia.org` (CC BY-SA) | ✅ Public; needs QA generation |
| **Medicina en español (MEDLINE)** | Spanish-language biomedical abstracts. | PubMed/MEDLINE Spanish filter | ✅ Public |
| **Farmacovigilancia (AEMPS)** | Spanish drug safety reports. | `aemps.gob.es` | ⚠️ Needs request + anonymization |

**Suggested medical tasks**:

- **Differential diagnosis**: Given symptoms in Spanish, select most likely diagnosis (from MIR case descriptions)
- **Drug interaction**: Given two medications, determine if there's a known interaction (from AEMPS data)
- **Patient-doctor dialogue**: Evaluate if model can produce culturally appropriate Spanish medical explanations

---

### 🎭 4. Cultural Knowledge & Idiosyncrasy Datasets (High Priority Gap)

This is the **biggest gap** in current benchmarks. No existing leaderboard adequately tests LLM knowledge of Spanish culture, history, traditions, gastronomy, sports, and regional idiosyncrasies. The proposal below is inspired by **CulturalBench** (16 countries, but Spain not included) and **BertaQA** (Basque cultural trivia).

#### 4.1 Spain-Specific Cultural Knowledge Benchmark (Proposed: *IberiaCult*)

**Methodology** (based on CulturalBench / BLEnD):

1. Seed queries on Spanish cultural topics
2. Human-AI collaborative generation: native Spaniards write questions, LLMs generate distractors, humans validate
3. 5-annotator majority vote for correct answers
4. MinHash deduplication against major pretraining corpora

| Topic Domain | Example Questions | Source of Inspiration |
|--------------|-------------------|----------------------|
| **History** | "¿Quién fue el primer presidente de la democracia española?" / "¿En qué año se produjo el 23-F?" | Spanish history curriculum, Wikipedia |
| **Geography** | "¿Cuál es la capital de La Rioja?" / "¿Qué río separa España de Portugal?" | School geography, IGN maps |
| **Traditions & Fiestas** | "¿En qué ciudad se celebran las Fallas?" / "¿Qué se arroja en la Tomatina?" | Official fiesta calendars, UNESCO intangible heritage |
| **Gastronomy** | "¿Cuál es el ingrediente principal del gazpacho andaluz?" / "¿De qué zona es el vino albariño?" | Denominaciones de Origen (MAPA), Spanish cookbooks |
| **Sports** | "¿Cuántas Ligas ha ganado el Real Madrid?" / "¿Quién ganó la medalla de oro en 100m lisos en Barcelona 92?" | RFEF, COE, press archives |
| **Politics & Constitution** | "¿Cuántos artículos tiene la Constitución española de 1978?" / "¿Qué comunidades autónomas tienen cooficialidad lingüística?" | Congreso.es, Constitutional text |
| **Art & Literature** | "¿Quién pintó 'Las Meninas'?" / "¿En qué ciudad nació Federico García Lorca?" | Museo del Prado, RAE, school curriculum |
| **Music & Cinema** | "¿Qué grupo español cantó 'La ciudad de los gatos'?" / "¿Quién dirigió 'El espíritu de la colmena'?" | RTVE archives, SGAE |
| **Daily Life & Idioms** | "¿Qué significa la expresión 'tener morro'?" / "¿A qué hora se suele cenar en España?" | RAE DLE, colloquial corpora, Reddit r/Spain |
| **Regional Knowledge** | "¿En qué provincia se encuentra el Parque Nacional de Ordesa?" / "¿Qué idioma se habla en el Val d'Aran?" | CCAA tourism + education portals |

**Potential data sources**:

- **Trivial Pursuit España** editions (question cards — copyright)
- **Pasapalabra** (Spanish TV quiz show — transcripts available)
- **Saber y Ganar** (TVE cultural quiz — public broadcaster, may have archives)
- **Spanish school textbooks** (history, geography, philosophy — copyright)
- **Wikipedia es** (CC BY-SA — can generate QA with modern pipelines)
- **Reddit r/AskSpain, r/Spain** threads (CC — filter for factual questions)
- **RTVE archives** (public TV — cultural programs with transcripts)

---

### 🌎 5. Latin American Cultural & Linguistic Varieties

| Dataset / Source | Language/Variety | Description | How to Obtain |
|------------------|-----------------|-------------|---------------|
| **PAES Chile** | es-CL | Chilean university entrance exam (Prueba de Acceso a la Educación Superior). Similar to EBAU. | `demre.cl` publishes PDFs with answer keys. |
| **ENEM Brasil** (Portuguese) | pt-BR | Brazilian national high school exam. Portuguese baseline for LATAM. | `enem.inep.gov.br` — microdata available. |
| **Mexican EXANI** | es-MX | Mexican university entrance exam (CENEVAL). | `ceneval.edu.mx` — some materials public. |
| **Colombian ICFES / Saber 11** | es-CO | Colombian standardized tests. | `icfes.gov.co` — public datasets. |
| **Argentine CBC / Ingreso** | es-AR | Argentine university entrance exams. | Each university (UBA, UNT, etc.) publishes exams. |
| **Voces Originarias** | Indigenous | Indigenous voices and languages of LATAM. | Mentioned in La Leaderboard paper; check with authors. |
| **AmericasNLP** | Indigenous | NLI, MT for Aymara, Guaraní, Quechua, Náhuatl, etc. | `americasnlp.org` + HF Hub |
| **Meta4XNLI** | Indigenous | Cross-lingual NLI for indigenous languages. | Paper: check authors for data access |
| **ASALE / RAE DLE** | es | Royal Spanish Academy Dictionary — authority on Spanish usage. | `rae.es` — API for definitions (for linguistic acceptability tasks) |
| **CREA / Corpus de Referencia del Español Actual** | es | RAE reference corpus of modern Spanish usage. | `rae.es` — may require academic agreement |
| **CORLEC / Corpus Oral de Lenguaje de Especialidad** | es | Spanish specialized language corpus. | UAM / UCM — academic access |

---

### 🗣️ 6. Regional & Minority Languages of Spain (Beyond Co-Official)

The current La Leaderboard covers Spanish, Catalan, Basque, and Galician.
Spain has additional recognized languages that lack LLM benchmarks:

| Language | Status | Existing Resources | What's Needed |
|----------|--------|-------------------|---------------|
| **Aragonese** | Recognized (Aragón) | FLORES+ (WMT 2024): `oldi.org/cards/flores/arg_Latn` | Reading comprehension, QA, NLI benchmark |
| **Asturian / Bable** | Recognized (Asturias) | FLORES+: `oldi.org/cards/flores/ast_Latn`, PILAR project | Reading comprehension, QA |
| **Leonese** | Recognized (Castilla y León) | Very limited | Needs corpus creation from oral history projects |
| **Extremaduran** | Recognized (Extremadura) | Very limited | Needs community collection |
| **Fala** | Recognized (Extremadura) | Very limited | Needs community collection |
| **Occitan / Aranese** | Co-official in Val d'Aran (Catalonia) | FLORES+, some Aranese school materials | Reading comprehension, basic QA |
| **Caló** | Historical Romani variety in Spain | Very limited | Ethical considerations; work with Roma communities |

**Suggested approach**: Partner with regional language academies (Academia de la Llingua Asturiana, Academia Aragonesa de la Lengua) to digitize school textbooks and create reading comprehension + QA benchmarks.

---

### 📚 7. Existing Spanish NLP Benchmarks & Shared Tasks

Many evaluation datasets have been produced by Spanish NLP shared tasks but are not yet integrated into LLM leaderboards:

| Benchmark / Shared Task | Organizer | Task | Years | Data Access |
|------------------------|-----------|------|-------|-------------|
| **TASS** (Sentiment Analysis in Spanish) | SEPLN | Sentiment, emotion, irony detection | 2012–present | `tass.sepln.org` — download page |
| **IberLEF** | SEPLN / Iberian NLP community | Multiple tasks: NER, sentiment, fake news, hate speech, etc. | 2019–present | `ceur-ws.org` proceedings + Zenodo |
| **Hackathon del ML4ALL** | Various | Spanish-specific ML challenges | Ongoing | GitHub repos |
| **PlanTL-GOB-ES** | Spanish government NLP initiative | STS-es, SQAC, EsExams, etc. | 2020–present | `huggingface.co/PlanTL-GOB-ES` (some gated) |
| **HiTZ evaluation suite** | HiTZ (Basque) | EusExams, EusProficiency, BertaQA, etc. | 2022–present | `huggingface.co/HiTZ` |
| **AINA / BSC** | BSC (Catalan) | CatalanQA, caBreu, CoQCat, etc. | 2023–present | `huggingface.co/projecte-aina` |
| **ILENIA / USC** | USC (Galician) | GalCoLA, summarization, paraphrases | 2023–present | `huggingface.co/proxectonos` |
| **MedExpQA** | Unknown (mentioned in paper) | Medical explanatory QA in Spanish | — | Contact authors |
| **Conan EUS** | HiTZ | Counter-narrative generation (Basque/Spanish) | — | Contact authors |
| **H4rmony Eval** | Unknown | Ethics evaluation in Spanish | — | Contact authors |
| **VeritasQA** | Unknown | Truthfulness QA in ES/CA/GL | — | Contact authors |

---

### 🏛️ 8. Institutional & Government Data Portals

These portals contain raw data that could be transformed into evaluation datasets:

| Portal | Institution | Content | How to Use |
|--------|-------------|---------|------------|
| **datos.gob.es** | Spanish open data portal | Datasets from all ministries | Filter for text-heavy datasets; generate QA |
| **transparencia.gob.es** | Government transparency | Reports, budgets, contracts | Summarization, NLI, document QA |
| **boe.es** | Boletín Oficial del Estado | All Spanish legislation | Legal comprehension, summarization |
| **senado.es / congreso.es** | Parliament | Debates, bills, questions | Political reasoning, summarization |
| **ine.es** | National Statistics Institute | Statistical reports, surveys | Data-to-text generation, QA |
| **aemet.es** | Meteorology agency | Weather forecasts, alerts | Domain-specific text generation |
| **eldiario.es / elpais.com / 20minutos.es** | Press (archives) | News articles (with paywalls) | Summarization, NLI, but copyright-restricted |
| **rtve.es / rtve.es/alacarta** | Public broadcaster | TV/radio transcripts, subtitles | Public service content; potential for speech-to-text eval |
| **bne.es** | National Library | Digitized books, newspapers | Historical Spanish text; OCR quality varies |

---

### 🎯 9. Recommended Priorities for Dataset Expansion

Based on the research above, we recommend the following priority order for expanding La Leaderboard's cultural and linguistic coverage:

| Priority | Dataset | Effort | Impact | Lead |
|----------|---------|--------|--------|------|
| 🔴 **P0** | **EBAU/Selectividad digitization** (all 17 CCAA, 2015–2025) | High (scraping + PDF parsing + manual review) | Very High — tests general knowledge + reasoning in Spanish | Community + NLP Spain |
| 🔴 **P0** | **Spain Cultural Knowledge Benchmark** (*IberiaCult*) | Medium (human annotation + AI generation) | Very High — unique cultural evaluation | Cultural institutions + NLP Spain |
| 🟡 **P1** | **Expand CasiMedicos** with 2023–2025 MIR/EIR/QIR/FIR/PIR | Low-Medium (update existing pipeline) | High — medical reasoning in Spanish | HiTZ / medical community |
| 🟡 **P1** | **Legal benchmark** from BOE + CENDOJ + bar exam prep | Medium | High — legal reasoning | Law faculties + NLP Spain |
| 🟡 **P1** | **PAES Chile + LATAM exams** | Medium | High — LATAM variety coverage | LATAM NLP communities |
| 🟢 **P2** | **Aragonese / Asturian / Occitan benchmarks** | Medium-High (corpus creation) | Medium — linguistic diversity | Regional language academies |
| 🟢 **P2** | **TASS / IberLEF task integration** | Low (data already exists) | Medium | SEPLN community |
| 🔵 **P3** | **Valencian-specific tasks** (distinct from Catalan) | Low-Medium | Medium — regional identity | AINA / Valencia NLP |
| 🔵 **P3** | **Portuguese (Brazil + Portugal) benchmarks** | Low-Medium | Medium — Lusophone coverage | Portugal / Brazil NLP |

---

### 🤝 How to Contribute a Dataset

1. **Check** if the dataset is already on Hugging Face Hub or listed above
2. **Prepare** the data in one of these formats:
   - **MCQ**: `{"question": "...", "choices": ["A", "B", "C", "D"], "answer": 0, ...}`
   - **QA**: `{"question": "...", "context": "...", "answers": ["..."], ...}`
   - **NLI**: `{"premise": "...", "hypothesis": "...", "label": "entailment|neutral|contradiction"}`
   - **Summarization**: `{"article": "...", "summary": "..."}`
3. **Create** a Hugging Face Dataset repo with a Dataset Card
4. **Write** a YAML task definition for `lm-eval-harness` (see examples above)
5. **Open** a discussion on [La Leaderboard v2 Community](https://huggingface.co/spaces/pauvanbr/la-leaderboard-v2/discussions) or submit a PR

---

### 📖 References & Further Reading

| Paper / Resource | Relevance |
|-----------------|-----------|
| *La Leaderboard* (ACL 2025, arXiv:2507.00999) | Original leaderboard methodology |
| *IberBench* (arXiv:2504.16921) | Complementary Iberian benchmark with 101 datasets |
| *LLMzSzŁ* (arXiv:2501.02266) | Polish national exam digitization pipeline — model for EBAU |
| *KMMLU* / *Open Ko-LLM* (arXiv:2402.11548, 2410.12445) | Korean exam benchmarks + private test sets |
| *CulturalBench* (arXiv:2410.02677) | Methodology for cultural knowledge benchmarks |
| *BLEnD* (arXiv:2406.09948) | Multicultural QA across 16 countries |
| *BertaQA* (arXiv:2406.07302) | Basque cultural trivia — model for regional cultural QA |
| *EXAMS* (EMNLP 2020, arXiv:2011.03080) | Multilingual high school exams (includes 235 Spanish questions) |
| *CasiMedicos* (arXiv:2503.00025) | Spanish MIR evaluation |
| *PARIKSHA* (arXiv:2406.15053) | Human-LLM evaluator agreement on multicultural data |

---

*This README and annex are living documents. If you know of a dataset, exam, or benchmark source not listed here, please open a discussion or PR!* 💛
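
For contributors preparing data in the MCQ format listed under "How to Contribute a Dataset", a quick sanity check before uploading can catch malformed records. The sketch below is illustrative only and is not part of this repository: the function name `validate_mcq_record` is hypothetical, and it assumes `answer` is a zero-based index into `choices`, as the `"answer": 0` example suggests.

```python
from typing import Any

# Keys taken from the MCQ format in the contribution guide
REQUIRED_MCQ_KEYS = ("question", "choices", "answer")


def validate_mcq_record(record: dict[str, Any]) -> list[str]:
    """Return a list of problems found in one MCQ record (empty if none).

    Hypothetical helper; assumes `answer` is a zero-based index into `choices`.
    """
    problems = [f"missing key: {key}" for key in REQUIRED_MCQ_KEYS if key not in record]
    if problems:
        return problems

    question, choices, answer = record["question"], record["choices"], record["answer"]
    if not isinstance(question, str) or not question.strip():
        problems.append("question must be a non-empty string")

    choices_ok = isinstance(choices, list) and len(choices) >= 2
    if not choices_ok:
        problems.append("choices must be a list with at least two options")
    elif not all(isinstance(c, str) and c.strip() for c in choices):
        problems.append("every choice must be a non-empty string")

    # bool is a subclass of int in Python, so reject it explicitly
    if isinstance(answer, bool) or not isinstance(answer, int):
        problems.append("answer must be an integer index")
    elif choices_ok and not 0 <= answer < len(choices):
        problems.append("answer index is out of range")
    return problems


example = {
    "question": "¿Quién pintó 'Las Meninas'?",
    "choices": ["Goya", "Velázquez", "Picasso", "Dalí"],
    "answer": 1,
}
print(validate_mcq_record(example))
```

Returning a list of messages instead of raising on the first error lets a contributor see every issue in a record at once; extra fields such as `subject` or `year` are deliberately ignored.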