Spaces: pauvanbr/la-leaderboard-v2

pauvanbr committed on May 5

Commit 80603f8 (verified) · 1 parent: 4b8d216

Upload README.md

Files changed (1)

README.md +336 -10

@@ -1,13 +1,339 @@
 ---
-title: La Leaderboard V2
-emoji: 😻
-colorFrom: indigo
-colorTo: red
-sdk: gradio
-sdk_version: 6.14.0
-python_version: '3.13'
-app_file: app.py
-pinned: false
 ---
-Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference

+# La Leaderboard v2 — LLM Leaderboard for Ibero-American Languages
+[![Paper](https://img.shields.io/badge/ACL%202025-Paper-blue)](https://aclanthology.org/2025.acl-long.1561/)
+[![Space](https://img.shields.io/badge/🤗%20Space-La%20Leaderboard-orange)](https://huggingface.co/spaces/pauvanbr/la-leaderboard-v2)
+[![License](https://img.shields.io/badge/License-Apache%202.0-green.svg)](https://opensource.org/licenses/Apache-2.0)
+**La Leaderboard** is the first open-source leaderboard to evaluate generative LLMs in languages and language varieties of Spain and Latin America. This is the **v2** update with an upgraded evaluation framework, expanded language support, and an open roadmap for culturally relevant datasets.
+## 🚀 What's New in v2
+- **Modern lm-evaluation-harness**: Upgraded from a legacy fork to `lm-eval>=0.4.11` with YAML-based task definitions and the modern CLI (`lm-eval run`)
+- **Expanded language support**: Added Valencian (VA) and Portuguese (PT) language tabs
+- **Updated dependencies**: Transformers 4.51, Gradio 5.25, Datasets 3.5, Python 3.11+
+- **Extended precision support**: 8bit, 4bit, GPTQ
+- **Open dataset roadmap**: New README annex documenting potential culturally relevant datasets for Spain and LATAM
+## 📊 Architecture
+```
+la-leaderboard-v2/                     # Gradio Space (frontend)
+la-leaderboard-v2-requests/          # HF Dataset (evaluation queue)
+la-leaderboard-v2-results/           # HF Dataset (evaluation results)
+```
+### Frontend (this repo)
+- `app.py` — Gradio UI with tabs: Summary, ES, CA, EU, GL, VA, PT, Time/CO2, Info, Tasks, Submit
+- `src/` — Data processing, leaderboard rendering, submission handling
+- `tasks/` — Task registry (CSV + generated JSON/YAML for harness)
+### Backend (separate Space recommended)
+- Polls `requests` dataset for PENDING evaluations
+- Runs `lm-eval run` with `--include_path` pointing to custom task YAMLs
+- Pushes results to `results` dataset
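The backend loop above can be sketched as a small helper that turns one queued request into the `lm-eval run` invocation shown later in this README. This is a minimal sketch, not the Space's actual backend: the request fields (`status`, `model`, `revision`, `precision`), the default values, and the function name are hypothetical. Only the CLI flags are taken from this README.

```python
from typing import Optional


def build_eval_command(request: dict, include_path: str, output_path: str) -> Optional[list]:
    """Return the argv list for one PENDING request, or None if it should be skipped.

    The request schema used here is an assumption made for illustration.
    """
    if request.get("status") != "PENDING":
        return None
    model_args = ",".join([
        f"pretrained={request['model']}",
        f"revision={request.get('revision', 'main')}",    # assumed default
        f"dtype={request.get('precision', 'bfloat16')}",  # assumed default
    ])
    return [
        "lm-eval", "run",
        "--model", "hf",
        "--model_args", model_args,
        "--tasks", "laleaderboard",
        "--include_path", include_path,
        "--output_path", output_path,
    ]
```

The returned list can be handed to `subprocess.run(cmd, check=True)`. Building a list rather than a shell string avoids quoting problems with model names and paths.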
+## 🛠️ Reproducibility
+### Install modern lm-eval-harness
+```bash
+pip install "lm-eval>=0.4.11"
+```
+### Run full leaderboard evaluation
+```bash
+lm-eval run --model hf \
+  --model_args "pretrained=<your_model>,revision=<rev>,dtype=<dtype>" \
+  --tasks=laleaderboard \
+  --num_fewshot=5 \
+  --device="cuda:0" \
+  --batch_size=auto \
+  --output_path=<output_path>
+```
+### Run single-language evaluation
+```bash
+lm-eval run --model hf \
+  --model_args "pretrained=<your_model>,revision=<rev>,dtype=<dtype>" \
+  --tasks=laleaderboard_es \
+  --num_fewshot=5 \
+  --device="cuda:0" \
+  --batch_size=auto
+```
+Supported language suffixes: `es`, `ca`, `eu`, `gl`, `va`, `pt`.
+### Validate custom tasks
+```bash
+lm-eval validate --tasks my_custom_task --include_path ./custom_tasks
+```
+## 📁 Task Registry
+Tasks are defined in `tasks/tasks.csv`. Run `python tasks/generate.py` to regenerate:
+- `tasks/backend.json` — Flat list of harness task names for the backend
+- `tasks/dummy_results.json` — Template result JSON for new submissions
+- `tasks/harness.yaml` — Harness benchmark group definition
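To illustrate the kind of transformation involved, here is a minimal sketch of deriving a flat task list from a CSV registry. It is not the actual `tasks/generate.py`: the column name `harness_task`, the second sample task name, and the helper name are hypothetical, since this README does not document the CSV schema.

```python
import csv
import io
import json


def flat_task_names(csv_text: str, column: str = "harness_task") -> list:
    """Collect unique, non-empty task names from one CSV column, preserving order."""
    names = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        name = (row.get(column) or "").strip()
        if name and name not in names:
            names.append(name)
    return names


# Hypothetical registry contents; the real tasks.csv may use different columns.
sample_csv = (
    "harness_task,language\n"
    "ebas_matematicas,es\n"
    "ebas_matematicas,es\n"
    "example_task_ca,ca\n"
)
backend_json = json.dumps(flat_task_names(sample_csv), indent=2)
```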
+## 🧩 Custom Task Format (YAML)
+The modern harness uses YAML-based task configs. Example for a Spanish exam task:
+```yaml
+tag:
+  - multiple_choice
+task: ebas_matematicas
+dataset_path: org/spanish-exam-dataset
+dataset_name: null
+output_type: multiple_choice
+training_split: train
+validation_split: validation
+test_split: null
+doc_to_text: "Pregunta de examen: {{question}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nRespuesta:"
+doc_to_target: "{{answer}}"
+doc_to_choice: ["A", "B", "C", "D"]
+metric_list:
+  - metric: acc
+    aggregation: mean
+    higher_is_better: true
+metadata:
+  version: 1.0
+```
+Place custom task YAMLs in a directory and pass `--include_path ./my_tasks` to `lm-eval run`.
+## 📎 Annex — Potential Datasets for Spanish Cultural & Linguistic Evaluation
+> This annex documents datasets, benchmarks, and data sources that could **add value** to La Leaderboard by capturing the cultural, linguistic, and idiosyncratic richness of Spain and Latin America. Not all are publicly available or digitized yet; this is a living roadmap for community contributions.
+---
+### 🇪🇸 1. Spanish Government & Competitive Exams (Digitization Required)
+Following the successful model of **LLMzSzŁ** (Polish national exams) and **KMMLU** (Korean CSAT/bar exams), Spain has a rich ecosystem of standardized competitive exams that are ideal for LLM evaluation:
+| Exam | Description | Source / How to Obtain | Status | Estimated Size |
+|------|-------------|----------------------|--------|----------------|
+| **EBAU / Selectividad / PAU / EvAU / PCE** | University entrance exams by autonomous community. Subjects: Maths, Physics, Chemistry, History, Spanish Language, English, Philosophy. | Each CCAA publishes PDFs on their education portals (e.g., `educacion.jccm.es`, `ensenyament.gencat.cat`, `juntadeandalucia.es/educacion`). Also aggregated on `examenesdepau.com`, `muchosexamenes.com`. | ⚠️ Needs PDF digitization (PyPDF + regex pipeline) | ~5,000–10,000 questions/year across 17 CCAA × 10 years |
+| **Oposiciones** (Civil Service) | Competitive exams for public administration positions: administrative, teaching, nursing, firefighter, police. Extremely diverse subject matter. | `boe.es` (convocatorias), `aprende.gob.es`, community portals. Some already digitized on preparation platforms. | ⚠️ Fragmented; needs community scraping + OCR for image-based PDFs | Highly variable; potentially 20,000+ questions |
+| **MIR** (Medical Residency) | Already partially digitized by `casimedicos.com` and on HF as `HiTZ/casimedicos-exp`. | `casimedicos.com`, `mirial.es`, Ministry of Health. | ✅ Partially available; needs 2023–2025 updates | ~3,000–5,000 questions/year |
+| **EIR** (Nursing Residency) | Similar structure to MIR, for nursing. | `casimedicos.com` (nursing section), Ministry of Health. | ⚠️ Needs digitization | ~2,000 questions/year |
+| **QIR** (Chemistry), **FIR** (Pharmacy), **PIR** (Psychology) | Specialized health-sciences residency exams. | Same sources as MIR. | ⚠️ Needs digitization | ~1,000–2,000 each/year |
+| **PEvAU / EBAU Andalucía** | Specific Andalusian university entrance exam with public PDF archive. | `juntadeandalucia.es/educacion` → "Exámenes de acceso a la universidad" | ✅ Public PDFs available | ~500 questions/year |
+| **Acceso Mayor 25/45** | University access for adults over 25/45. Questions test general knowledge + specific subjects. | CCAA education portals. | ⚠️ Needs digitization | ~1,000 questions/year |
+| **FP Grado Superior** | Vocational training access exams. Technical + general knowledge. | CCAA education portals, `todofp.es`. | ⚠️ Needs digitization | ~2,000 questions/year |
+**Digitization pipeline** (based on LLMzSzŁ methodology):
+1. Bulk-download PDFs by year/subject from official portals
+2. Extract text layer with `pypdf` (the maintained successor to `PyPDF2`) or `pdfplumber`
+3. Regex-parse: question number → question text → options A/B/C/D → correct answer
+4. Handle image-based PDFs with OCR (`Tesseract`, `easyOCR`, `pymupdf`)
+5. Manual review of answer keys (often in separate PDFs)
+6. Output: `{"question": "...", "choices": [...], "answer": 0, "subject": "Historia", "year": 2023, "ccaa": "Andalucía"}`
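Steps 3 and 6 can be sketched with the standard library alone. This is an illustrative parser under simplifying assumptions, not the LLMzSzŁ pipeline: it assumes questions are numbered `1.` or `1)`, options are lettered `A.` to `D.`, and the answer key has already been transcribed into a dict. The function name and the metadata handling are hypothetical.

```python
import re

QUESTION_RE = re.compile(r"^\s*(\d+)[.)]\s+(.*)$")
OPTION_RE = re.compile(r"^\s*([A-D])[.)]\s+(.*)$")
LETTERS = ["A", "B", "C", "D"]


def parse_mcq(text, answer_key, **metadata):
    """Parse extracted exam text into MCQ records.

    answer_key maps question number -> correct letter, e.g. {1: "B"}.
    Items without exactly four options or a valid key entry are skipped,
    so they can be routed to manual review (step 5).
    """
    parsed, current = [], None
    for line in text.splitlines():
        question = QUESTION_RE.match(line)
        option = OPTION_RE.match(line)
        if question:
            current = {"number": int(question.group(1)),
                       "question": question.group(2).strip(),
                       "choices": []}
            parsed.append(current)
        elif option and current is not None:
            current["choices"].append(option.group(2).strip())
        elif current is not None and line.strip():
            # Wrapped line: extend the last option, or the question itself.
            if current["choices"]:
                current["choices"][-1] += " " + line.strip()
            else:
                current["question"] += " " + line.strip()
    records = []
    for item in parsed:
        letter = answer_key.get(item["number"])
        if len(item["choices"]) != 4 or letter not in LETTERS:
            continue
        records.append({"question": item["question"],
                        "choices": item["choices"],
                        "answer": LETTERS.index(letter),
                        **metadata})
    return records


sample_text = """1. ¿Quién pintó 'Las Meninas'?
A. Goya
B. Velázquez
C. El Greco
D. Murillo
2. Pregunta incompleta
A. Solo una opción
"""
records = parse_mcq(sample_text, {1: "B", 2: "A"},
                    subject="Arte", year=2023, ccaa="Andalucía")
```

A known limitation of this sketch: a wrapped line that happens to start with a number and a period (for example a year) would be misread as a new question, which is one reason the manual review step matters.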
+**Important**: To avoid data contamination, test splits should be held back and NOT published until after models are evaluated, following the KMMLU private-test-set model.
+---
+### 🏛️ 2. Legal Datasets (Spain-Specific)
+| Dataset | Description | Source | Status | Format |
+|---------|-------------|--------|--------|--------|
+| **BOE Legal Corpus** | Full text of Spanish Official State Gazette. Massive legal text resource. | `boe.es` (daily bulletin), `Pepere45/spanish-boe-legal-corpus` on HF | ⚠️ Gated / needs request | Text / XML |
+| **CENDOJ** (Centro de Documentación Judicial) | Spanish case law (jurisprudencia) from Supreme Court, National Court, etc. | `cej-mjusticia.es` — has API for XML download | ✅ API available | XML |
+| **BOE-XSUM** | Extreme summarization of BOE documents into plain Spanish. Tests legal comprehension + simplification. | Paper: arXiv:2509.24908 | ⚠️ Needs re-creation or permission | Text pairs |
+| **PlanTL-GOB-ES lm-legal-es** | Spanish legal-domain language model + corpus from BOE, CENDOJ. | GitHub: `PlanTL-GOB-ES/lm-legal-es` | ⚠️ Partially available | Text |
+| **SpaLawEx** | Already in La Leaderboard! Spanish Law School Access Exams. | `LenguajeNaturalAI/examenes_abogacia` on HF | ✅ Available | MCQ |
+| **European Court of Human Rights (Spanish)** | Case descriptions + violation/no-violation labels. | `icLR/echr_cases` or build from HUDOC | ⚠️ Needs filtering for Spanish | Text + label |
+| **Spanish Constitutional Court rulings** | TC rulings with subject classification. | `tribunalconstitucional.es` | ⚠️ Needs scraping + annotation | Text |
+| **EU Law (Spanish)** | EUR-Lex corpus in Spanish. EU directives, regulations, decisions. | `eur-lex.europa.eu` | ✅ Public; needs download pipeline | Text / XML |
+**Suggested task formats**:
+- **Legal MCQ**: Extract multiple-choice questions from bar exam prep books (e.g., *Civitas*, *Tecnos*). Copyright-restricted but publishers may collaborate.
+- **Legal NLI**: Given a BOE article + a citizen query, determine if the article entails/contradicts/is neutral to the query.
+- **Legal summarization**: Simplify a BOE resolution into layperson Spanish.
+- **Legal entailment**: Given a CENDOJ ruling summary, determine which legal principle applies.
+---
+### 🏥 3. Medical & Health Datasets
+| Dataset | Description | Source | Status |
+|---------|-------------|--------|--------|
+| **CasiMedicos / HEAD-QA v2** | Already in La Leaderboard! MIR medical exam with explanations. | `HiTZ/casimedicos-exp` on HF | ✅ Available, multilingual (es/en/fr/it) |
+| **MedExpQA** (ES) | Medical explanatory QA with reasoning chains. | Mentioned in La Leaderboard paper Table 2; check with authors | ⚠️ Needs confirmation |
+| **Spanish clinical notes (anonymized)** | Real clinical notes for NER/entity linking. | PlanTL-GOB-ES, BSC projects | ⚠️ Privacy-restricted |
+| **Spanish medical Wikipedia** | Medical articles from Spanish Wikipedia. | `es.wikipedia.org` (CC BY-SA) | ✅ Public; needs QA generation |
+| **Medicina en español (MEDLINE)** | Spanish-language biomedical abstracts. | PubMed/MEDLINE Spanish filter | ✅ Public |
+| **Farmacovigilancia (AEMPS)** | Spanish drug safety reports. | `aemps.gob.es` | ⚠️ Needs request + anonymization |
+**Suggested medical tasks**:
+- **Differential diagnosis**: Given symptoms in Spanish, select most likely diagnosis (from MIR case descriptions)
+- **Drug interaction**: Given two medications, determine if there's a known interaction (from AEMPS data)
+- **Patient-doctor dialogue**: Evaluate if model can produce culturally appropriate Spanish medical explanations
+---
+### 🎭 4. Cultural Knowledge & Idiosyncrasy Datasets (High Priority Gap)
+This is the **biggest gap** in current benchmarks. No existing leaderboard adequately tests LLM knowledge of Spanish culture, history, traditions, gastronomy, sports, and regional idiosyncrasies. Inspired by **CulturalBench** (16 countries, but Spain not included) and **BertaQA** (Basque cultural trivia):
+#### 4.1 Spain-Specific Cultural Knowledge Benchmark (Proposed: *IberiaCult*)
+**Methodology** (based on CulturalBench / BLEnD):
+1. Seed queries on Spanish cultural topics
+2. Human-AI collaborative generation: native Spaniards write questions, LLMs generate distractors, humans validate
+3. 5-annotator majority vote for correct answers
+4. MinHash deduplication against major pretraining corpora
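Step 4 can be illustrated with a plain MinHash over word 3-grams, written with the standard library only. This is a sketch of the general technique, not a specification of the proposed benchmark: the shingle size, the number of hash functions, the 0.8 threshold, and all function names are assumptions.

```python
import hashlib


def word_shingles(text, n=3):
    """Return the set of lowercase word n-grams in text."""
    words = text.lower().split()
    if len(words) <= n:
        return {" ".join(words)}
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def minhash_signature(text, num_hashes=64, n=3):
    """For each salted hash function, keep the minimum hash over all shingles."""
    grams = word_shingles(text, n)
    signature = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(4, "big")  # a different salt per hash function
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(g.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "big",
            )
            for g in grams
        ))
    return signature


def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching positions, which estimates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def is_near_duplicate(candidate, corpus_text, threshold=0.8):
    """Flag a candidate question whose shingles largely overlap a corpus passage."""
    return estimated_jaccard(
        minhash_signature(candidate), minhash_signature(corpus_text)
    ) >= threshold
```

At corpus scale, signatures would normally be indexed with locality-sensitive hashing instead of compared pairwise. That part is omitted here.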
+| Topic Domain | Example Questions | Source of Inspiration |
+|--------------|-------------------|----------------------|
+| **History** | "¿Quién fue el primer presidente de la democracia española?" / "¿En qué año se produjo el 23-F?" | Spanish history curriculum, Wikipedia |
+| **Geography** | "¿Cuál es la capital de La Rioja?" / "¿Qué río separa España de Portugal?" | School geography, IGN maps |
+| **Traditions & Fiestas** | "¿En qué ciudad se celebran las Fallas?" / "¿Qué se arroja en la Tomatina?" | Official fiesta calendars, UNESCO intangible heritage |
+| **Gastronomy** | "¿Cuál es el ingrediente principal del gazpacho andaluz?" / "¿De qué zona es el vino albariño?" | Denominaciones de Origen (MAPA), Spanish cookbooks |
+| **Sports** | "¿Cuántas Ligas ha ganado el Real Madrid?" / "¿Quién ganó la medalla de oro en 100m lisos en Barcelona 92?" | RFEF, COE, press archives |
+| **Politics & Constitution** | "¿Cuántos artículos tiene la Constitución española de 1978?" / "¿Qué comunidades autónomas tienen cooficialidad lingüística?" | Congreso.es, Constitutional text |
+| **Art & Literature** | "¿Quién pintó 'Las Meninas'?" / "¿En qué ciudad nació Federico García Lorca?" | Museo del Prado, RAE, school curriculum |
+| **Music & Cinema** | "¿Qué grupo español cantó 'La ciudad de los gatos'?" / "¿Quién dirigió 'El espíritu de la colmena'?" | RTVE archives, SGAE |
+| **Daily Life & Idioms** | "¿Qué significa la expresión 'tener morro'?" / "¿A qué hora se suele cenar en España?" | RAE DLE, colloquial corpora, Reddit r/Spain |
+| **Regional Knowledge** | "¿En qué provincia se encuentra el Parque Nacional de Ordesa?" / "¿Qué idioma se habla en el Val d'Aran?" | CCAA tourism + education portals |
+**Potential data sources**:
+- **Trivial Pursuit España** editions (question cards — copyright)
+- **Pasapalabra** (Spanish TV quiz show — transcripts available)
+- **Saber y Ganar** (TVE cultural quiz — public broadcaster, may have archives)
+- **Spanish school textbooks** (history, geography, philosophy — copyright)
+- **Wikipedia es** (CC BY-SA — can generate QA with modern pipelines)
+- **Reddit r/AskSpain, r/Spain** threads (CC — filter for factual questions)
+- **RTVE archives** (public TV — cultural programs with transcripts)
+---
+### 🌎 5. Latin American Cultural & Linguistic Varieties
+| Dataset / Source | Language/Variety | Description | How to Obtain |
+|------------------|-----------------|-------------|---------------|
+| **PAES Chile** | es-CL | Chilean university entrance exam (Prueba de Acceso a la Educación Superior). Similar to EBAU. | `demre.cl` publishes PDFs with answer keys. |
+| **ENEM Brasil** (Portuguese) | pt-BR | Brazilian national high school exam. Portuguese baseline for LATAM. | `enem.inep.gov.br` — microdata available. |
+| **Mexican EXANI** | es-MX | Mexican university entrance exam (CENEVAL). | `ceneval.edu.mx` — some materials public. |
+| **Colombian ICFES / Saber 11** | es-CO | Colombian standardized tests. | `icfes.gov.co` — public datasets. |
+| **Argentine CBC / Ingreso** | es-AR | Argentine university entrance exams. | Each university (UBA, UNT, etc.) publishes exams. |
+| **Voces Originarias** | Indigenous | Indigenous voices and languages of LATAM. | Mentioned in La Leaderboard paper; check with authors. |
+| **AmericasNLP** | Indigenous | NLI, MT for Aymara, Guaraní, Quechua, Náhuatl, etc. | `americasnlp.org` + HF Hub |
+| **Meta4XNLI** | es / en | Parallel Spanish-English NLI corpus annotated for metaphor detection and interpretation. | Paper: check authors for data access |
+| **ASALE / RAE DLE** | es | Royal Spanish Academy Dictionary — authority on Spanish usage. | `rae.es` — API for definitions (for linguistic acceptability tasks) |
+| **CREA / Corpus de Referencia del Español Actual** | es | RAE reference corpus of modern Spanish usage. | `rae.es` — may require academic agreement |
+| **CORLEC / Corpus Oral de Referencia de la Lengua Española Contemporánea** | es | Oral reference corpus of contemporary spoken Spanish. | UAM / UCM (academic access) |
+---
+### 🗣️ 6. Regional & Minority Languages of Spain (Beyond Co-Official)
+The current La Leaderboard covers Spanish, Catalan, Basque, and Galician. Spain has additional recognized languages that lack LLM benchmarks:
+| Language | Status | Existing Resources | What's Needed |
+|----------|--------|-------------------|---------------|
+| **Aragonese** | Recognized (Aragón) | FLORES+ (WMT 2024): `oldi.org/cards/flores/arg_Latn` | Reading comprehension, QA, NLI benchmark |
+| **Asturian / Bable** | Recognized (Asturias) | FLORES+: `oldi.org/cards/flores/ast_Latn`, PILAR project | Reading comprehension, QA |
+| **Leonese** | Recognized (Castilla y León) | Very limited | Needs corpus creation from oral history projects |
+| **Extremaduran** | Recognized (Extremadura) | Very limited | Needs community collection |
+| **Fala** | Recognized (Extremadura) | Very limited | Needs community collection |
+| **Occitan / Aranese** | Co-official in Val d'Aran (Catalonia) | FLORES+, some Aranese school materials | Reading comprehension, basic QA |
+| **Caló** | Historical Romani variety in Spain | Very limited | Ethical considerations; work with Roma communities |
+**Suggested approach**: Partner with regional language academies (Academia de la Llingua Asturiana, Academia Aragonesa de la Lengua) to digitize school textbooks and create reading comprehension + QA benchmarks.
+---
+### 📚 7. Existing Spanish NLP Benchmarks & Shared Tasks
+Many evaluation datasets have been produced by Spanish NLP shared tasks but are not yet integrated into LLM leaderboards:
+| Benchmark / Shared Task | Organizer | Task | Years | Data Access |
+|------------------------|-----------|------|-------|-------------|
+| **TASS** (Sentiment Analysis in Spanish) | SEPLN | Sentiment, emotion, irony detection | 2012–present | `tass.sepln.org` — download page |
+| **IberLEF** | SEPLN / Iberian NLP community | Multiple tasks: NER, sentiment, fake news, hate speech, etc. | 2019–present | `ceur-ws.org` proceedings + Zenodo |
+| **Hackathon del ML4ALL** | Various | Spanish-specific ML challenges | Ongoing | GitHub repos |
+| **PlanTL-GOB-ES** | Spanish government NLP initiative | STS-es, SQAC, EsExams, etc. | 2020–present | `huggingface.co/PlanTL-GOB-ES` (some gated) |
+| **HiTZ evaluation suite** | HiTZ (Basque) | EusExams, EusProficiency, BertaQA, etc. | 2022–present | `huggingface.co/HiTZ` |
+| **AINA / BSC** | BSC (Catalan) | CatalanQA, caBreu, CoQCat, etc. | 2023–present | `huggingface.co/projecte-aina` |
+| **ILENIA / USC** | USC (Galician) | GalCoLA, summarization, paraphrases | 2023–present | `huggingface.co/proxectonos` |
+| **MedExpQA** | Unknown (mentioned in paper) | Medical explanatory QA in Spanish | — | Contact authors |
+| **Conan EUS** | HiTZ | Counter-narrative generation (Basque/Spanish) | — | Contact authors |
+| **H4rmony Eval** | Unknown | Ethics evaluation in Spanish | — | Contact authors |
+| **VeritasQA** | Unknown | Truthfulness QA in ES/CA/GL | — | Contact authors |
 ---
+### 🏛️ 8. Institutional & Government Data Portals
+These portals contain raw data that could be transformed into evaluation datasets:
+| Portal | Institution | Content | How to Use |
+|--------|-------------|---------|------------|
+| **datos.gob.es** | Spanish open data portal | Datasets from all ministries | Filter for text-heavy datasets; generate QA |
+| **transparencia.gob.es** | Government transparency | Reports, budgets, contracts | Summarization, NLI, document QA |
+| **boe.es** | Boletín Oficial del Estado | All Spanish legislation | Legal comprehension, summarization |
+| **senado.es / congreso.es** | Parliament | Debates, bills, questions | Political reasoning, summarization |
+| **ine.es** | National Statistics Institute | Statistical reports, surveys | Data-to-text generation, QA |
+| **aemet.es** | Meteorology agency | Weather forecasts, alerts | Domain-specific text generation |
+| **eldiario.es / elpais.com / 20minutos.es** | Press (archives) | News articles (with paywalls) | Summarization, NLI, but copyright-restricted |
+| **rtve.es / rtve.es/alacarta** | Public broadcaster | TV/radio transcripts, subtitles | Public service content; potential for speech-to-text eval |
+| **bne.es** | National Library | Digitized books, newspapers | Historical Spanish text; OCR quality varies |
+---
+### 🎯 9. Recommended Priorities for Dataset Expansion
+Based on the research above, we recommend the following priority order for expanding La Leaderboard's cultural and linguistic coverage:
+| Priority | Dataset | Effort | Impact | Lead |
+|----------|---------|--------|--------|------|
+| 🔴 **P0** | **EBAU/Selectividad digitization** (all 17 CCAA, 2015–2025) | High (scraping + PDF parsing + manual review) | Very High — tests general knowledge + reasoning in Spanish | Community + NLP Spain |
+| 🔴 **P0** | **Spain Cultural Knowledge Benchmark** (*IberiaCult*) | Medium (human annotation + AI generation) | Very High — unique cultural evaluation | Cultural institutions + NLP Spain |
+| 🟡 **P1** | **Expand CasiMedicos** with 2023–2025 MIR/EIR/QIR/FIR/PIR | Low-Medium (update existing pipeline) | High — medical reasoning in Spanish | HiTZ / medical community |
+| 🟡 **P1** | **Legal benchmark** from BOE + CENDOJ + bar exam prep | Medium | High — legal reasoning | Law faculties + NLP Spain |
+| 🟡 **P1** | **PAES Chile + LATAM exams** | Medium | High — LATAM variety coverage | LATAM NLP communities |
+| 🟢 **P2** | **Aragonese / Asturian / Occitan benchmarks** | Medium-High (corpus creation) | Medium — linguistic diversity | Regional language academies |
+| 🟢 **P2** | **TASS / IberLEF task integration** | Low (data already exists) | Medium | SEPLN community |
+| 🔵 **P3** | **Valencian-specific tasks** (distinct from Catalan) | Low-Medium | Medium — regional identity | AINA / Valencia NLP |
+| 🔵 **P3** | **Portuguese (Brazil + Portugal) benchmarks** | Low-Medium | Medium — Lusophone coverage | Portugal / Brazil NLP |
+---
+### 🤝 How to Contribute a Dataset
+1. **Check** if the dataset is already on the Hugging Face Hub or listed above
+2. **Prepare** the data in one of these formats:
+   - **MCQ**: `{"question": "...", "choices": ["A", "B", "C", "D"], "answer": 0, ...}`
+   - **QA**: `{"question": "...", "context": "...", "answers": ["..."], ...}`
+   - **NLI**: `{"premise": "...", "hypothesis": "...", "label": "entailment|neutral|contradiction"}`
+   - **Summarization**: `{"article": "...", "summary": "..."}`
+3. **Create** a Hugging Face Dataset repo with a Dataset Card
+4. **Write** a YAML task definition for `lm-eval-harness` (see examples above)
+5. **Open** a discussion on [La Leaderboard v2 Community](https://huggingface.co/spaces/pauvanbr/la-leaderboard-v2/discussions) or submit a PR
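Before opening a discussion or PR, records can be checked against the four formats listed in step 2. The validator below is a minimal sketch: the field names come from that list, while the function name and the specific checks (for example, that `answer` is a valid index into `choices`) are assumptions about what a reviewer would expect.

```python
REQUIRED_FIELDS = {
    "mcq": {"question", "choices", "answer"},
    "qa": {"question", "context", "answers"},
    "nli": {"premise", "hypothesis", "label"},
    "summarization": {"article", "summary"},
}
NLI_LABELS = {"entailment", "neutral", "contradiction"}


def validate_record(record: dict, task_type: str) -> list:
    """Return a list of human-readable problems; an empty list means the record passes."""
    missing = REQUIRED_FIELDS[task_type] - set(record)
    if missing:
        return [f"missing fields: {sorted(missing)}"]
    errors = []
    if task_type == "mcq":
        choices, answer = record["choices"], record["answer"]
        if not isinstance(answer, int) or not 0 <= answer < len(choices):
            errors.append("answer must be a zero-based index into choices")
    elif task_type == "qa" and not record["answers"]:
        errors.append("answers must contain at least one reference answer")
    elif task_type == "nli" and record["label"] not in NLI_LABELS:
        errors.append(f"label must be one of {sorted(NLI_LABELS)}")
    return errors
```

Extra keys such as `subject` or `year` are allowed, matching the `...` in the formats above.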
+---
+### 📖 References & Further Reading
+| Paper / Resource | Relevance |
+|-----------------|-----------|
+| *La Leaderboard* (ACL 2025, arXiv:2507.00999) | Original leaderboard methodology |
+| *IberBench* (arXiv:2504.16921) | Complementary Iberian benchmark with 101 datasets |
+| *LLMzSzŁ* (arXiv:2501.02266) | Polish national exam digitization pipeline — model for EBAU |
+| *KMMLU* / *Open Ko-LLM* (arXiv:2402.11548, 2410.12445) | Korean exam benchmarks + private test sets |
+| *CulturalBench* (arXiv:2410.02677) | Methodology for cultural knowledge benchmarks |
+| *BLEnD* (arXiv:2406.09948) | Multicultural QA across 16 countries |
+| *BertaQA* (arXiv:2406.07302) | Basque cultural trivia — model for regional cultural QA |
+| *EXAMS* (EMNLP 2020, arXiv:2011.03080) | Multilingual high school exams (includes 235 Spanish questions) |
+| *CasiMedicos* (arXiv:2503.00025) | Spanish MIR evaluation |
+| *PARIKSHA* (arXiv:2406.15053) | Human-LLM evaluator agreement on multicultural data |
 ---
+*This README and annex are living documents. If you know of a dataset, exam, or benchmark source not listed here, please open a discussion or PR!* 💛