---
language:
- es
- ca
license: apache-2.0
tags:
- modernbert
- tourism
- spanish
- valencian
- fill-mask
- encoder
- bert
- continual-pretraining
datasets:
- gplsi/alia_tourism
base_model: BSC-LT/MrBERT
library_name: transformers
pipeline_tag: fill-mask
---

# Aitana-Tourism-Encoder (Spanish & Valencian)

A **ModernBERT-base** model continually pretrained on **tourism domain** data in **Spanish** and **Valencian**. This specialized encoder model is optimized for understanding tourism-related texts, including hotel descriptions, destination guides, travel services, and cultural heritage content.

## Table of Contents

- [Model Description](#model-description)
- [Training Data](#training-data)
- [Training Configuration](#training-configuration)
- [Training Results](#training-results)
- [Intended Uses](#intended-uses)
- [How to Use](#how-to-use)
- [Additional Information](#additional-information)

## Model Description

| Attribute | Value |
|-----------|-------|
| **Base Model** | [BSC-LT/MrBERT](https://huggingface.co/BSC-LT/MrBERT) |
| **Architecture** | FlexBERT (22 layers, 768 hidden, 12 heads) |
| **Parameters** | ~149M |
| **Vocabulary Size** | 256,000 tokens |
| **Max Sequence Length** | 8,192 tokens |
| **Languages** | Spanish (es), Valencian (va) |
| **Domain** | Tourism |

## Training Data

This model was trained on the [gplsi/alia_tourism](https://huggingface.co/datasets/gplsi/alia_tourism) dataset, filtered for Spanish and Valencian languages.

### Dataset Statistics

| Metric | Value |
|--------|-------|
| **Total Documents** | 66,548 |
| **Spanish Documents** | 49,644 (74.6%) |
| **Valencian Documents** | 16,904 (25.4%) |
| **Raw Text Size** | 1.2 GB |
| **Training Samples** | 80,839 |
| **Validation Samples** | 8,862 |
| **Total Tokens (Train)** | ~348 million |
| **Tokens Seen (4 epochs)** | ~1.39 billion |

### Data Processing Pipeline

1. **Download**: Extracted from [gplsi/alia_tourism](https://huggingface.co/datasets/gplsi/alia_tourism) HuggingFace dataset
2. **Filtering**: Selected only `language=["es", "va"]` subsets
3. **Tokenization**: BPE tokenization with MrBERT tokenizer (256k vocab)
4. **Chunking**: Packed into 8,192-token sequences
5. **Split**: 90% train / 10% validation

## Training Configuration

| Parameter | Value |
|-----------|-------|
| **Training Epochs** | 4 |
| **Sequence Length** | 8,192 |
| **MLM Probability** | 30% (train), 15% (eval) |
| **Batch Size** | 32 |
| **Learning Rate** | 5e-5 (cosine decay to 5e-6) |
| **Warmup** | 101 batches (1%) |
| **Optimizer** | StableAdamW |
| **Precision** | bfloat16 |
| **Hardware** | 1× NVIDIA RTX 4090 |

## Training Results

| Epoch | Training Loss | Masked Accuracy |
|-------|--------------|-----------------|
| 1 | 2.84 → 1.30 | 80.64% → 84.39% |
| 2 | 1.07 → 1.05 | 85.67% |
| 3 | 0.92 → 1.26 | 86.11% |
| **Final** | **1.26** | **86.11%** |

### Key Achievements

- ✅ **87% loss reduction** (9.4 → 1.26)
- ✅ **+5.5 pp accuracy gain** (80.6% → 86.1%)
- ✅ **No overfitting** observed
- ✅ **Stable gradients** throughout training

## Intended Uses

### Primary Use Cases

- **Tourism NLP**: Named entity recognition, text classification, sentiment analysis for tourism content
- **Semantic Search**: Document retrieval and similarity for travel-related queries
- **Information Extraction**: Extracting entities like hotels, destinations, amenities
- **Multilingual Tourism**: Processing Spanish and Valencian tourism texts

### Out-of-Scope Uses

- General-purpose language understanding outside tourism domain
- Languages other than Spanish and Valencian
- Text generation (this is an encoder-only model)

### Limitations

- **Domain-specific**: Performance may degrade on non-tourism texts
- **Language coverage**: Optimized for Spanish (es) and Valencian (va) only
- **Encoder-only**: Cannot generate text, only encode/understand

### Ethical Considerations

The
training data is automatically curated from tourism sources and may contain:

- Geographic and cultural biases toward specific regions
- Commercial content from tourism businesses
- Limited representation of certain destinations or services

Users should evaluate the model's outputs for fairness and bias in their specific applications.

## How to Use

### Transformers

```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

model_id = "gplsi/Aitana-tourism-mb-encoder-1.0"
model = AutoModelForMaskedLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Fill-mask example: use the tokenizer's own mask token instead of hard-coding it
text = f"El hotel ofrece vistas {tokenizer.mask_token} al mar Mediterráneo."
inputs = tokenizer(text, return_tensors="pt")

# Inference only, so gradients are not needed
with torch.no_grad():
    outputs = model(**inputs)

# Locate the masked position and decode the highest-scoring token
mask_token_index = (inputs.input_ids == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
predicted_token_id = outputs.logits[0, mask_token_index].argmax(dim=-1)
print(tokenizer.decode(predicted_token_id))
```

### For Embeddings

```python
import torch
from transformers import AutoModel, AutoTokenizer

model_id = "gplsi/Aitana-tourism-mb-encoder-1.0"
model = AutoModel.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

text = "Descubre las playas de la Costa Blanca"
inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True)

with torch.no_grad():
    outputs = model(**inputs)

# Mean pooling over real tokens only: the attention mask excludes padding positions
mask = inputs["attention_mask"].unsqueeze(-1).to(outputs.last_hidden_state.dtype)
embeddings = (outputs.last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1)
```

## Evaluation

This model was evaluated using the GLUE and SuperGLUE benchmarks.
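The results table that follows lists per-seed scores next to a `Mean +/- std` column. The card does not say which standard deviation is reported. As a minimal sketch, assuming it is the sample standard deviation (n - 1 denominator), the column can be recomputed from the per-seed scores with the standard library. The names `scores_by_task`, `summarize`, and `summary` are illustrative and not part of any released evaluation code.

```python
from statistics import mean, stdev

# Per-seed scores copied from two rows of the evaluation table.
scores_by_task = {
    "GLUE MRPC F1": [74.46, 73.59, 70.83, 73.16, 73.74],
    "SuperGLUE COPA acc": [54.00, 48.00, 47.00, 53.00, 49.00],
}


def summarize(values):
    """Return the mean and the sample standard deviation (n - 1 denominator)."""
    return mean(values), stdev(values)


# Assumption: the table's "std" is the sample standard deviation.
summary = {task: summarize(values) for task, values in scores_by_task.items()}

for task, (avg, std) in summary.items():
    print(f"{task}: {avg:.2f} +/- {std:.2f}")
```

Under that assumption the two rows come out as 73.16 +/- 1.38 and 50.20 +/- 3.11, matching the table. Rows with a single seed have no spread to report, and small differences in the last digit are possible where the published per-seed scores were themselves rounded.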
| Suite | Task/Metric | Seeds | Scores by seed | Mean +/- std |
|---|---|---:|---|---:|
| GLUE | CoLA MCC | 4 | 19: 7.05, 8364: 8.50, 717: 12.39, 10536: 11.77 | 9.93 +/- 2.57 |
| GLUE | MNLI acc | 1 | 19: 65.26 | 65.26 |
| GLUE | MNLI-mm acc | 1 | 19: 65.69 | 65.69 |
| GLUE | MRPC F1 | 5 | 19: 74.46, 8364: 73.59, 717: 70.83, 10536: 73.16, 90166: 73.74 | 73.16 +/- 1.38 |
| GLUE | QNLI acc | 1 | 19: 63.44 | 63.44 |
| GLUE | QQP F1 | 1 | 19: 78.97 | 78.97 |
| GLUE | RTE acc | 5 | 19: 49.10, 8364: 48.38, 717: 52.71, 10536: 53.43, 90166: 50.18 | 50.76 +/- 2.22 |
| GLUE | SST-2 acc | 3 | 19: 80.39, 8364: 80.28, 717: 80.73 | 80.47 +/- 0.24 |
| GLUE | STS-B Spearman | 5 | 19: 29.77, 8364: 24.73, 717: 28.84, 10536: 25.05, 90166: 29.55 | 27.59 +/- 2.49 |
| SuperGLUE partial | MNLI acc | 1 | 19: 66.39 | 66.39 |
| SuperGLUE partial | MNLI-mm acc | 1 | 19: 67.48 | 67.48 |
| SuperGLUE partial | RTE acc | 5 | 19: 52.35, 8364: 48.01, 717: 53.79, 10536: 54.87, 90166: 50.54 | 51.91 +/- 2.72 |
| SuperGLUE partial | BoolQ acc | 3 | 23: 64.80, 42: 64.68, 6033: 65.75 | 65.08 +/- 0.59 |
| SuperGLUE partial | CB acc | 3 | 23: 71.43, 42: 71.43, 6033: 69.64 | 70.83 +/- 1.03 |
| SuperGLUE partial | CB F1 | 3 | 23: 59.05, 42: 49.89, 6033: 58.92 | 55.95 +/- 5.25 |
| SuperGLUE partial | COPA acc | 5 | 23: 54.00, 42: 48.00, 6033: 47.00, 1337: 53.00, 24: 49.00 | 50.20 +/- 3.11 |
| SuperGLUE partial | SWAG acc | 1 | 19: 27.55 | 27.55 |
| SuperGLUE partial | WiC acc | 3 | 23: 57.21, 42: 57.84, 6033: 57.37 | 57.47 +/- 0.33 |

## Additional Information

### Author

The model has been developed by the **[Language and Information Systems Group (GPLSI)](https://gplsi.dlsi.ua.es/)** and the **[Centro de Inteligencia Digital (CENID)](https://cenid.es)**, both part of the **[University of Alicante (UA)](https://www.ua.es/es/)**, within their ongoing research in **Natural Language Processing (NLP)**.
### Funding

This work is funded by the **Ministerio para la Transformación Digital y de la Función Pública**, co-financed by the **EU – NextGenerationEU**, within the framework of the project **Desarrollo de Modelos ALIA**.

### Acknowledgments

We would like to express our gratitude to all individuals and institutions that have contributed to the development of this work.

Special thanks to:

- [Language Technologies Laboratory at Barcelona Supercomputing Center](https://www.bsc.es/es/discover-bsc/organisation/research-structure/language-technologies-laboratory)
- [Centro Vasco de Tecnología de la Lengua (HiTZ)](https://www.hitz.eus/es)
- [Centro Singular de Investigación en Tecnologías Inteligentes (CiTIUS)](https://citius.gal/)
- [Sistemas Inteligentes de Acceso a la Información (SINAI)](https://www.ujaen.es/investigacion-y-transferencia/grupos-de-investigacion/sistemas-inteligentes-de-acceso-la-informacion-sinai)
- [Instituto Universitario de Investigación Informática (IUII)](https://web.ua.es/es/iuii/)
- [Leonardo HPC System](https://leonardo-supercomputer.cineca.eu/)
- [European supercomputing ecosystem (EUROHPC)](https://www.eurohpc-ju.europa.eu/)
- [MrBERT](https://arxiv.org/abs/2602.21379) for the original model

We also acknowledge the financial, technical, and scientific support of the **Ministerio para la Transformación Digital y de la Función Pública - Funded by EU – NextGenerationEU within the framework of the project Desarrollo de Modelos ALIA**, whose contribution has been essential to the completion of this research.

### License

This model is released under the [Apache License 2.0](https://www.apache.org/licenses/LICENSE-2.0).

### Disclaimer

This model is intended for general purposes and is available under a permissive Apache License 2.0. Be aware that the model may have biases and/or undesirable outputs. Users deploying systems based on this model are responsible for mitigating risks and complying with applicable AI regulations.
### Reference

If you use this model, please cite:

```bibtex
@misc{modernbert-tourism-2025,
  author = {Yáñez-Romero, Fabio and Sepúlveda-Torres, Robiert and Estevanell-Valladares, Ernesto L. and Galeano, Santiago and Martínez-Murillo, Iván and Grande, Eduardo and Canal-Esteve, Miquel and Miró Maestre, María and Bonora, Mar and Gutierrez, Yoan and Abreu Salas, José Ignacio and Consuegra-Ayala, Juan Pablo and Lloret, Elena and Montoyo, Andrés and Muñoz-Guillena and Palomar, Manuel},
  title = {Aitana Tourism Encoder: Domain-Adapted Language Model for Spanish and Valencian Tourism},
  year = {2025},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/gplsi/Aitana-tourism-mb-encoder-1.0}}
}
```

---

**Copyright © 2025 Language and Information Systems Group (GPLSI) and Centro de Inteligencia Digital (CENID), University of Alicante (UA). Distributed under the Apache License 2.0.**