---
language:
  - es
  - ca
license: apache-2.0
tags:
  - modernbert
  - tourism
  - spanish
  - valencian
  - fill-mask
  - encoder
  - bert
  - continual-pretraining
datasets:
  - gplsi/alia_tourism
base_model: BSC-LT/MrBERT
library_name: transformers
pipeline_tag: fill-mask
---

# Aitana-Tourism-Encoder (Spanish & Valencian)


A **ModernBERT-base** model continually pretrained on **tourism domain** data in **Spanish** and **Valencian**. This specialized encoder model is optimized for understanding tourism-related texts, including hotel descriptions, destination guides, travel services, and cultural heritage content.

## Table of Contents

- [Model Description](#model-description)
- [Training Data](#training-data)
- [Training Configuration](#training-configuration)
- [Training Results](#training-results)
- [Intended Uses](#intended-uses)
- [How to Use](#how-to-use)
- [Evaluation](#evaluation)
- [Additional Information](#additional-information)

## Model Description

| Attribute | Value |
|-----------|-------|
| **Base Model** | [BSC-LT/MrBERT](https://huggingface.co/BSC-LT/MrBERT) |
| **Architecture** | FlexBERT (22 layers, 768 hidden, 12 heads) |
| **Parameters** | ~149M |
| **Vocabulary Size** | 256,000 tokens |
| **Max Sequence Length** | 8,192 tokens |
| **Languages** | Spanish (es), Valencian (va) |
| **Domain** | Tourism |

## Training Data

This model was trained on the [gplsi/alia_tourism](https://huggingface.co/datasets/gplsi/alia_tourism) dataset, filtered for Spanish and Valencian languages.

### Dataset Statistics

| Metric | Value |
|--------|-------|
| **Total Documents** | 66,548 |
| **Spanish Documents** | 49,644 (74.6%) |
| **Valencian Documents** | 16,904 (25.4%) |
| **Raw Text Size** | 1.2 GB |
| **Training Samples** | 80,839 |
| **Validation Samples** | 8,862 |
| **Total Tokens (Train)** | ~348 million |
| **Tokens Seen (4 epochs)** | ~1.39 billion |

### Data Processing Pipeline

1. **Download**: Loaded from the [gplsi/alia_tourism](https://huggingface.co/datasets/gplsi/alia_tourism) dataset on Hugging Face
2. **Filtering**: Selected only `language=["es", "va"]` subsets
3. **Tokenization**: BPE tokenization with MrBERT tokenizer (256k vocab)
4. **Chunking**: Packed into 8,192-token sequences
5. **Split**: 90% train / 10% validation
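
The chunking and split steps above can be sketched in plain Python. This is a minimal illustration, not the training code: the helper names `pack_sequences` and `train_val_split`, the separator id, the choice to drop the trailing partial chunk, and the shuffling seed are all assumptions made for the example.

```python
import random
from typing import Iterable


def pack_sequences(docs: Iterable[list[int]], seq_len: int, sep_id: int) -> list[list[int]]:
    """Concatenate tokenized documents and cut the stream into fixed-length chunks."""
    stream: list[int] = []
    for ids in docs:
        stream.extend(ids)
        stream.append(sep_id)  # mark the document boundary (assumed separator id)
    # Keep only full chunks; dropping the trailing remainder is an assumption.
    n_full = len(stream) // seq_len
    return [stream[i * seq_len:(i + 1) * seq_len] for i in range(n_full)]


def train_val_split(samples: list, val_fraction: float = 0.10, seed: int = 0):
    """Shuffle the packed samples and hold out a fraction for validation."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_val = int(round(len(shuffled) * val_fraction))
    return shuffled[n_val:], shuffled[:n_val]


# Toy data: 10 "documents" of 3 token ids each, packed into 4-token chunks
# (the real pipeline uses 8,192-token sequences).
docs = [[i, i + 1, i + 2] for i in range(1, 31, 3)]
chunks = pack_sequences(docs, seq_len=4, sep_id=0)
train, val = train_val_split(chunks, val_fraction=0.10)
print(len(chunks), len(train), len(val))
```

With the toy input, each document plus its separator fills exactly one chunk, so the split holds out one chunk of ten for validation.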

## Training Configuration

| Parameter | Value |
|-----------|-------|
| **Training Epochs** | 4 |
| **Sequence Length** | 8,192 |
| **MLM Probability** | 30% (train), 15% (eval) |
| **Batch Size** | 32 |
| **Learning Rate** | 5e-5 (cosine decay to 5e-6) |
| **Warmup** | 101 batches (1%) |
| **Optimizer** | StableAdamW |
| **Precision** | bfloat16 |
| **Hardware** | 1× NVIDIA RTX 4090 |
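
The warmup length and learning-rate values in the table can be related with a short sketch. It assumes a linear warmup followed by cosine decay, that the batch size counts packed sequences, and that the last partial batch of each epoch is dropped; none of these details is confirmed by the table, and `lr_at_step` is a hypothetical helper rather than the scheduler actually used.

```python
import math


def lr_at_step(step: int, total_steps: int, warmup_steps: int = 101,
               peak_lr: float = 5e-5, final_lr: float = 5e-6) -> float:
    """Linear warmup to peak_lr, then cosine decay down to final_lr (assumed shape)."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return final_lr + (peak_lr - final_lr) * cosine


samples, batch_size, epochs = 80_839, 32, 4
steps_per_epoch = samples // batch_size    # 2526, dropping the last partial batch
total_steps = steps_per_epoch * epochs     # 10104
warmup_steps = round(0.01 * total_steps)   # 101, matching the 1% warmup in the table

for step in (0, warmup_steps, total_steps // 2, total_steps):
    print(step, f"{lr_at_step(step, total_steps, warmup_steps):.2e}")
```

Under these assumptions, 1% of the total batch count rounds to the 101 warmup batches listed above.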

## Training Results

| Epoch | Training Loss | Masked Accuracy |
|-------|--------------|-----------------|
| 1 | 2.84 → 1.30 | 80.64% → 84.39% |
| 2 | 1.07 → 1.05 | 85.67% |
| 3 | 0.92 → 1.26 | 86.11% |
| **Final** | **1.26** | **86.11%** |

### Key Achievements

- ✅ **87% loss reduction** (9.4 → 1.26)
- ✅ **+5.5 pp accuracy gain** (80.6% → 86.1%)
- ✅ **No overfitting** observed
- ✅ **Stable gradients** throughout training
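
The two headline figures follow from simple arithmetic on the numbers quoted in the list. The snippet below only reproduces that calculation; `relative_reduction` is a helper introduced for illustration.

```python
def relative_reduction(start: float, end: float) -> float:
    """Fraction by which a value dropped relative to its starting point."""
    return (start - end) / start


loss_reduction = relative_reduction(9.4, 1.26)  # quoted as 87% after rounding
accuracy_gain_pp = 86.11 - 80.64                # percentage points, quoted as +5.5

print(f"{loss_reduction:.1%}", f"{accuracy_gain_pp:+.2f} pp")  # 86.6% +5.47 pp
```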


## Intended Uses

### Primary Use Cases

- **Tourism NLP**: Named entity recognition, text classification, sentiment analysis for tourism content
- **Semantic Search**: Document retrieval and similarity for travel-related queries
- **Information Extraction**: Extracting entities like hotels, destinations, amenities
- **Multilingual Tourism**: Processing Spanish and Valencian tourism texts


### Out-of-Scope Uses

- General-purpose language understanding outside tourism domain
- Languages other than Spanish and Valencian
- Text generation (this is an encoder-only model)

### Limitations

- **Domain-specific**: Performance may degrade on non-tourism texts
- **Language coverage**: Optimized for Spanish (es) and Valencian (va) only
- **Encoder-only**: Cannot generate text, only encode/understand

### Ethical Considerations

The training data is automatically curated from tourism sources and may contain:
- Geographic and cultural biases toward specific regions
- Commercial content from tourism businesses
- Limited representation of certain destinations or services

Users should evaluate the model's outputs for fairness and bias in their specific applications.


## How to Use

### Transformers

```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

model_id = "gplsi/Aitana-tourism-mb-encoder-1.0"
model = AutoModelForMaskedLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Fill-mask example; use the tokenizer's own mask token instead of hard-coding it
text = f"El hotel ofrece vistas {tokenizer.mask_token} al mar Mediterráneo."
inputs = tokenizer(text, return_tensors="pt")

# Inference only, so gradient tracking is disabled
with torch.no_grad():
    outputs = model(**inputs)

# Locate the masked position and decode the highest-scoring token
mask_token_index = (inputs.input_ids == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
predicted_token_id = outputs.logits[0, mask_token_index].argmax(dim=-1)
print(tokenizer.decode(predicted_token_id))
```

### For Embeddings

```python
import torch
from transformers import AutoModel, AutoTokenizer

model_id = "gplsi/Aitana-tourism-mb-encoder-1.0"
model = AutoModel.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

text = "Descubre las playas de la Costa Blanca"
inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True)

with torch.no_grad():
    outputs = model(**inputs)

# Mean pooling over real tokens only; padding positions are excluded via the attention mask
mask = inputs["attention_mask"].unsqueeze(-1).to(outputs.last_hidden_state.dtype)
embeddings = (outputs.last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1)
```
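
For the semantic search use case, pooled embeddings are typically compared with cosine similarity. The following is a minimal sketch using toy 3-dimensional vectors in place of real 768-dimensional embeddings (which could be obtained from the tensors above via `.tolist()`); `cosine_similarity` and `rank_documents` are hypothetical helpers, not part of this model or of `transformers`.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_documents(query_vec: list[float], doc_vecs: list[list[float]]):
    """Return (document index, score) pairs sorted from most to least similar."""
    scores = [(i, cosine_similarity(query_vec, v)) for i, v in enumerate(doc_vecs)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Toy vectors standing in for a query embedding and three document embeddings
query = [1.0, 0.0, 1.0]
docs = [[0.0, 1.0, 0.0], [1.0, 0.1, 0.9], [0.5, 0.5, 0.5]]
ranking = rank_documents(query, docs)
print([i for i, _ in ranking])  # [1, 2, 0]
```

For larger collections, a vectorized implementation or a dedicated vector index would usually replace this loop.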


## Evaluation

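This model was evaluated using the GLUE and SuperGLUE benchmarks. The table lists the score obtained with each random seed and, where more than one seed was run, the mean and sample standard deviation across seeds.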

| Suite | Task/Metric | Seeds | Scores by seed | Mean +/- std |
|---|---:|---:|---|---:|
| GLUE | CoLA MCC | 4 | 19: 7.05, 8364: 8.50, 717: 12.39, 10536: 11.77 | 9.93 +/- 2.57 |
| GLUE | MNLI acc | 1 | 19: 65.26 | 65.26 |
| GLUE | MNLI-mm acc | 1 | 19: 65.69 | 65.69 |
| GLUE | MRPC F1 | 5 | 19: 74.46, 8364: 73.59, 717: 70.83, 10536: 73.16, 90166: 73.74 | 73.16 +/- 1.38 |
| GLUE | QNLI acc | 1 | 19: 63.44 | 63.44 |
| GLUE | QQP F1 | 1 | 19: 78.97 | 78.97 |
| GLUE | RTE acc | 5 | 19: 49.10, 8364: 48.38, 717: 52.71, 10536: 53.43, 90166: 50.18 | 50.76 +/- 2.22 |
| GLUE | SST-2 acc | 3 | 19: 80.39, 8364: 80.28, 717: 80.73 | 80.47 +/- 0.24 |
| GLUE | STS-B Spearman | 5 | 19: 29.77, 8364: 24.73, 717: 28.84, 10536: 25.05, 90166: 29.55 | 27.59 +/- 2.49 |
| SuperGLUE partial | MNLI acc | 1 | 19: 66.39 | 66.39 |
| SuperGLUE partial | MNLI-mm acc | 1 | 19: 67.48 | 67.48 |
| SuperGLUE partial | RTE acc | 5 | 19: 52.35, 8364: 48.01, 717: 53.79, 10536: 54.87, 90166: 50.54 | 51.91 +/- 2.72 |
| SuperGLUE partial | BoolQ acc | 3 | 23: 64.80, 42: 64.68, 6033: 65.75 | 65.08 +/- 0.59 |
| SuperGLUE partial | CB acc | 3 | 23: 71.43, 42: 71.43, 6033: 69.64 | 70.83 +/- 1.03 |
| SuperGLUE partial | CB F1 | 3 | 23: 59.05, 42: 49.89, 6033: 58.92 | 55.95 +/- 5.25 |
| SuperGLUE partial | COPA acc | 5 | 23: 54.00, 42: 48.00, 6033: 47.00, 1337: 53.00, 24: 49.00 | 50.20 +/- 3.11 |
| SuperGLUE partial | SWAG acc | 1 | 19: 27.55 | 27.55 |
| SuperGLUE partial | WiC acc | 3 | 23: 57.21, 42: 57.84, 6033: 57.37 | 57.47 +/- 0.33 |
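
The "Mean +/- std" column can be reproduced from the per-seed scores. The sketch below does so for the CoLA row with Python's `statistics` module; using the sample standard deviation (`stdev`) is an inference from the reported values rather than something the table states.

```python
from statistics import mean, stdev

# Per-seed CoLA MCC scores copied from the table (seed -> score)
cola_mcc = {19: 7.05, 8364: 8.50, 717: 12.39, 10536: 11.77}

scores = list(cola_mcc.values())
summary = f"{mean(scores):.2f} +/- {stdev(scores):.2f}"
print(summary)  # 9.93 +/- 2.57
```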

## Additional Information

### Author

The model has been developed by the **[Language and Information Systems Group (GPLSI)](https://gplsi.dlsi.ua.es/)** and the **[Centro de Inteligencia Digital (CENID)](https://cenid.es)**, both part of the **[University of Alicante (UA)](https://www.ua.es/es/)**, as part of their ongoing research in **Natural Language Processing (NLP)**.


### Funding

This work is funded by the **Ministerio para la Transformación Digital y de la Función Pública**, co-financed by the **EU – NextGenerationEU**, within the framework of the project **Desarrollo de Modelos ALIA**.

### Acknowledgments

We would like to express our gratitude to all individuals and institutions that have contributed to the development of this work.

Special thanks to:
- [Language Technologies Laboratory at Barcelona Supercomputing Center](https://www.bsc.es/es/discover-bsc/organisation/research-structure/language-technologies-laboratory) 
- [Centro Vasco de Tecnología de la Lengua (HiTZ)](https://www.hitz.eus/es)
- [Centro Singular de Investigación en Tecnologías Inteligentes (CiTIUS)](https://citius.gal/)
- [Sistemas Inteligentes de Acceso a la Información (SINAI)](https://www.ujaen.es/investigacion-y-transferencia/grupos-de-investigacion/sistemas-inteligentes-de-acceso-la-informacion-sinai)
- [Instituto Universitario de Investigación Informática (IUII)](https://web.ua.es/es/iuii/)
- [Leonardo HPC System](https://leonardo-supercomputer.cineca.eu/)
- [European supercomputing ecosystem (EUROHPC)](https://www.eurohpc-ju.europa.eu/)
- [MrBERT](https://arxiv.org/abs/2602.21379) for the original model


We also acknowledge the financial, technical, and scientific support of the **Ministerio para la Transformación Digital y de la Función Pública**, funded by the **EU – NextGenerationEU** within the framework of the project **Desarrollo de Modelos ALIA**, whose contribution has been essential to the completion of this research.


### License

This model is released under the [Apache License 2.0](https://www.apache.org/licenses/LICENSE-2.0).


### Disclaimer

This model is released for general use under the permissive Apache License 2.0. Be aware that the model may exhibit biases or produce undesirable outputs. Users deploying systems based on this model are responsible for mitigating risks and complying with applicable AI regulations.


### Reference

If you use this model, please cite:

```bibtex
@misc{modernbert-tourism-2025,
  author = {Yáñez-Romero, Fabio and Sepúlveda-Torres, Robiert and Estevanell-Valladares, Ernesto L. and Galeano, Santiago and Martínez-Murillo, Iván and Grande, Eduardo and Canal-Esteve, Miquel and Miró Maestre, María and Bonora, Mar and Gutierrez, Yoan and Abreu Salas, José Ignacio and Consuegra-Ayala, Juan Pablo and Lloret, Elena and Montoyo, Andrés and Muñoz-Guillena and Palomar, Manuel},
  title = {Aitana Tourism Encoder: Domain-Adapted Language Model for Spanish and Valencian Tourism},
  year = {2025},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/gplsi/Aitana-tourism-mb-encoder-1.0}}
}
```

---

**Copyright © 2025 Language and Information Systems Group (GPLSI) and Centro de Inteligencia Digital (CENID),
University of Alicante (UA).
Distributed under the Apache License 2.0.**