---
language:
- es
- it
- ro
- en
- nl
- sv
- cs
license: cc-by-4.0
library_name: PyTorch
pipeline_tag: token-classification
tags:
- ner
- named-entity-recognition
- token-classification
- clinical-ner
- biomedical-ner
- multilingual
- multi-task-learning
- multilabel
datasets:
- distemist
- symptemist
- medprocner
- cardioccc
metrics:
- strict precision
- strict recall
- strict f1
- character-level precision
- character-level recall
- character-level f1
base_model: xlm-roberta-base
---

# DT4H_XLM-R_mtl_multilingual_multilabel

## Model Description

This **multilingual clinical Named Entity Recognition (NER)** model identifies **disease**, **symptom**, and **clinical procedure** mentions in biomedical and clinical text. It is based on [`xlm-roberta-base`](https://huggingface.co/FacebookAI/xlm-roberta-base) and fine-tuned on translated variants of the clinical NER datasets **DisTEMIST**, **SympTEMIST**, **MedProcNER**, and **CardioCCC**, which consist of clinical case reports with manually annotated mentions of the three entity types. Fine-tuning follows a **multi-task learning (MTL)** approach and uses the BIO tagging scheme for sequence labeling.

The model consists of a **shared multilingual encoder** and a set of **entity-specific token classification heads**, each responsible for a different task. In this configuration, each classification head is trained on **entity-specific data from all supported languages**.

- **Architecture:** Multi-task learning (MTL)  
- **Training setup:** Multilingual, Multilabel (DISEASE, SYMPTOM, PROCEDURE)
- **Tasks:** `disease`, `symptom`, `procedure`
- **Supported languages:**
  - Spanish (`es`)
  - Italian (`it`)
  - Romanian (`ro`)
  - English (`en`)
  - Dutch (`nl`)
  - Swedish (`sv`)
  - Czech (`cs`)
- **Base model:** `xlm-roberta-base`  
- **Task:** Token classification (NER)  
- **Label scheme:** BIO
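
In the BIO scheme, a `B-` tag opens a mention, an `I-` tag continues it, and `O` marks words outside any mention. The following is a minimal sketch of how word-level tags from one head could be merged into character spans. The helper name `bio_to_spans`, the tag strings such as `B-DISEASE`, and the word offsets are assumptions for illustration, not the decoding code shipped with this model.

```python
from typing import List, Tuple

def bio_to_spans(
    tags: List[str],
    offsets: List[Tuple[int, int]],
) -> List[Tuple[str, int, int]]:
    """Merge word-level BIO tags into (label, start_char, end_char) spans."""
    spans = []
    current = None  # [label, start, end] of the mention being built
    for tag, (start, end) in zip(tags, offsets):
        prefix, _, label = tag.partition("-")
        if prefix == "I" and current is not None and current[0] == label:
            current[2] = end  # extend the open mention
            continue
        if current is not None:
            spans.append(tuple(current))
            current = None
        if prefix in ("B", "I"):
            # Assumption: a stray I- tag without a matching B- starts a new mention.
            current = [label, start, end]
    if current is not None:
        spans.append(tuple(current))
    return spans

text = "Patient with acute myocarditis and fever"
# Hypothetical word offsets and tags for the sentence above (end is exclusive).
offsets = [(0, 7), (8, 12), (13, 18), (19, 30), (31, 34), (35, 40)]
tags = ["O", "O", "B-DISEASE", "I-DISEASE", "O", "O"]
spans = bio_to_spans(tags, offsets)
```

Here `spans` contains a single `DISEASE` mention whose offsets slice `acute myocarditis` out of `text`.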

## Training Data

The model is trained on multilingual clinical NER data combining **DisTEMIST**, **SympTEMIST**, **MedProcNER**, and **CardioCCC** across the supported languages. In this MTL setup, each classification head performs a single task: detecting mentions of one entity type in biomedical and clinical text. To that end, each head is fine-tuned on entity-specific data from all supported languages.

The training data is provided as part of the **MultiClinNER** subtask of the [**MultiClinAI**](https://temu.bsc.es/MultiClinAI/) shared task, an initiative within the DataTools4Heart (DT4H) project, which provides translated and annotation-projected clinical corpora.

Training and test splits correspond to the **MultiClinNER** task at the 11th SMM4H-HeaRD Workshop (ACL 2026).

## Model loading

This model uses a **custom MTL architecture**, and therefore cannot be loaded with:

```python
AutoModelForTokenClassification.from_pretrained(...)
```

Instead, this repository provides a **PyTorch checkpoint (`.pt`)** which includes:

- Encoder weights
- All entity-specific heads

## How to use

To use the model:

1. Download the `.pt` file
2. Load it using the custom architecture

To facilitate this process, we provide an inference script in a [GitHub repository]() that:

- Loads the model from the checkpoint using the custom architecture
- Processes `.txt` files from an input directory
- Extracts mentions of the entity type corresponding to the selected classification head (task)
- Exports predictions as a TSV file in the format required for the [MultiClinAI evaluation library](https://github.com/nlp4bia-bsc/MultiClinAIEval):

```text
filename                              label      start_span    end_span    text
MultiClinNER-en-test-disease-00019    DISEASE    154           165         myocarditis
```
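
As an illustration of this layout, the sketch below writes one prediction as a tab-separated row using only the standard library. The column names are taken from the example above; the `write_predictions` helper is hypothetical, and whether the evaluation library expects a header line should be checked against its documentation. The end offset is treated as exclusive, which is consistent with the example row (165 - 154 equals the length of `myocarditis`).

```python
import csv
import io

COLUMNS = ["filename", "label", "start_span", "end_span", "text"]

def write_predictions(rows, handle):
    """Write prediction dicts as tab-separated rows, preceded by a header line."""
    writer = csv.DictWriter(
        handle, fieldnames=COLUMNS, delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)

rows = [{
    "filename": "MultiClinNER-en-test-disease-00019",
    "label": "DISEASE",
    "start_span": 154,
    "end_span": 165,
    "text": "myocarditis",
}]
buffer = io.StringIO()  # stands in for an open output file
write_predictions(rows, buffer)
tsv = buffer.getvalue()
```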

⚠ **Note:** We recommend pre-tokenizing the input text into words, as this matches the training setup. Providing raw text directly may lead to slightly degraded performance.
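
One way to pre-tokenize raw text while keeping the character offsets needed for the TSV spans is a regular-expression split. This is only a sketch: the tokenizer used to prepare the training data is not specified here, so the pattern and the `pretokenize` helper are assumptions and may not reproduce the training tokenization exactly.

```python
import re
from typing import List, Tuple

# Assumption: a word is a run of letters/digits/underscores, and every
# punctuation mark is its own token. In Python 3, \w is Unicode-aware,
# so accented characters stay inside their word.
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")

def pretokenize(text: str) -> List[Tuple[str, int, int]]:
    """Return (token, start_char, end_char) triples; end is exclusive."""
    return [(m.group(), m.start(), m.end()) for m in TOKEN_PATTERN.finditer(text)]

tokens = pretokenize("Acute myocarditis, fever.")
words = [tok for tok, _, _ in tokens]  # word list to pass to the model
```

Keeping the offsets alongside the words lets predicted word-level tags be mapped back to positions in the original file.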

## Limitations and bias

At the time of submission, no formal bias or fairness evaluation has been conducted. We intend to conduct research in these areas in the future and will update this model card if that work is completed.

## Evaluation

Evaluation was conducted using **strict (exact match)** and **character-level** metrics on the MultiClinNER test set.
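
To illustrate how the two metric families differ, the sketch below scores a prediction that gets only one boundary right. It is a simplified illustration, not the official scorer: the helper names are hypothetical, labels and per-document aggregation are ignored, and the MultiClinAI evaluation library remains the authoritative definition.

```python
from typing import Set, Tuple

Span = Tuple[int, int]  # (start_char, end_char), end exclusive

def prf(tp: int, n_pred: int, n_gold: int) -> Tuple[float, float, float]:
    """Precision, recall and F1 from counts; 0.0 when a denominator is zero."""
    p = tp / n_pred if n_pred else 0.0
    r = tp / n_gold if n_gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

def strict_prf(gold: Set[Span], pred: Set[Span]):
    """A prediction counts only if both boundaries match a gold span exactly."""
    return prf(len(gold & pred), len(pred), len(gold))

def char_prf(gold: Set[Span], pred: Set[Span]):
    """Score the overlap between the characters covered by each side."""
    gold_chars = {i for s, e in gold for i in range(s, e)}
    pred_chars = {i for s, e in pred for i in range(s, e)}
    return prf(len(gold_chars & pred_chars), len(pred_chars), len(gold_chars))

gold = {(13, 30)}  # e.g. "acute myocarditis"
pred = {(19, 30)}  # e.g. "myocarditis" only: end matches, start does not
strict = strict_prf(gold, pred)
char = char_prf(gold, pred)
```

Under these assumptions the partial match scores zero on the strict metrics but still earns credit at the character level, which is why character-level scores are higher than strict scores in the tables.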

## Evaluation Results (MTL, Multilingual + Multilabel - DISEASE)

| Language | Strict P | Strict R | Strict F1 | Char P | Char R | Char F1 |
|----------|---------:|---------:|----------:|-------:|-------:|--------:|
| es | 0.5024 | 0.5949 | 0.5448 | 0.6437 | 0.7488 | 0.6923 |
| it | 0.5202 | 0.5087 | 0.5144 | 0.6614 | 0.6380 | 0.6495 |
| ro | 0.5536 | 0.5693 | 0.5613 | 0.6938 | 0.7071 | 0.7004 |
| en | 0.5476 | 0.4907 | 0.5176 | 0.6825 | 0.5996 | 0.6384 |
| nl | 0.5118 | 0.5283 | 0.5199 | 0.6323 | 0.6417 | 0.6370 |
| sv | 0.4821 | 0.5416 | 0.5101 | 0.6027 | 0.6635 | 0.6316 |
| cs | 0.4802 | 0.5007 | 0.4902 | 0.6107 | 0.6242 | 0.6174 |
| **Average** |  |  | **0.5226** |  |  | **0.6524** |

---

## Evaluation Results (MTL, Multilingual + Multilabel - SYMPTOM)

| Language | Strict P | Strict R | Strict F1 | Char P | Char R | Char F1 |
|----------|---------:|---------:|----------:|-------:|-------:|--------:|
| es | 0.1862 | 0.1734 | 0.1796 | 0.3518 | 0.3208 | 0.3356 |
| it | 0.2087 | 0.1478 | 0.1730 | 0.3714 | 0.2564 | 0.3033 |
| ro | 0.1819 | 0.1430 | 0.1602 | 0.3452 | 0.2644 | 0.2995 |
| en | 0.2501 | 0.1568 | 0.1927 | 0.4264 | 0.2618 | 0.3244 |
| nl | 0.2609 | 0.2120 | 0.2339 | 0.3985 | 0.3149 | 0.3518 |
| sv | 0.2429 | 0.2080 | 0.2241 | 0.3932 | 0.3257 | 0.3563 |
| cs | 0.1872 | 0.1718 | 0.1792 | 0.3296 | 0.2917 | 0.3095 |
| **Average** |  |  | **0.1918** |  |  | **0.3258** |

---

## Evaluation Results (MTL, Multilingual + Multilabel - PROCEDURE)

| Language | Strict P | Strict R | Strict F1 | Char P | Char R | Char F1 |
|----------|---------:|---------:|----------:|-------:|-------:|--------:|
| es | 0.6119 | 0.6336 | 0.6226 | 0.7575 | 0.7769 | 0.7671 |
| it | 0.6094 | 0.5043 | 0.5519 | 0.7561 | 0.6209 | 0.6819 |
| ro | 0.6565 | 0.6351 | 0.6456 | 0.7914 | 0.7599 | 0.7753 |
| en | 0.6169 | 0.5757 | 0.5956 | 0.7757 | 0.7192 | 0.7464 |
| nl | 0.6266 | 0.6203 | 0.6234 | 0.7526 | 0.7404 | 0.7465 |
| sv | 0.6131 | 0.6317 | 0.6223 | 0.7448 | 0.7614 | 0.7530 |
| cs | 0.6104 | 0.6269 | 0.6185 | 0.7527 | 0.7644 | 0.7585 |
| **Average** |  |  | **0.6114** |  |  | **0.7470** |
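
The **Average** rows are consistent with an unweighted mean over the seven languages. The averaging method is inferred from the numbers rather than stated, so the check below is only a sanity check, using the strict F1 values copied from the three tables.

```python
from statistics import mean

# Strict F1 per language, in table order (es, it, ro, en, nl, sv, Czech).
strict_f1 = {
    "DISEASE":   [0.5448, 0.5144, 0.5613, 0.5176, 0.5199, 0.5101, 0.4902],
    "SYMPTOM":   [0.1796, 0.1730, 0.1602, 0.1927, 0.2339, 0.2241, 0.1792],
    "PROCEDURE": [0.6226, 0.5519, 0.6456, 0.5956, 0.6234, 0.6223, 0.6185],
}

# Unweighted (macro) mean over languages, rounded to the tables' precision.
averages = {task: round(mean(scores), 4) for task, scores in strict_f1.items()}
```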

## Additional information

### Authors
NLP4BIA team at the Barcelona Supercomputing Center (nlp4bia@bsc.es).

### Contact information
judith.rosell [at] bsc.es

### Funding

This model is part of the DataTools4Heart project, funded by the European Union’s Horizon Europe Framework under Grant Agreement No. 101057849.