BSC-NLP4BIA/multiclinner-ro-symptom-xlm-roberta

@@ -41,7 +41,7 @@ model-index:
 <summary>Click to expand</summary>
 - [Model description](#model-description)
-- [How to use](#how-to-use)
 - [Evaluation](#evaluation)
 - [Limitations and bias](#limitations-and-bias)
 - [Additional information](#additional-information)
@@ -56,21 +56,68 @@ model-index:
 ## Model description
 A fine-tuned version of [FacebookAI/xlm-roberta-base](https://huggingface.co/FacebookAI/xlm-roberta-base) for **SYMPTOM Named Entity Recognition** in Romanian clinical text. It was developed for MultiClinNER, a subtask of the [MultiClinAI shared task](https://temu.bsc.es/MultiClinAI/) organized by the NLP4BIA team at the Barcelona Supercomputing Center as part of the #SMM4H-HeaRD Workshop at the ACL 2026 conference. The model labels mentions of **SYMPTOM** using the BIO tagging scheme: `O`, `B-SYMPTOM`, `I-SYMPTOM`.
-## How to use
-⚠ We recommend pre-tokenizing the input text into words instead of providing it directly to the model, as this is how the model was trained. Otherwise, the results and performance might get affected.
 ```python
 from transformers import pipeline
-ner = pipeline(
-    "token-classification",
-    model="BSC-NLP4BIA/multiclinner-ro-symptom-xlm-roberta",
-    aggregation_strategy="simple",
-)
-ner("...clinical text in Romanian...")
 ```
 ## Evaluation
 Scores were computed with the official [MultiClinAIEval](https://github.com/nlp4bia-bsc/MultiClinAIEval) library on the MultiClinNER **Romanian** gold-standard test documents for **SYMPTOM** (the shared-task gold set for this language and entity type).

 <summary>Click to expand</summary>
 - [Model description](#model-description)
+- [Usage](#usage)
 - [Evaluation](#evaluation)
 - [Limitations and bias](#limitations-and-bias)
 - [Additional information](#additional-information)
 ## Model description
 A fine-tuned version of [FacebookAI/xlm-roberta-base](https://huggingface.co/FacebookAI/xlm-roberta-base) for **SYMPTOM Named Entity Recognition** in Romanian clinical text. It was developed for MultiClinNER, a subtask of the [MultiClinAI shared task](https://temu.bsc.es/MultiClinAI/) organized by the NLP4BIA team at the Barcelona Supercomputing Center as part of the #SMM4H-HeaRD Workshop at the ACL 2026 conference. The model labels mentions of **SYMPTOM** using the BIO tagging scheme: `O`, `B-SYMPTOM`, `I-SYMPTOM`.
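To make the tagging scheme concrete, here is a minimal sketch of how word-level BIO tags collapse into entity spans. The function name `bio_to_spans`, the example sentence, and its tags are illustrative assumptions (hand-written, not model output), and the handling of a stray `I-` tag is one lenient choice among several.

```python
from typing import List, Tuple


def bio_to_spans(tags: List[str]) -> List[Tuple[int, int, str]]:
    """Collapse word-level BIO tags into (start_word, end_word_exclusive, label) spans."""
    spans = []
    start = None  # index of the first word of the currently open span, if any
    label = None  # label of the currently open span, if any
    for i, tag in enumerate(tags):
        # An I- tag continues the open span only if its label matches.
        continues = tag.startswith("I-") and label == tag[2:]
        if start is not None and not continues:
            spans.append((start, i, label))
            start, label = None, None
        if tag.startswith("B-"):
            start, label = i, tag[2:]
        elif tag.startswith("I-") and start is None:
            # Stray I- without a preceding B-: treated as a span start (lenient choice).
            start, label = i, tag[2:]
    if start is not None:
        spans.append((start, len(tags), label))
    return spans


# Hand-written example, not model output.
words = ["Pacientul", "prezintă", "durere", "toracică", "și", "febră", "."]
tags = ["O", "O", "B-SYMPTOM", "I-SYMPTOM", "O", "B-SYMPTOM", "O"]
spans = bio_to_spans(tags)
for s, e, lab in spans:
    print(lab, " ".join(words[s:e]))
```

Here the two-word mention is tagged `B-SYMPTOM` then `I-SYMPTOM`, and the single-word mention gets only `B-SYMPTOM`.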
+## Usage
+This is a token-classification (NER) model. **It was trained on pre-tokenized text** (`is_split_into_words=True`). To get predictions that match our reported scores, reproduce that tokenization at inference: split the text into word and non-word tokens, run the model, then re-align the predicted offsets to the original text. A plain `pipeline(raw_text)` still returns correct character offsets, but its subword tokenization differs from training, so predicted boundaries near punctuation may differ.
+### Recommended: `ner_nlp4bia` (handles pre-tokenization + offset re-alignment)
+```python
+# pip install git+https://github.com/nlp4bia-bsc/ner-nlp4bia.git   # not on PyPI; install from git
+from ner_nlp4bia.data.corpus import Document
+from ner_nlp4bia.inference.pipeline import PipelineInferencer
+inf = PipelineInferencer("BSC-NLP4BIA/multiclinner-ro-symptom-xlm-roberta")          # stride=128, aggregation="first"
+doc = inf.infer_document(Document(filename="d1", text=open("note.txt", encoding="utf-8").read()))
+for a in doc.annotations:
+    print(a.label, a.start, a.end, repr(a.text))   # offsets in original-text coords
+```
+### Plain transformers (you must pre-tokenize and re-align yourself)
 ```python
+import re
 from transformers import pipeline
+TOK = re.compile(r'([0-9A-Za-zÀ-ÖØ-öø-ÿ]+|[^0-9A-Za-zÀ-ÖØ-öø-ÿ])')
+def pretokenize(text):
+    toks = [t for t in TOK.split(text) if t]
+    i = 1
+    while i < len(toks):
+        if not toks[i-1].isspace() and not toks[i].isspace():
+            toks.insert(i, ' '); i += 1
+        i += 1
+    pre = ''.join(toks)
+    oi = pj = 0; ins = []                 # positions of inserted spaces in `pre`
+    while pj < len(pre):
+        if oi < len(text) and text[oi] == pre[pj]:
+            oi += 1; pj += 1
+        else:
+            ins.append(pj); pj += 1
+    return pre, ins
+nlp = pipeline("token-classification", model="BSC-NLP4BIA/multiclinner-ro-symptom-xlm-roberta",
+               aggregation_strategy="first")
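+# Window long docs. RoBERTa-family models (including this XLM-RoBERTa checkpoint) reserve
+# 2 position ids, so the usable window is max_position_embeddings - 2.
+nlp.tokenizer.model_max_length = nlp.model.config.max_position_embeddings - 2
+text = open("note.txt", encoding="utf-8").read()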
+pre, ins = pretokenize(text)
+for e in nlp(pre, stride=128):
+    if e["start"] == e["end"]:
+        continue
+    s = e["start"] - sum(1 for x in ins if x < e["start"])
+    t = e["end"]   - sum(1 for x in ins if x < e["end"])
+    while s < t and text[s].isspace():   s += 1     # trim whitespace the tokenizer attached to the span
+    while t > s and text[t-1].isspace(): t -= 1
+    print(e["entity_group"], s, t, repr(text[s:t]))
 ```
+Notes:
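+- `stride=128` matches the training stride. With a stride set, the pipeline splits long documents into overlapping windows instead of truncating them.
+- RoBERTa/LtgBERT-family checkpoints, including this XLM-RoBERTa model, reserve 2 position ids, so set `model_max_length = max_position_embeddings - 2`.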
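The offset re-alignment above can be checked without loading the model. The sketch below reuses the same word/non-word pattern, records where spaces are inserted, and maps a span found in the pre-tokenized text back to the original string. The helper `to_original` and the example text are hypothetical, and the span is located by string search to stand in for a pipeline prediction.

```python
import re
from bisect import bisect_left

# Same word / non-word pattern as the snippet above.
TOK = re.compile(r'([0-9A-Za-zÀ-ÖØ-öø-ÿ]+|[^0-9A-Za-zÀ-ÖØ-öø-ÿ])')


def pretokenize(text):
    """Return the space-separated text and the sorted positions of inserted spaces."""
    pieces = [t for t in TOK.split(text) if t]
    out, inserted, pos = [], [], 0
    for k, tok in enumerate(pieces):
        # Insert a space between two adjacent non-whitespace tokens.
        if k > 0 and not pieces[k - 1].isspace() and not tok.isspace():
            inserted.append(pos)
            out.append(" ")
            pos += 1
        out.append(tok)
        pos += len(tok)
    return "".join(out), inserted


def to_original(offset, inserted):
    """Map an offset in the pre-tokenized text back to the original text."""
    # Subtract the number of inserted spaces that sit strictly before `offset`.
    return offset - bisect_left(inserted, offset)


text = "Tuse,febra(38C)."
pre, inserted = pretokenize(text)
print(pre)

# Stand-in for a predicted span: locate "febra" in the pre-tokenized text.
start = pre.index("febra")
end = start + len("febra")
s, t = to_original(start, inserted), to_original(end, inserted)
print(s, t, repr(text[s:t]))
```

With this pattern, letters outside its ranges (for example `ă`, `ș`, `ț`) come out as separate tokens, so `febră` becomes `febr ă`. The sketch keeps the pattern unchanged, on the assumption that it mirrors the tokenization used in training.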
 ## Evaluation
 Scores were computed with the official [MultiClinAIEval](https://github.com/nlp4bia-bsc/MultiClinAIEval) library on the MultiClinNER **Romanian** gold-standard test documents for **SYMPTOM** (the shared-task gold set for this language and entity type).