Commit · f88860a
1 Parent(s): 615c276
Update model card: add full Usage section (pre-tokenization + offset re-alignment)
README.md CHANGED
@@ -41,7 +41,7 @@ model-index:
@@ -56,21 +56,68 @@ model-index:
| 41 |
<summary>Click to expand</summary>
|
| 42 |
|
| 43 |
- [Model description](#model-description)
|
| 44 |
+
- [Usage](#usage)
|
| 45 |
- [Evaluation](#evaluation)
|
| 46 |
- [Limitations and bias](#limitations-and-bias)
|
| 47 |
- [Additional information](#additional-information)
|
|
|
|
| 56 |
## Model description
|
| 57 |
A fine-tuned version of [FacebookAI/xlm-roberta-base](https://huggingface.co/FacebookAI/xlm-roberta-base) for **SYMPTOM Named Entity Recognition** in Romanian clinical text. It was developed for MultiClinNER, a subtask of the [MultiClinAI shared task](https://temu.bsc.es/MultiClinAI/) organized by the NLP4BIA team at the Barcelona Supercomputing Center as part of the #SMM4H-HeaRD Workshop at the ACL 2026 conference. The model labels mentions of **SYMPTOM** using the BIO tagging scheme: `O`, `B-SYMPTOM`, `I-SYMPTOM`.
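
As an illustration of the BIO scheme, the sketch below groups a tag sequence into entity spans. The helper name `bio_to_spans`, the sample sentence, and its tags are hypothetical (hand-written, simplified word tokens, not model output), and treating a stray `I-` tag as the start of a span is one lenient convention among several.

```python
def bio_to_spans(tags):
    """Group BIO tags into (label, start, end) spans over token indices; end is exclusive."""
    spans = []
    start = label = None
    for i, tag in enumerate(tags):
        prefix, _, name = tag.partition("-")
        # An I- tag continues the open span only if the label matches.
        if prefix == "I" and label == name:
            continue
        # Anything else closes the open span, if there is one.
        if label is not None:
            spans.append((label, start, i))
            start = label = None
        # B- opens a span; a stray I- is treated like B- (lenient convention, an assumption).
        if prefix in ("B", "I"):
            start, label = i, name
    if label is not None:
        spans.append((label, start, len(tags)))
    return spans


# Hypothetical example: simplified word tokens with hand-written tags.
tokens = ["Pacientul", "acuză", "durere", "toracică", "și", "febră", "."]
tags = ["O", "O", "B-SYMPTOM", "I-SYMPTOM", "O", "B-SYMPTOM", "O"]
for label, start, end in bio_to_spans(tags):
    print(label, " ".join(tokens[start:end]))
# SYMPTOM durere toracică
# SYMPTOM febră
```
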
|
| 58 |
|
| 59 |
+
## Usage
|
| 60 |
|
| 61 |
+
This is a token-classification (NER) model **trained on pre-tokenized text** (`is_split_into_words=True`). To get predictions that match the reported scores, reproduce that tokenization at inference: split the text into word and non-word tokens, run the model, then re-align the predicted offsets to the original text. A plain `pipeline(raw_text)` still returns correct character offsets, but its subword tokenization differs from training, so predicted boundaries near punctuation may differ.
|
| 62 |
+
|
| 63 |
+
### Recommended: `ner_nlp4bia` (handles pre-tokenization + offset re-alignment)
|
| 64 |
+
|
| 65 |
+
```python
# pip install git+https://github.com/nlp4bia-bsc/ner-nlp4bia.git  # not on PyPI; install from git
from ner_nlp4bia.data.corpus import Document
from ner_nlp4bia.inference.pipeline import PipelineInferencer

inf = PipelineInferencer("BSC-NLP4BIA/multiclinner-ro-symptom-xlm-roberta")  # stride=128, aggregation="first"
doc = inf.infer_document(Document(filename="d1", text=open("note.txt").read()))
for a in doc.annotations:
    print(a.label, a.start, a.end, repr(a.text))  # offsets in original-text coords
```
|
| 75 |
+
|
| 76 |
+
### Plain transformers (you must pre-tokenize and re-align yourself)
|
| 77 |
|
| 78 |
```python
import re
from transformers import pipeline

# One token per run of ASCII/Latin-1 letters and digits, or per single other character.
TOK = re.compile(r'([0-9A-Za-zÀ-ÖØ-öø-ÿ]+|[^0-9A-Za-zÀ-ÖØ-öø-ÿ])')

def pretokenize(text):
    """Return the space-separated text and the positions of the spaces inserted into it."""
    toks = [t for t in TOK.split(text) if t]
    # Insert a space between any two adjacent non-whitespace tokens.
    i = 1
    while i < len(toks):
        if not toks[i-1].isspace() and not toks[i].isspace():
            toks.insert(i, ' ')
            i += 1
        i += 1
    pre = ''.join(toks)
    # Walk both strings in parallel; a character of `pre` with no counterpart in `text` was inserted.
    oi = pj = 0
    ins = []  # positions of inserted spaces in `pre`
    while pj < len(pre):
        if oi < len(text) and text[oi] == pre[pj]:
            oi += 1
        else:
            ins.append(pj)
        pj += 1
    return pre, ins

nlp = pipeline("token-classification", model="BSC-NLP4BIA/multiclinner-ro-symptom-xlm-roberta",
               aggregation_strategy="first")
# Window long docs. RoBERTa/LtgBERT-family models reserve 2 positions, and this
# checkpoint is XLM-RoBERTa, so use max_position_embeddings - 2.
nlp.tokenizer.model_max_length = nlp.model.config.max_position_embeddings - 2

text = open("note.txt").read()
pre, ins = pretokenize(text)
for e in nlp(pre, stride=128):
    if e["start"] == e["end"]:  # skip empty spans
        continue
    # Shift each offset left by the number of spaces inserted before it.
    s = e["start"] - sum(1 for x in ins if x < e["start"])
    t = e["end"] - sum(1 for x in ins if x < e["end"])
    # Trim whitespace the tokenizer may attach to a span (e.g. byte-BPE leading spaces).
    while s < t and text[s].isspace():
        s += 1
    while t > s and text[t-1].isspace():
        t -= 1
    print(e["entity_group"], s, t, repr(text[s:t]))
```
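
To check the pre-tokenization and offset re-alignment without loading the model, the sketch below runs the same splitting logic on a short string and maps a span back to the original text. It repeats `pretokenize` so that it runs on its own. The `to_original` helper and the sample string are hypothetical additions for illustration; `to_original` uses `bisect` to count the inserted spaces that precede an offset, which gives the same result as the `sum(...)` expressions used with the pipeline output.

```python
import re
from bisect import bisect_left

# Same pattern and splitting logic as the README's `pretokenize`.
TOK = re.compile(r'([0-9A-Za-zÀ-ÖØ-öø-ÿ]+|[^0-9A-Za-zÀ-ÖØ-öø-ÿ])')

def pretokenize(text):
    toks = [t for t in TOK.split(text) if t]
    i = 1
    while i < len(toks):
        if not toks[i - 1].isspace() and not toks[i].isspace():
            toks.insert(i, " ")
            i += 1
        i += 1
    pre = "".join(toks)
    oi, ins = 0, []
    for pj, ch in enumerate(pre):
        if oi < len(text) and text[oi] == ch:
            oi += 1          # character also present in the original text
        else:
            ins.append(pj)   # character exists only in `pre`: an inserted space
    return pre, ins

def to_original(pos, ins):
    """Hypothetical helper: map an offset in `pre` back to the original text."""
    # `ins` is sorted, so bisect_left counts the inserted spaces strictly before `pos`.
    return pos - bisect_left(ins, pos)

text = "tuse,febra"  # made-up input with no space after the comma
pre, ins = pretokenize(text)
print(repr(pre), ins)            # 'tuse , febra' [4, 6]

# Pretend the model tagged "febra" in the pre-tokenized string.
start = pre.index("febra")
end = start + len("febra")
s, t = to_original(start, ins), to_original(end, ins)
print(s, t, repr(text[s:t]))     # 5 10 'febra'
```

Note that the character class as written covers ASCII and Latin-1 letters only, so characters outside it (for example `ă`, `ș`, `ț`) are split off as separate tokens: `pretokenize("febră")` returns `"febr ă"`. The re-alignment still maps such spans back to the original offsets.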
|
| 116 |
|
| 117 |
+
Notes:
|
| 118 |
+
- `stride=128` matches the training stride and makes the pipeline process long documents in overlapping windows instead of truncating them.
|
| 119 |
+
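- For RoBERTa/LtgBERT-family checkpoints, which reserve 2 positions, set `model_max_length = max_position_embeddings - 2`. This applies to this model, since XLM-RoBERTa uses the RoBERTa architecture.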
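
The two notes can be made concrete with a small sketch of how a window length and a stride cover a long token sequence. This is a simplified illustration, not the tokenizer's actual implementation: it assumes `stride` is the number of tokens shared by consecutive windows and ignores the special tokens added to each window. The function names and the value 514 for `max_position_embeddings` are assumptions for the example; read the real value from the model config.

```python
def usable_length(max_position_embeddings, reserved=2):
    """Positions left for tokens once the reserved ones are subtracted."""
    return max_position_embeddings - reserved

def windows(n_tokens, max_len, stride):
    """Return (start, end) token ranges of length <= max_len that overlap by `stride`."""
    if not 0 <= stride < max_len:
        raise ValueError("stride must be non-negative and smaller than max_len")
    step = max_len - stride  # how far each window advances
    spans = []
    start = 0
    while True:
        end = min(start + max_len, n_tokens)
        spans.append((start, end))
        if end == n_tokens:
            return spans
        start += step

max_len = usable_length(514)  # assumed config value; 514 - 2 = 512
print(max_len, windows(1000, max_len, stride=128))
# 512 [(0, 512), (384, 896), (768, 1000)]
```

Each token in an overlap region is seen in two windows, so an entity that straddles a window boundary still appears whole in at least one of them, provided it is shorter than the stride.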
|
| 120 |
+
|
| 121 |
## Evaluation
|
| 122 |
|
| 123 |
Scores were computed with the official [MultiClinAIEval](https://github.com/nlp4bia-bsc/MultiClinAIEval) library on the MultiClinNER **Romanian** gold-standard test documents for **SYMPTOM** (the shared-task gold set for this language and entity type).
|