janrodriguez committed on
Commit f88860a · 1 Parent(s): 615c276

Update model card: add full Usage section (pre-tokenization + offset re-alignment)

Files changed (1)
  1. README.md +56 -9
README.md CHANGED
@@ -41,7 +41,7 @@ model-index:
  <summary>Click to expand</summary>

  - [Model description](#model-description)
- - [How to use](#how-to-use)
  - [Evaluation](#evaluation)
  - [Limitations and bias](#limitations-and-bias)
  - [Additional information](#additional-information)
@@ -56,21 +56,68 @@ model-index:
  ## Model description
  A fine-tuned version of [FacebookAI/xlm-roberta-base](https://huggingface.co/FacebookAI/xlm-roberta-base) for **SYMPTOM Named Entity Recognition** in Romanian clinical text. It was developed for MultiClinNER, a subtask of the [MultiClinAI shared task](https://temu.bsc.es/MultiClinAI/) organized by the NLP4BIA team at the Barcelona Supercomputing Center as part of the #SMM4H-HeaRD Workshop at the ACL 2026 conference. The model labels mentions of **SYMPTOM** using the BIO tagging scheme: `O`, `B-SYMPTOM`, `I-SYMPTOM`.

- ## How to use

- We recommend pre-tokenizing the input text into words instead of providing it directly to the model, as this is how the model was trained. Otherwise, the results and performance might get affected.

  ```python
  from transformers import pipeline

- ner = pipeline(
-     "token-classification",
-     model="BSC-NLP4BIA/multiclinner-ro-symptom-xlm-roberta",
-     aggregation_strategy="simple",
- )
- ner("...clinical text in Romanian...")
  ```

  ## Evaluation

  Scores were computed with the official [MultiClinAIEval](https://github.com/nlp4bia-bsc/MultiClinAIEval) library on the MultiClinNER **Romanian** gold-standard test documents for **SYMPTOM** (the shared-task gold set for this language and entity type).
 
  <summary>Click to expand</summary>

  - [Model description](#model-description)
+ - [Usage](#usage)
  - [Evaluation](#evaluation)
  - [Limitations and bias](#limitations-and-bias)
  - [Additional information](#additional-information)
 
  ## Model description
  A fine-tuned version of [FacebookAI/xlm-roberta-base](https://huggingface.co/FacebookAI/xlm-roberta-base) for **SYMPTOM Named Entity Recognition** in Romanian clinical text. It was developed for MultiClinNER, a subtask of the [MultiClinAI shared task](https://temu.bsc.es/MultiClinAI/) organized by the NLP4BIA team at the Barcelona Supercomputing Center as part of the #SMM4H-HeaRD Workshop at the ACL 2026 conference. The model labels mentions of **SYMPTOM** using the BIO tagging scheme: `O`, `B-SYMPTOM`, `I-SYMPTOM`.

+ ## Usage

+ This is a token-classification (NER) model. **It was trained on pre-tokenized text** (`is_split_into_words=True`). For predictions that match our reported scores, reproduce that tokenization at inference (split into word/non-word tokens, run, then re-align offsets). A plain `pipeline(raw_text)` still gives correct character offsets, but its subword tokenization differs from training, so boundary predictions near punctuation may differ.
+
+ ### Recommended: `ner_nlp4bia` (handles pre-tokenization + offset re-alignment)
+
+ ```python
+ # pip install git+https://github.com/nlp4bia-bsc/ner-nlp4bia.git  # not on PyPI; install from git
+ from ner_nlp4bia.data.corpus import Document
+ from ner_nlp4bia.inference.pipeline import PipelineInferencer
+
+ inf = PipelineInferencer("BSC-NLP4BIA/multiclinner-ro-symptom-xlm-roberta")  # stride=128, aggregation="first"
+ doc = inf.infer_document(Document(filename="d1", text=open("note.txt").read()))
+ for a in doc.annotations:
+     print(a.label, a.start, a.end, repr(a.text))  # offsets in original-text coords
+ ```
+
+ ### Plain transformers (you must pre-tokenize and re-align yourself)

  ```python
+ import re
  from transformers import pipeline

+ TOK = re.compile(r'([0-9A-Za-zÀ-ÖØ-öø-ÿ]+|[^0-9A-Za-zÀ-ÖØ-öø-ÿ])')
+
+ def pretokenize(text):
+     toks = [t for t in TOK.split(text) if t]
+     i = 1
+     while i < len(toks):
+         if not toks[i-1].isspace() and not toks[i].isspace():
+             toks.insert(i, ' '); i += 1
+         i += 1
+     pre = ''.join(toks)
+     oi = pj = 0; ins = []  # positions of inserted spaces in `pre`
+     while pj < len(pre):
+         if oi < len(text) and text[oi] == pre[pj]:
+             oi += 1; pj += 1
+         else:
+             ins.append(pj); pj += 1
+     return pre, ins
+
+ nlp = pipeline("token-classification", model="BSC-NLP4BIA/multiclinner-ro-symptom-xlm-roberta",
+                aggregation_strategy="first")
+ # window long docs (RoBERTa/LtgBERT reserve 2 positions; this XLM-R checkpoint is RoBERTa-family -> mpe-2)
+ nlp.tokenizer.model_max_length = nlp.model.config.max_position_embeddings - 2
+
+ text = open("note.txt").read()
+ pre, ins = pretokenize(text)
+ for e in nlp(pre, stride=128):
+     if e["start"] == e["end"]:
+         continue
+     s = e["start"] - sum(1 for x in ins if x < e["start"])
+     t = e["end"] - sum(1 for x in ins if x < e["end"])
+     while s < t and text[s].isspace(): s += 1  # byte-BPE leading-space trim
+     while t > s and text[t-1].isspace(): t -= 1
+     print(e["entity_group"], s, t, repr(text[s:t]))
  ```

+ Notes:
+ - `stride=128` matches the training stride; prevents truncation of long documents.
+ - For RoBERTa/LtgBERT-family checkpoints set `model_max_length = max_position_embeddings - 2`.
+
  ## Evaluation

  Scores were computed with the official [MultiClinAIEval](https://github.com/nlp4bia-bsc/MultiClinAIEval) library on the MultiClinNER **Romanian** gold-standard test documents for **SYMPTOM** (the shared-task gold set for this language and entity type).
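
The offset re-alignment described in the Usage section can be checked without loading the model. The following is a minimal sketch, not the authors' implementation: it reuses the same word/non-word regex, records the inserted-space positions while building the spaced text, and maps offsets back with a hypothetical helper `to_original` (a name introduced here for illustration). The sample string and the span offsets are made up.

```python
import re
from bisect import bisect_left

# Same word / non-word pattern as the README snippet in this commit.
TOK = re.compile(r'([0-9A-Za-zÀ-ÖØ-öø-ÿ]+|[^0-9A-Za-zÀ-ÖØ-öø-ÿ])')


def pretokenize(text):
    """Return (pre, ins): the spaced-out text and the sorted indices in
    `pre` of every space that was inserted (not present in `text`)."""
    toks = [t for t in TOK.split(text) if t]
    out, ins, pos = [], [], 0
    for k, tok in enumerate(toks):
        # Insert a space only between two adjacent non-whitespace tokens.
        if k and not toks[k - 1].isspace() and not tok.isspace():
            ins.append(pos)
            out.append(' ')
            pos += 1
        out.append(tok)
        pos += len(tok)
    return ''.join(out), ins


def to_original(offset, ins):
    """Hypothetical helper: map a character offset in `pre` back to the
    original text by subtracting the inserted spaces that precede it."""
    return offset - bisect_left(ins, offset)


text = "febra(38,5)"           # made-up sample input
pre, ins = pretokenize(text)   # pre == "febra ( 38 , 5 )"

# Suppose a model predicted the span pre[8:14] == "38 , 5".
start, end = to_original(8, ins), to_original(14, ins)
print(start, end, repr(text[start:end]))
```

Because `ins` is sorted, `bisect_left(ins, offset)` counts the inserted spaces strictly before `offset`, which is the same quantity the README loop computes with `sum(1 for x in ins if x < offset)`. Note that the regex class stops at `ÿ` (U+00FF), so letters such as `ă` (U+0103) are split off as single-character tokens.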