Feature Extraction
Transformers
PyTorch
TensorFlow
Dutch
roberta
Biomedical entity linking
sapBERT
bioNLP
embeddings
representation learning
text-embeddings-inference
Dutch Biomedical Entity Linking
Summary
- RoBERTa-based base model trained from scratch on Dutch hospital notes (medRoBERTa.nl).
- Second-phase pretraining using self-alignment on a UMLS-derived Dutch biomedical ontology.
- Fine-tuned on an automatically generated, weakly labelled corpus from Wikipedia.
- Evaluation results on the Mantra GSC corpus can be found in the report.
All code for generating the training data, training the model, and evaluating it can be found in the GitHub repository.
Usage
The following script (reused from the original sapBERT repository) computes the embeddings for a list of input entities (strings). It assumes a CUDA-capable GPU is available:
import numpy as np
import torch
from tqdm.auto import tqdm
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("fonshartendorp/dutch_biomedical_entity_linking")
model = AutoModel.from_pretrained("fonshartendorp/dutch_biomedical_entity_linking").cuda()

# replace with your own list of entity names
dutch_biomedical_entities = ["versnelde ademhaling", "Coronavirus infectie", "aandachtstekort/hyperactiviteitstoornis", "hartaanval"]

bs = 128  # batch size during inference
all_embs = []
for i in tqdm(np.arange(0, len(dutch_biomedical_entities), bs)):
    # tokenize one batch, padding/truncating every entity name to 25 tokens
    toks = tokenizer.batch_encode_plus(dutch_biomedical_entities[i:i+bs],
                                       padding="max_length",
                                       max_length=25,
                                       truncation=True,
                                       return_tensors="pt")
    # move the input tensors to the GPU
    toks_cuda = {}
    for k, v in toks.items():
        toks_cuda[k] = v.cuda()
    cls_rep = model(**toks_cuda)[0][:, 0, :]  # use CLS representation as the embedding
    all_embs.append(cls_rep.cpu().detach().numpy())

# stack the per-batch arrays into one (num_entities, hidden_size) matrix
all_embs = np.concatenate(all_embs, axis=0)
To perform (Dutch) biomedical entity linking, follow these steps:
- Request a UMLS (and SNOMED NL) license.
- Precompute embeddings for all entities in the UMLS with the fine-tuned model.
- Compute the embedding of the new, unseen mention with the fine-tuned model.
- Perform a nearest-neighbour search (or query a FAISS index) to link the embedding of the new mention to its most similar embedding from the UMLS.
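The nearest-neighbour step can be sketched with plain NumPy. This is a minimal illustration and not the repository's implementation. It assumes cosine similarity as the similarity measure, and `l2_normalize`, `link_mentions`, and the placeholder concept IDs are hypothetical names introduced here. In practice, `concept_embs` would hold the precomputed UMLS embeddings and `mention_embs` the embeddings of the new mentions, both produced with the fine-tuned model.

```python
import numpy as np


def l2_normalize(x: np.ndarray) -> np.ndarray:
    """Scale each row to unit length so a dot product equals cosine similarity."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.clip(norms, 1e-12, None)


def link_mentions(mention_embs, concept_embs, concept_ids, top_k=1):
    """Return, for each mention, the top_k (concept_id, cosine similarity) pairs."""
    sims = l2_normalize(mention_embs) @ l2_normalize(concept_embs).T
    # Indices of the top_k most similar concepts per mention, best first.
    top_idx = np.argsort(-sims, axis=1)[:, :top_k]
    return [
        [(concept_ids[j], float(sims[i, j])) for j in row]
        for i, row in enumerate(top_idx)
    ]


# Toy 3-dimensional vectors stand in for real model embeddings (assumption).
concept_ids = ["CUI_A", "CUI_B", "CUI_C"]  # hypothetical placeholder identifiers
concept_embs = np.array([[1.0, 0.0, 0.0],
                         [0.0, 1.0, 0.0],
                         [0.0, 0.0, 1.0]])
mention_embs = np.array([[0.1, 0.9, 0.0],
                         [0.8, 0.0, 0.2]])

links = link_mentions(mention_embs, concept_embs, concept_ids, top_k=2)
for mention_links in links:
    print(mention_links)
```

The brute-force similarity matrix has one entry per (mention, concept) pair, so it is only practical for small candidate sets. For an ontology the size of the UMLS, an index such as FAISS, as mentioned in the steps, would take the place of the matrix product.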