How to use SinaLab/ArabicWojood-FlatNER with Transformers:

```python
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("token-classification", model="SinaLab/ArabicWojood-FlatNER")

# Load model directly
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("SinaLab/ArabicWojood-FlatNER")
model = AutoModelForMaskedLM.from_pretrained("SinaLab/ArabicWojood-FlatNER")
```
Upload 6 files
- README.md +39 -0
- config.json +17 -0
- special_tokens_map.json +1 -0
- tokenizer.json +0 -0
- tokenizer_config.json +1 -0
- vocab.txt +0 -0
README.md
ADDED
---
license: mit
datasets:
- Wojood
tags:
- Named Entity Recognition
- Arabic NER
- Nested NER
language:
- ar
metrics:
- f1
- precision
- recall
pipeline_tag: token-classification
---

## Wojood - Nested/Flat Arabic NER Models
Wojood is a corpus for Arabic nested Named Entity Recognition (NER). Nested entities occur when one entity mention is embedded inside another entity mention. The corpus contains 550K tokens (MSA and dialect). This repo contains the source code to train Wojood nested NER.
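
To make the nested/flat distinction concrete, here is a minimal sketch that represents entity mentions as token spans and checks which ones are embedded in others. The `Span` type, the English example sentence, and the rule "flat means keep only the outermost mentions" are assumptions made for illustration; they are not taken from the Wojood code or annotation guidelines.

```python
from typing import List, NamedTuple


class Span(NamedTuple):
    start: int  # index of the first token (inclusive)
    end: int    # index after the last token (exclusive)
    label: str


def is_nested_in(inner: Span, outer: Span) -> bool:
    """True when `inner` lies inside `outer` and covers a different token range."""
    inside = outer.start <= inner.start and inner.end <= outer.end
    same_range = (inner.start, inner.end) == (outer.start, outer.end)
    return inside and not same_range


def flatten(spans: List[Span]) -> List[Span]:
    """Assumed flat view: drop every mention that is nested in another one."""
    return [s for s in spans if not any(is_nested_in(s, o) for o in spans)]


# Hypothetical example: the ORG mention "Birzeit University" contains
# the GPE mention "Birzeit".
tokens = ["Birzeit", "University", "is", "in", "Palestine"]
spans = [Span(0, 2, "ORG"), Span(0, 1, "GPE"), Span(4, 5, "GPE")]

nested = [s for s in spans if any(is_nested_in(s, o) for o in spans)]
flat = flatten(spans)
```

With these inputs, `nested` holds only the inner `GPE` mention, and `flat` keeps the `ORG` mention plus the standalone `GPE` mention.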

### Online Demo
You can try our model using the demo link below:

https://sina.birzeit.edu/wojood/

https://arxiv.org/abs/2205.09651

https://huggingface.co/aubmindlab/bert-base-arabertv2/tree/main

### Models
* Nested NER (main branch), with a micro-F1 score of 0.909551
* Flat NER (flat branch), with a micro-F1 score of 0.883847
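
For readers unfamiliar with the metric, micro-F1 pools true positives, false positives, and false negatives over all entity types before computing precision and recall. The sketch below shows that arithmetic; the per-label counts are made up for illustration and are unrelated to the scores reported above, and `micro_prf` is a hypothetical helper, not part of the Wojood code.

```python
from typing import Dict, Tuple


def micro_prf(counts: Dict[str, Tuple[int, int, int]]) -> Tuple[float, float, float]:
    """counts maps label -> (tp, fp, fn); returns micro precision, recall, F1."""
    tp = sum(c[0] for c in counts.values())
    fp = sum(c[1] for c in counts.values())
    fn = sum(c[2] for c in counts.values())
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Made-up counts for two entity types (illustration only).
counts = {"PERS": (90, 10, 5), "ORG": (40, 10, 20)}
p, r, f1 = micro_prf(counts)
print(f"precision={p:.4f} recall={r:.4f} micro-F1={f1:.4f}")
```

Because the counts are pooled first, frequent entity types weigh more heavily in micro-F1 than rare ones.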

### Google Colab Notebooks
You can test our model using our Google Colab notebooks:
* Train flat NER: https://gist.github.com/mohammedkhalilia/72c3261734d7715094089bdf4de74b4a
* Evaluate your data using the flat NER model: https://gist.github.com/mohammedkhalilia/c807eb1ccb15416b187c32a362001665
* Train nested NER: https://gist.github.com/mohammedkhalilia/a4d83d4e43682d1efcdf299d41beb3da
* Evaluate your data using the nested NER model: https://gist.github.com/mohammedkhalilia/9134510aa2684464f57de7934c97138b
config.json
ADDED
{
  "architectures": [
    "BertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "type_vocab_size": 2,
  "vocab_size": 64000
}
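
As a rough sanity check on what these values imply, the sketch below derives the per-head dimension and an approximate encoder parameter count from the config fields. It assumes the standard BERT layout (Q/K/V/output projections with biases, a two-layer feed-forward block, and two LayerNorms per layer) and excludes the pooler and any task head; the numbers are an estimate, not a figure reported by the repository.

```python
import json

# Only the fields needed for the estimate, copied from config.json.
config = json.loads("""
{
  "hidden_size": 768,
  "intermediate_size": 3072,
  "max_position_embeddings": 512,
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "type_vocab_size": 2,
  "vocab_size": 64000
}
""")

h = config["hidden_size"]
ff = config["intermediate_size"]

# Each attention head works on an equal slice of the hidden vector.
head_dim = h // config["num_attention_heads"]

# Embeddings: token + position + segment tables, plus one LayerNorm (weight, bias).
token_embeddings = config["vocab_size"] * h
embeddings = (token_embeddings
              + config["max_position_embeddings"] * h
              + config["type_vocab_size"] * h
              + 2 * h)

# One encoder layer under the assumed standard BERT layout.
attention = 4 * (h * h + h)               # Q, K, V and output projections
feed_forward = (h * ff + ff) + (ff * h + h)
layer_norms = 2 * (2 * h)
per_layer = attention + feed_forward + layer_norms

encoder_total = embeddings + config["num_hidden_layers"] * per_layer
print(head_dim, token_embeddings, encoder_total)
```

Under these assumptions the token embedding table alone accounts for roughly a third of the encoder parameters, which follows from the 64000-entry vocabulary.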
special_tokens_map.json
ADDED
{"unk_token": "[UNK]", "sep_token": "[SEP]", "pad_token": "[PAD]", "cls_token": "[CLS]", "mask_token": "[MASK]"}
tokenizer.json
ADDED
The diff for this file is too large to render.
tokenizer_config.json
ADDED
{"do_lower_case": false, "max_len": 512, "do_basic_tokenize": true, "never_split": ["+ك", "+كما", "ك+", "+وا", "+ين", "و+", "+كن", "+ان", "+هم", "+ة", "[بريد]", "لل+", "+ي", "+ت", "+ن", "س+", "ل+", "[مستخدم]", "+كم", "+ا", "ب+", "ف+", "+نا", "+ها", "+ون", "+هما", "ال+", "+ه", "+هن", "+ات", "[رابط]"], "unk_token": "[UNK]", "sep_token": "[SEP]", "pad_token": "[PAD]", "cls_token": "[CLS]", "mask_token": "[MASK]", "tokenize_chinese_chars": true, "strip_accents": null, "special_tokens_map_file": null, "name_or_path": "aubmindlab/bert-large-arabertv2"}
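
The `never_split` setting matters because BERT-style basic tokenization splits tokens on punctuation, which would otherwise break up entries that contain `+` or square brackets. The sketch below is a simplified stand-in for that step, not the Transformers implementation: it splits only on ASCII punctuation, and `basic_tokenize` and `split_on_punct` are hypothetical helpers. The two protected entries are examples of the kind listed in `never_split` (a prefix marker and a bracketed placeholder).

```python
import string
from typing import Iterable, List


def split_on_punct(token: str) -> List[str]:
    """Split a token so that every ASCII punctuation mark becomes its own piece."""
    pieces: List[str] = []
    current = ""
    for ch in token:
        if ch in string.punctuation:
            if current:
                pieces.append(current)
                current = ""
            pieces.append(ch)
        else:
            current += ch
    if current:
        pieces.append(current)
    return pieces


def basic_tokenize(text: str, never_split: Iterable[str] = ()) -> List[str]:
    """Whitespace-split, then punctuation-split every token that is not protected."""
    protected = set(never_split)
    out: List[str] = []
    for tok in text.split():
        out.extend([tok] if tok in protected else split_on_punct(tok))
    return out


text = "ال+ كتاب [رابط]"
unprotected = basic_tokenize(text)
protected = basic_tokenize(text, never_split=["ال+", "[رابط]"])
```

Without protection the prefix marker and the placeholder are each broken into several pieces; with `never_split` they pass through as single tokens, so they can be matched against the vocabulary as whole entries.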
vocab.txt
ADDED
The diff for this file is too large to render.