mukaj/fin-mpnet-base
Tags: Sentence Similarity, sentence-transformers, Safetensors, mpnet, feature-extraction, mteb, financial, fiqa, finance, retrieval, rag, esg, fixed-income, equity, Eval Results (legacy), text-embeddings-inference
Julian Mukaj committed
Commit 0a13784 · Parent(s): 51ecb99
Initial commit
Files changed:
- 1_Pooling/config.json +7 -0
- README.md +135 -0
- config.json +24 -0
- config_sentence_transformers.json +7 -0
- model.safetensors +3 -0
- modules.json +20 -0
- mteb_metadata.md +89 -0
- sentence_bert_config.json +4 -0
- special_tokens_map.json +51 -0
- tokenizer.json +0 -0
- tokenizer_config.json +72 -0
- vocab.txt +0 -0
1_Pooling/config.json
ADDED
@@ -0,0 +1,7 @@
+{
+  "word_embedding_dimension": 768,
+  "pooling_mode_cls_token": false,
+  "pooling_mode_mean_tokens": true,
+  "pooling_mode_max_tokens": false,
+  "pooling_mode_mean_sqrt_len_tokens": false
+}
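The pooling configuration above enables only `pooling_mode_mean_tokens`, so the sentence vector is the average of the token vectors. The following is a minimal pure-Python sketch of masked mean pooling, not the library's implementation; the function name `mean_pool` and the toy 2-dimensional vectors are illustrative assumptions (the real model uses 768 dimensions).

```python
from typing import List


def mean_pool(token_embeddings: List[List[float]], attention_mask: List[int]) -> List[float]:
    """Average the token vectors whose attention-mask entry is 1 (padding is ignored)."""
    dim = len(token_embeddings[0])
    total = [0.0] * dim
    count = 0
    for vector, keep in zip(token_embeddings, attention_mask):
        if keep:
            count += 1
            for i, value in enumerate(vector):
                total[i] += value
    # Guard against an all-padding input to avoid division by zero.
    count = max(count, 1)
    return [value / count for value in total]


# Toy example: three tokens of dimension 2; the last token is padding.
tokens = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]
pooled = mean_pool(tokens, mask)
print(pooled)
```

The padded token is excluded from both the sum and the count, which is why the mask matters when sentences in a batch have different lengths.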
README.md
ADDED
@@ -0,0 +1,135 @@
+---
+pipeline_tag: sentence-similarity
+tags:
+- sentence-transformers
+- feature-extraction
+- sentence-similarity
+tags:
+- mteb
+model-index:
+- name: fin-mpnet-base-v0.1
+  results:
+  - task:
+      type: Classification
+    dataset:
+      type: mteb/banking77
+      name: MTEB Banking77Classification
+      config: default
+      split: test
+      revision: 0fd18e25b25c072e09e0d92ab615fda904d66300
+    metrics:
+    - type: accuracy
+      value: 80.25
+    - type: f1
+      value: 79.64999520103544
+  - task:
+      type: Retrieval
+    dataset:
+      type: fiqa
+      name: MTEB FiQA2018
+      config: default
+      split: test
+      revision: None
+    metrics:
+    - type: map_at_1
+      value: 37.747
+    - type: map_at_10
+      value: 72.223
+    - type: map_at_100
+      value: 73.802
+    - type: map_at_1000
+      value: 73.80499999999999
+    - type: map_at_3
+      value: 61.617999999999995
+    - type: map_at_5
+      value: 67.92200000000001
+    - type: mrr_at_1
+      value: 71.914
+    - type: mrr_at_10
+      value: 80.71000000000001
+    - type: mrr_at_100
+      value: 80.901
+    - type: mrr_at_1000
+      value: 80.901
+    - type: mrr_at_3
+      value: 78.935
+    - type: mrr_at_5
+      value: 80.193
+    - type: ndcg_at_1
+      value: 71.914
+    - type: ndcg_at_10
+      value: 79.912
+    - type: ndcg_at_100
+      value: 82.675
+    - type: ndcg_at_1000
+      value: 82.702
+    - type: ndcg_at_3
+      value: 73.252
+    - type: ndcg_at_5
+      value: 76.36
+    - type: precision_at_1
+      value: 71.914
+    - type: precision_at_10
+      value: 23.071
+    - type: precision_at_100
+      value: 2.62
+    - type: precision_at_1000
+      value: 0.263
+    - type: precision_at_3
+      value: 51.235
+    - type: precision_at_5
+      value: 38.117000000000004
+    - type: recall_at_1
+      value: 37.747
+    - type: recall_at_10
+      value: 91.346
+    - type: recall_at_100
+      value: 99.776
+    - type: recall_at_1000
+      value: 99.897
+    - type: recall_at_3
+      value: 68.691
+    - type: recall_at_5
+      value: 80.742
+---
+---
+
+v0.1 - full evaluation not complete
+# fin-mpnet-base
+
+This is a [sentence-transformers](https://www.SBERT.net) model: it maps sentences and paragraphs to a 768-dimensional dense vector space and can be used for tasks such as clustering or semantic search.
+
+<!--- Describe your model here -->
+
+## Usage (Sentence-Transformers)
+
+Using this model is straightforward once [sentence-transformers](https://www.SBERT.net) is installed:
+
+```
+pip install -U sentence-transformers
+```
+
+Then you can use the model like this:
+
+```python
+from sentence_transformers import SentenceTransformer
+sentences = ["This is an example sentence", "Each sentence is converted"]
+
+model = SentenceTransformer('mukaj/fin-mpnet-base')
+embeddings = model.encode(sentences)
+print(embeddings)
+```
+
+
+## Evaluation Results
+
+The model was evaluated during training only on the new finance QA examples, so only finance-relevant benchmarks (FiQA-2018 and Banking77Classification) were evaluated for v0.1.
+
+The model currently shows the highest FiQA retrieval score on the test set on the MTEB Leaderboard (https://huggingface.co/spaces/mteb/leaderboard).
+
+The model has likely lost some performance on other benchmarks; for example, Banking77Classification dropped from 81.6 to 80.25. This will be addressed in v0.2, and a full evaluation on all sets will be run.
+
+
+## Training
+
+"sentence-transformers/all-mpnet-base-v2" was fine-tuned on 150k financial document QA examples using Multiple Negatives Ranking (MNR) loss.
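The README's training note says the base model was fine-tuned with MNR loss. As a rough illustration of the idea, the sketch below computes a Multiple Negatives Ranking loss in pure Python: for each (query, answer) pair in a batch, every other answer in the batch is treated as a negative, and the loss is the cross-entropy of picking the matching answer. The function names, the toy 2-dimensional vectors, and the similarity scale of 20 are assumptions for illustration; the README does not state the hyperparameters that were actually used.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def mnr_loss(queries: List[Sequence[float]],
             positives: List[Sequence[float]],
             scale: float = 20.0) -> float:
    """Mean cross-entropy where positives[i] is the correct match for queries[i]
    and all other positives in the batch act as in-batch negatives."""
    losses = []
    for i, query in enumerate(queries):
        logits = [scale * cosine(query, candidate) for candidate in positives]
        # Numerically stable log-sum-exp.
        top = max(logits)
        log_sum_exp = top + math.log(sum(math.exp(l - top) for l in logits))
        losses.append(log_sum_exp - logits[i])
    return sum(losses) / len(losses)


# Hypothetical embeddings: aligned pairs should score a lower loss than swapped pairs.
queries = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[0.9, 0.1], [0.1, 0.9]]
swapped = [[0.1, 0.9], [0.9, 0.1]]
loss_aligned = mnr_loss(queries, aligned)
loss_swapped = mnr_loss(queries, swapped)
print(loss_aligned, loss_swapped)
```

Because negatives come from the same batch, this loss only needs positive pairs as training data, which fits a dataset of question and answer examples.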
config.json
ADDED
@@ -0,0 +1,24 @@
+{
+  "_name_or_path": "/root/.cache/torch/sentence_transformers/sentence-transformers_all-mpnet-base-v2/",
+  "architectures": [
+    "MPNetModel"
+  ],
+  "attention_probs_dropout_prob": 0.1,
+  "bos_token_id": 0,
+  "eos_token_id": 2,
+  "hidden_act": "gelu",
+  "hidden_dropout_prob": 0.1,
+  "hidden_size": 768,
+  "initializer_range": 0.02,
+  "intermediate_size": 3072,
+  "layer_norm_eps": 1e-05,
+  "max_position_embeddings": 514,
+  "model_type": "mpnet",
+  "num_attention_heads": 12,
+  "num_hidden_layers": 12,
+  "pad_token_id": 1,
+  "relative_attention_num_buckets": 32,
+  "torch_dtype": "float32",
+  "transformers_version": "4.36.2",
+  "vocab_size": 30527
+}
config_sentence_transformers.json
ADDED
@@ -0,0 +1,7 @@
+{
+  "__version__": {
+    "sentence_transformers": "2.0.0",
+    "transformers": "4.6.1",
+    "pytorch": "1.8.1"
+  }
+}
model.safetensors
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:27a9f9dacab04c4d7ff8f528cb3515a8c8e21ff5d0ac2ddc9f9250efc554ec22
+size 437967672
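The `model.safetensors` entry in this commit is a Git LFS pointer, not the weights themselves: it records the spec version, a SHA-256 object id, and the size in bytes. The sketch below parses those three fields and derives a rough upper bound on the parameter count, assuming every stored value is a 4-byte float32 (as `torch_dtype` in `config.json` suggests). The helper name `parse_lfs_pointer` is hypothetical, and the estimate ignores the small safetensors header, so the true count is slightly lower.

```python
from typing import Dict

# Pointer contents copied from the diff above.
pointer_text = """version https://git-lfs.github.com/spec/v1
oid sha256:27a9f9dacab04c4d7ff8f528cb3515a8c8e21ff5d0ac2ddc9f9250efc554ec22
size 437967672
"""


def parse_lfs_pointer(text: str) -> Dict[str, str]:
    """Split each 'key value' line of a Git LFS pointer into a dictionary."""
    fields = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition(" ")
        fields[key] = value
    return fields


pointer = parse_lfs_pointer(pointer_text)
size_bytes = int(pointer["size"])

# Assumption: all tensors are float32, i.e. 4 bytes per parameter.
bytes_per_param = 4
approx_params = size_bytes // bytes_per_param

print(f"{size_bytes / 1e6:.1f} MB, at most about {approx_params / 1e6:.1f}M float32 parameters")
```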
modules.json
ADDED
@@ -0,0 +1,20 @@
+[
+  {
+    "idx": 0,
+    "name": "0",
+    "path": "",
+    "type": "sentence_transformers.models.Transformer"
+  },
+  {
+    "idx": 1,
+    "name": "1",
+    "path": "1_Pooling",
+    "type": "sentence_transformers.models.Pooling"
+  },
+  {
+    "idx": 2,
+    "name": "2",
+    "path": "2_Normalize",
+    "type": "sentence_transformers.models.Normalize"
+  }
+]
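`modules.json` chains three stages: Transformer, Pooling, and Normalize. One practical consequence of the final Normalize stage is that embeddings have unit length, so a plain dot product gives the same value as cosine similarity. The sketch below checks this on toy vectors in pure Python; the helper names and the 2-dimensional inputs are illustrative assumptions, not part of the library.

```python
import math
from typing import List, Sequence


def l2_normalize(v: Sequence[float]) -> List[float]:
    """Scale a vector to unit Euclidean length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


# Stand-ins for two pooled sentence vectors before the Normalize stage.
a = [3.0, 4.0]
b = [4.0, 3.0]

a_unit, b_unit = l2_normalize(a), l2_normalize(b)
dot_of_units = dot(a_unit, b_unit)
cosine_of_raw = cosine(a, b)
print(dot_of_units, cosine_of_raw)
```

This is why vector indexes that only support inner-product search can still rank by cosine similarity when the stored embeddings are already normalized.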
mteb_metadata.md
ADDED
@@ -0,0 +1,89 @@
+---
+tags:
+- mteb
+model-index:
+- name: fin-mpnet-base-v0.1
+  results:
+  - task:
+      type: Classification
+    dataset:
+      type: mteb/banking77
+      name: MTEB Banking77Classification
+      config: default
+      split: test
+      revision: 0fd18e25b25c072e09e0d92ab615fda904d66300
+    metrics:
+    - type: accuracy
+      value: 80.25
+    - type: f1
+      value: 79.64999520103544
+  - task:
+      type: Retrieval
+    dataset:
+      type: fiqa
+      name: MTEB FiQA2018
+      config: default
+      split: test
+      revision: None
+    metrics:
+    - type: map_at_1
+      value: 37.747
+    - type: map_at_10
+      value: 72.223
+    - type: map_at_100
+      value: 73.802
+    - type: map_at_1000
+      value: 73.80499999999999
+    - type: map_at_3
+      value: 61.617999999999995
+    - type: map_at_5
+      value: 67.92200000000001
+    - type: mrr_at_1
+      value: 71.914
+    - type: mrr_at_10
+      value: 80.71000000000001
+    - type: mrr_at_100
+      value: 80.901
+    - type: mrr_at_1000
+      value: 80.901
+    - type: mrr_at_3
+      value: 78.935
+    - type: mrr_at_5
+      value: 80.193
+    - type: ndcg_at_1
+      value: 71.914
+    - type: ndcg_at_10
+      value: 79.912
+    - type: ndcg_at_100
+      value: 82.675
+    - type: ndcg_at_1000
+      value: 82.702
+    - type: ndcg_at_3
+      value: 73.252
+    - type: ndcg_at_5
+      value: 76.36
+    - type: precision_at_1
+      value: 71.914
+    - type: precision_at_10
+      value: 23.071
+    - type: precision_at_100
+      value: 2.62
+    - type: precision_at_1000
+      value: 0.263
+    - type: precision_at_3
+      value: 51.235
+    - type: precision_at_5
+      value: 38.117000000000004
+    - type: recall_at_1
+      value: 37.747
+    - type: recall_at_10
+      value: 91.346
+    - type: recall_at_100
+      value: 99.776
+    - type: recall_at_1000
+      value: 99.897
+    - type: recall_at_3
+      value: 68.691
+    - type: recall_at_5
+      value: 80.742
+---
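The retrieval metrics listed above (`mrr_at_k`, `recall_at_k`, and so on) are per-query scores averaged over the test queries and reported as percentages. As a hedged illustration of two of them, the sketch below computes reciprocal rank and recall at a cutoff for a single query; the document ids and function names are made up for the example, and this is not the MTEB evaluation code.

```python
from typing import List, Set


def reciprocal_rank_at_k(ranked_ids: List[str], relevant_ids: Set[str], k: int) -> float:
    """1 / rank of the first relevant document within the top k, or 0.0 if none appears."""
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0


def recall_at_k(ranked_ids: List[str], relevant_ids: Set[str], k: int) -> float:
    """Fraction of all relevant documents that appear within the top k."""
    if not relevant_ids:
        return 0.0
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids)
    return hits / len(relevant_ids)


# Hypothetical ranking for one query with two relevant documents.
ranked = ["d7", "d2", "d9", "d4"]
relevant = {"d2", "d4"}

rr = reciprocal_rank_at_k(ranked, relevant, k=10)
recall_3 = recall_at_k(ranked, relevant, k=3)
recall_10 = recall_at_k(ranked, relevant, k=10)
print(rr, recall_3, recall_10)
```

A query with several relevant documents can never reach full recall at a cutoff of 1, which is one plausible reason `recall_at_1` (37.747) sits well below `precision_at_1` (71.914) in the reported results.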
sentence_bert_config.json
ADDED
@@ -0,0 +1,4 @@
+{
+  "max_seq_length": 384,
+  "do_lower_case": false
+}
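`max_seq_length` caps how many tokens of an input reach the encoder; anything longer is cut off, so text past the limit does not influence the embedding. The sketch below is a simplified stand-in for that truncation, assuming the effective limit is the smaller of `max_seq_length` (384, from this file) and the tokenizer's `model_max_length` (512, from `tokenizer_config.json` in the same commit), and that the `<s>` and `</s>` tokens (ids 0 and 2 in `config.json`) count toward the limit. The function name and the input ids are hypothetical.

```python
from typing import List, Sequence

BOS_ID, EOS_ID = 0, 2      # bos_token_id / eos_token_id from config.json
MAX_SEQ_LENGTH = 384       # sentence_bert_config.json
MODEL_MAX_LENGTH = 512     # tokenizer_config.json

# Assumption: the stricter of the two limits applies.
EFFECTIVE_LIMIT = min(MAX_SEQ_LENGTH, MODEL_MAX_LENGTH)


def encode_with_limit(body_ids: Sequence[int], limit: int = EFFECTIVE_LIMIT) -> List[int]:
    """Wrap token ids in <s> ... </s>, dropping trailing ids that exceed the limit."""
    room = limit - 2  # reserve two positions for the special tokens
    return [BOS_ID] + list(body_ids[:room]) + [EOS_ID]


# 500 hypothetical token ids, longer than the limit.
long_input = list(range(1000, 1500))
encoded = encode_with_limit(long_input)
print(len(encoded))
```

For long financial documents this implies splitting the text into chunks before encoding, since only the first chunk's worth of tokens would otherwise be embedded.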
special_tokens_map.json
ADDED
@@ -0,0 +1,51 @@
+{
+  "bos_token": {
+    "content": "<s>",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "cls_token": {
+    "content": "<s>",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "eos_token": {
+    "content": "</s>",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "mask_token": {
+    "content": "<mask>",
+    "lstrip": true,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "pad_token": {
+    "content": "<pad>",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "sep_token": {
+    "content": "</s>",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "unk_token": {
+    "content": "[UNK]",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  }
+}
tokenizer.json
ADDED
The diff for this file is too large to render.
tokenizer_config.json
ADDED
@@ -0,0 +1,72 @@
+{
+  "added_tokens_decoder": {
+    "0": {
+      "content": "<s>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "1": {
+      "content": "<pad>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "2": {
+      "content": "</s>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "3": {
+      "content": "<unk>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "104": {
+      "content": "[UNK]",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "30526": {
+      "content": "<mask>",
+      "lstrip": true,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    }
+  },
+  "bos_token": "<s>",
+  "clean_up_tokenization_spaces": true,
+  "cls_token": "<s>",
+  "do_lower_case": true,
+  "eos_token": "</s>",
+  "mask_token": "<mask>",
+  "max_length": 128,
+  "model_max_length": 512,
+  "pad_to_multiple_of": null,
+  "pad_token": "<pad>",
+  "pad_token_type_id": 0,
+  "padding_side": "right",
+  "sep_token": "</s>",
+  "stride": 0,
+  "strip_accents": null,
+  "tokenize_chinese_chars": true,
+  "tokenizer_class": "MPNetTokenizer",
+  "truncation_side": "right",
+  "truncation_strategy": "longest_first",
+  "unk_token": "[UNK]"
+}
vocab.txt
ADDED
The diff for this file is too large to render.