File size: 3,302 Bytes
093423f
 
 
 
 
 
 
 
5e72a6f
508dc66
093423f
 
 
 
51e64bd
 
 
 
 
 
 
7414f5a
51e64bd
 
 
 
 
 
 
 
 
 
093423f
 
 
51e64bd
 
 
 
 
 
e2e1d05
093423f
 
 
 
51e64bd
093423f
51e64bd
 
093423f
 
 
 
 
51e64bd
093423f
51e64bd
 
e2e1d05
 
 
 
 
51e64bd
 
 
 
 
 
 
 
 
 
 
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
---
language:
- bn
tags:
- bert
- bangla
- mlm
- nsp
library_name: transformers
#base_model: "banglagov/banBERT-Base"
---

# BERT base model for Bangla

Pretrained [BERT](https://arxiv.org/abs/1810.04805) model for Bangla. BERT (Bidirectional Encoder Representations from Transformers) is a pre-trained language
model introduced by Google's research team. BERT has significantly advanced the
state-of-the-art in various NLP tasks. Unlike traditional language models, BERT is bidirectional,
meaning it takes into account both the left and right contexts of each word during pre-training,
enabling it to better grasp the nuances of language.


## Data Details

We used 36 GB of text data to train the model. The used corpus has the following cardinalities:

| **Type**          | **Count**                             |
|--------------------|---------------------------------------|
| Total words        | 2,202,024,981 (about 2.2 billion)    |
| Unique words       | 22,944,811 (about 22.94 million)     |
| Total sentences    | 181,447,732 (about 181.45 million)   |
| Total documents    | 17,516,890 (about 17.52 million)     |


## Model Details

The core architecture of BERT is based on the Transformer model, which utilizes self-attention
mechanisms to capture long-range dependencies in text efficiently. During pre-training, BERT
learns contextualized word embeddings by predicting missing words within sentences, a
process known as masked language modeling. This allows BERT to understand words in the
context of their surrounding words, leading to more meaningful and context-aware embeddings.

This model is based on the BERT-Base architecture with 12 layers, 768 hidden size, 12 attention heads, and 110 million parameters.

## How to use

```python
from transformers import BertModel, BertTokenizer

model = BertModel.from_pretrained("banglagov/banBERT-Base")
tokenizer = BertTokenizer.from_pretrained("banglagov/banBERT-Base")

text = "আমি বাংলায় পড়ি।"

tokenized_text = tokenizer(text, return_tensors="pt")
outputs = model(**tokenized_text)
print(outputs)
```


## Training Details

The model was trained on a corpus of 36 GB Bangla text data with a vocabulary size of 50k tokens. The model was trained for 1 million steps with a batch size of 440 and a learning rate of 5e-5. The model was trained on two NVIDIA GeForce A40 GPUs.


## Results

| **Metric**          | **Train Loss** | **Eval Loss** | **Perplexity** | **NER**  | **POS**  | **Shallow Parsing** | **QA**  |
|----------------------|----------------|---------------|----------------|----------|----------|----------------------|---------|
| Precision            | -              | -             | -              | 0.8475   | 0.8838   | 0.7396               | -       |
| Recall               | -              | -             | -              | 0.7390   | 0.8543   | 0.6858               | -       |
| Macro F1             | -              | -             | -              | 0.7786   | 0.8611   | 0.7117               | 0.7396  |
| Exact Match          | -              | -             | -              | -        | -        | -                    | 0.6809  |
| Loss                 | 1.8633         | 1.4681        | 4.3826         | -        | -        | -                    | -       |