banglagov/banBERT-Base

@@ -10,22 +10,70 @@ tags:
 # BERT base model for Bangla
-Pretrained [BERT](https://arxiv.org/abs/1810.04805) model for Bangla. The model was trained on Masked Language Modeling (MLM) and Next Sentence Prediction (NSP) tasks.
 ## Model Details
 This model is based on the BERT-Base architecture with 12 layers, a hidden size of 768, 12 attention heads, and 110 million parameters. It was trained on a 39 GB Bangla text corpus with a 50k-token vocabulary for 1 million steps, with a batch size of 440 and a learning rate of 5e-5, on two NVIDIA A40 GPUs.
 ## How to use
 ```python
-from transformers import AutoModel, AutoTokenizer
-model = AutoModel.from_pretrained("banglagov/banBERT-Base")
-tokenizer = AutoTokenizer.from_pretrained("banglagov/banBERT-Base")
 text = "আমি বাংলায় পড়ি।"
 tokenized_text = tokenizer(text, return_tensors="pt")
 outputs = model(**tokenized_text)
 ```

 # BERT base model for Bangla
+Pretrained [BERT](https://arxiv.org/abs/1810.04805) model for Bangla. BERT (Bidirectional Encoder Representations from Transformers) is a pre-trained language
+model introduced by Google's research team. BERT has significantly advanced the
+state-of-the-art in various NLP tasks. Unlike traditional language models, BERT is bidirectional,
+meaning it takes into account both the left and right contexts of each word during pre-training,
+enabling it to better grasp the nuances of language.
+## Data Details
+We used 36 GB of text data to train the model. The corpus has the following statistics:
+| **Type**          | **Count**                             |
+|--------------------|---------------------------------------|
+| Total words        | 2,202,024,981 (about 2.2 billion)    |
+| Unique words       | 22,944,811 (about 22.94 million)     |
+| Total sentences    | 181,447,732 (about 181.45 million)   |
+| Total documents    | 17,516,890 (about 17.52 million)     |
+The raw crawled text is pre-processed in several steps to produce the final 36 GB of data. Pre-processing consists of the following steps:
+  - Normalizing the text.
+  - Cleaning the text: removing URLs, HTML tags, emojis, and repeated spaces.
+  - Splitting the text into sentences.
+  - Removing sentences with fewer than 3 words or more than 50 words.
+  - Removing sentences containing any non-Bangla characters.
+  - Deduplicating the corpus at the document level.
+  - Ensuring each text file contains one sentence per line, with documents separated by a blank line.
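
The steps above can be sketched roughly as follows. This is a minimal illustration, not the pipeline actually used to build the corpus: the regular expressions, the use of NFC for normalization, the set of characters treated as "Bangla" (the Bengali Unicode block plus danda, whitespace, and basic punctuation), and all function names are assumptions.

```python
import re
import unicodedata

# All patterns below are illustrative assumptions, not the authors' rules.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
TAG_RE = re.compile(r"<[^>]+>")
EMOJI_RE = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")  # not exhaustive
SPACE_RE = re.compile(r"\s+")
SENT_SPLIT_RE = re.compile(r"(?<=[।?!])\s+")  # split after danda, ?, or !
BANGLA_ONLY_RE = re.compile(r"^[\u0980-\u09FF\u0964\u0965\s.,!?;:()\-]+$")


def clean_text(text):
    """Normalize, then strip URLs, HTML tags, emojis, and repeated spaces."""
    text = unicodedata.normalize("NFC", text)
    text = URL_RE.sub(" ", text)
    text = TAG_RE.sub(" ", text)
    text = EMOJI_RE.sub(" ", text)
    return SPACE_RE.sub(" ", text).strip()


def keep_sentence(sentence):
    """Keep sentences of 3 to 50 words made only of the assumed Bangla set."""
    n_words = len(sentence.split())
    return 3 <= n_words <= 50 and BANGLA_ONLY_RE.match(sentence) is not None


def preprocess(documents):
    """Return one sentence per line, with documents separated by a blank line."""
    seen = set()
    kept_docs = []
    for doc in documents:
        sentences = [s for s in SENT_SPLIT_RE.split(clean_text(doc))
                     if keep_sentence(s)]
        key = tuple(sentences)
        if not sentences or key in seen:  # document-level deduplication
            continue
        seen.add(key)
        kept_docs.append("\n".join(sentences))
    return "\n\n".join(kept_docs)
```

Calling `preprocess` on a list of raw document strings returns a single string in the one-sentence-per-line layout described in the last step.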
 ## Model Details
+The core architecture of BERT is based on the Transformer model, which utilizes self-attention
+mechanisms to capture long-range dependencies in text efficiently. During pre-training, BERT
+learns contextualized word embeddings by predicting missing words within sentences, a
+process known as masked language modeling. This allows BERT to understand words in the
+context of their surrounding words, leading to more meaningful and context-aware embeddings.
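
As an illustration of masked language modeling, the sketch below corrupts a token sequence in the way the original BERT paper describes: about 15% of tokens are selected, and of those 80% become `[MASK]`, 10% become a random token, and 10% stay unchanged. Those rates come from the original paper and are not confirmed for this checkpoint; the function name `mask_tokens` and the toy vocabulary are hypothetical.

```python
import random


def mask_tokens(tokens, vocab, mask_rate=0.15, rng=None):
    """Return (inputs, labels) for MLM; labels are None where nothing is predicted."""
    rng = rng or random.Random()
    inputs, labels = [], []
    for tok in tokens:
        if rng.random() < mask_rate:
            labels.append(tok)  # the model must recover the original token
            roll = rng.random()
            if roll < 0.8:
                inputs.append("[MASK]")           # 80%: replace with mask token
            elif roll < 0.9:
                inputs.append(rng.choice(vocab))  # 10%: replace with random token
            else:
                inputs.append(tok)                # 10%: keep the token as-is
        else:
            labels.append(None)
            inputs.append(tok)
    return inputs, labels


tokens = ["আমি", "বাংলায়", "পড়ি", "।"]
toy_vocab = ["আমি", "তুমি", "ভাত", "বই"]  # hypothetical vocabulary
inputs, labels = mask_tokens(tokens, toy_vocab, rng=random.Random(0))
```

The loss is computed only at positions where `labels` is not `None`, which is what forces the model to use context from both sides of a masked position.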
 This model is based on the BERT-Base architecture with 12 layers, a hidden size of 768, 12 attention heads, and 110 million parameters. It was trained on a 39 GB Bangla text corpus with a 50k-token vocabulary for 1 million steps, with a batch size of 440 and a learning rate of 5e-5, on two NVIDIA A40 GPUs.
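
To see how the architecture numbers relate to the parameter count, the sketch below adds up the weights of a BERT-Base encoder. It assumes the standard BERT-Base settings that this card does not list (512 positions, 2 segment types, a feed-forward size of 3072, and a pooler layer); `bert_param_count` is a hypothetical helper, and the exact vocabulary size of this checkpoint is only given as "50k", so the result is an estimate.

```python
def bert_param_count(vocab_size, hidden=768, layers=12, intermediate=3072,
                     max_positions=512, type_vocab=2):
    """Count weights and biases of a BERT encoder with a pooler (no task head)."""
    # Word, position, and segment embeddings, plus one LayerNorm (scale + bias).
    embeddings = (vocab_size + max_positions + type_vocab) * hidden + 2 * hidden
    # Query, key, value, and output projections, plus one LayerNorm.
    attention = 4 * (hidden * hidden + hidden) + 2 * hidden
    # Two feed-forward projections, plus one LayerNorm.
    feed_forward = ((hidden * intermediate + intermediate)
                    + (intermediate * hidden + hidden)
                    + 2 * hidden)
    pooler = hidden * hidden + hidden
    return embeddings + layers * (attention + feed_forward) + pooler


print(bert_param_count(30522))  # original BERT-Base vocabulary size
print(bert_param_count(50000))  # assuming exactly 50,000 tokens
```

With the original 30,522-token vocabulary this gives about 109.5 million, the usual "110 million" figure. With a 50,000-token vocabulary the same formula gives about 124 million, because the word embedding table grows, so the exact count for this checkpoint depends on its actual vocabulary size.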
 ## How to use
 ```python
+from transformers import BertModel, BertTokenizer
+model = BertModel.from_pretrained("banglagov/banBERT-Base")
+tokenizer = BertTokenizer.from_pretrained("banglagov/banBERT-Base")
 text = "আমি বাংলায় পড়ি।"
 tokenized_text = tokenizer(text, return_tensors="pt")
 outputs = model(**tokenized_text)
+print(outputs)
 ```
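
The model returns one vector per token in `outputs.last_hidden_state`. A common way to turn these into a single sentence vector is masked mean pooling, sketched below. This is one option among several, not a method prescribed by this card; `mean_pool` and `embed` are hypothetical helper names.

```python
import torch


def mean_pool(last_hidden_state, attention_mask):
    """Average token vectors, ignoring padded positions."""
    mask = attention_mask.unsqueeze(-1).to(last_hidden_state.dtype)  # (B, T, 1)
    summed = (last_hidden_state * mask).sum(dim=1)                   # (B, H)
    counts = mask.sum(dim=1).clamp(min=1.0)                          # avoid divide-by-zero
    return summed / counts


def embed(texts, model, tokenizer):
    """Encode a list of strings into one vector each using mean pooling."""
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():  # inference only, no gradients needed
        outputs = model(**batch)
    return mean_pool(outputs.last_hidden_state, batch["attention_mask"])
```

With the `model` and `tokenizer` loaded as shown above, `embed(["আমি বাংলায় পড়ি।"], model, tokenizer)` should return a tensor of shape `(1, 768)`, given the hidden size of 768.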
+## Results
+| **Metric**          | **Train Loss** | **Eval Loss** | **Perplexity** | **NER**  | **POS**  | **Shallow Parsing** | **QA**  |
+|----------------------|----------------|---------------|----------------|----------|----------|----------------------|---------|
+| Precision            | -              | -             | -              | 0.8475   | 0.8838   | 0.7396               | -       |
+| Recall               | -              | -             | -              | 0.7390   | 0.8543   | 0.6858               | -       |
+| Macro F1             | -              | -             | -              | 0.7786   | 0.8611   | 0.7117               | 0.7396  |
+| Exact Match          | -              | -             | -              | -        | -        | -                    | 0.6809  |
+| Loss                 | 1.8633         | 1.4681        | 4.3826         | -        | -        | -                    | -       |
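
When reading the table, note that a macro F1 score is generally not the harmonic mean of the reported precision and recall, because F1 is computed per class and then averaged. The sketch below shows this with made-up counts. It assumes precision and recall are also macro-averaged, which the card does not state; the labels and counts are hypothetical and unrelated to the results above.

```python
def macro_precision_recall_f1(counts):
    """counts maps label -> (true_positives, false_positives, false_negatives)."""
    precisions, recalls, f1s = [], [], []
    for tp, fp, fn in counts.values():
        p = tp / (tp + fp) if tp + fp else 0.0
        r = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * p * r / (p + r) if p + r else 0.0
        precisions.append(p)
        recalls.append(r)
        f1s.append(f1)
    n = len(counts)
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n


# Hypothetical per-class counts for a two-label tagging task.
toy_counts = {"PER": (8, 2, 2), "LOC": (1, 0, 9)}
precision, recall, macro_f1 = macro_precision_recall_f1(toy_counts)
harmonic = 2 * precision * recall / (precision + recall)
print(precision, recall, macro_f1, harmonic)
```

In this toy case the macro precision is 0.9 and the macro recall is 0.45, whose harmonic mean is 0.6, while the macro F1 is about 0.49.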