---
language:
  - bn
license: apache-2.0
library_name: transformers
tags:
  - text-generation-inference
datasets:
  - Polygl0t/gigakriya-v1
metrics:
  - perplexity
pipeline_tag: text-generation
model-index:
  - name: GigaKriya-ablation-NonEDU-1.5B
    results:
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: ARC Challenge (Bengali)
          type: Polygl0t/ARC-poly
          split: test
          args:
            num_few_shot: 5
        metrics:
          - type: acc_norm
            value: 24.29
            name: accuracy (normalized)
        source:
          url: https://github.com/Polygl0t/lm-evaluation-harness
          name: Language Model Evaluation Harness (branch=polyglot_harness_bengali)
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: HellaSwag (Bengali)
          type: Polygl0t/HellaSwag-poly
          split: validation
          args:
            num_few_shot: 5
        metrics:
          - type: acc_norm
            value: 29.13
            name: accuracy (normalized)
        source:
          url: https://github.com/Polygl0t/lm-evaluation-harness
          name: Language Model Evaluation Harness (branch=polyglot_harness_bengali)
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: MMLU (Bengali)
          type: Polygl0t/MMLU-poly
          split: test
          args:
            num_few_shot: 5
        metrics:
          - type: acc
            value: 24.49
            name: accuracy
        source:
          url: https://github.com/Polygl0t/lm-evaluation-harness
          name: Language Model Evaluation Harness (branch=polyglot_harness_bengali)
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: BOOLQ (Bengali)
          type: Polygl0t/BOOLQ
          split: test
          args:
            num_few_shot: 5
        metrics:
          - type: acc_norm
            value: 51.85
            name: accuracy (normalized)
        source:
          url: https://github.com/Polygl0t/lm-evaluation-harness
          name: Language Model Evaluation Harness (branch=polyglot_harness_bengali)
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: PIQA (Bengali)
          type: Polygl0t/PIQA
          split: test
          args:
            num_few_shot: 5
        metrics:
          - type: acc
            value: 48.96
            name: accuracy
        source:
          url: https://github.com/Polygl0t/lm-evaluation-harness
          name: Language Model Evaluation Harness (branch=polyglot_harness_bengali)
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: OpenBookQA (Bengali)
          type: Polygl0t/OpenBookQA
          split: test
          args:
            num_few_shot: 5
        metrics:
          - type: acc_norm
            value: 20.72
            name: accuracy (normalized)
        source:
          url: https://github.com/Polygl0t/lm-evaluation-harness
          name: Language Model Evaluation Harness (branch=polyglot_harness_bengali)
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: CommonsenseQA (Bengali)
          type: Polygl0t/CommonsenseQA
          split: test
          args:
            num_few_shot: 5
        metrics:
          - type: acc_norm
            value: 28.09
            name: accuracy (normalized)
        source:
          url: https://github.com/Polygl0t/lm-evaluation-harness
          name: Language Model Evaluation Harness (branch=polyglot_harness_bengali)
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: Bangla MMLU
          type: Polygl0t/BanglaMMLU
          split: test
          args:
            num_few_shot: 5
        metrics:
          - type: acc_norm
            value: 24.74
            name: accuracy (normalized)
        source:
          url: https://github.com/Polygl0t/lm-evaluation-harness
          name: Language Model Evaluation Harness (branch=polyglot_harness_bengali)
---

# GigaKriya-ablation-NonEDU-1.5B

## Model Summary

**[GigaKriya-ablation-NonEDU-1.5B](https://huggingface.co/Polygl0t/GigaKriya-ablation-NonEDU-1.5B)** is a decoder-only Transformer pretrained natively in Bengali. It is part of an ablation study measuring how our educational data filtering/augmentation strategy affects the downstream performance of models trained on [GigaKriya](https://huggingface.co/datasets/Polygl0t/GigaKriya-v1). The model was trained on ~34 billion tokens drawn from the non-educational portion of GigaKriya (i.e., samples with an Edu Score < 3). It has 1.5 billion parameters and a context length of 4096 tokens.

## Details

- **Architecture:** a Transformer-based model ([`llama`](https://huggingface.co/docs/transformers/main/en/model_doc/llama))
- **Size:** 1,510,066,176 parameters
- **Context length:** 4096 tokens
- **Dataset(s):**
  - [GigaKriya](https://huggingface.co/datasets/Polygl0t/GigaKriya-v1) (non-educational subset, Edu Score < 3)
- **Language(s):** Bengali
- **Batch size:** 2,097,152 tokens
- **Number of steps:** 16,000
- **GPU:** 16 NVIDIA A40 (48 GB)
- **Training time:** ~60.49 hours
- **Emissions:** 94.44 kg CO2 (Germany)
- **Total energy consumption:** 247.90 kWh
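As a quick sanity check, the figures in the list above can be combined to recover the approximate token budget and a few derived quantities. This is only a back-of-the-envelope sketch: it assumes every step processes a full batch, that sequences are packed to the full context length, and that the training time is wall-clock time across all GPUs.

```python
# Figures copied from the Details list above.
batch_size_tokens = 2_097_152  # tokens per optimizer step
num_steps = 16_000
context_length = 4_096
num_gpus = 16
training_hours = 60.49         # assumed to be wall-clock hours
energy_kwh = 247.90
emissions_kg = 94.44

total_tokens = batch_size_tokens * num_steps
# Assumes sequences are packed to the full context length.
sequences_per_step = batch_size_tokens // context_length
gpu_hours = num_gpus * training_hours
kg_co2_per_kwh = emissions_kg / energy_kwh

print(f"Total tokens:        {total_tokens:,} (~{total_tokens / 1e9:.1f}B)")
print(f"Sequences per step:  {sequences_per_step}")
print(f"GPU-hours:           {gpu_hours:.2f}")
print(f"Implied kg CO2/kWh:  {kg_co2_per_kwh:.3f}")
```

The product of batch size and step count comes to roughly 33.6 billion tokens, which is consistent with the ~34 billion tokens stated in the summary.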

The [source code](https://github.com/Polygl0t/llm-foundry) used to train this model is available on GitHub. The complete training configuration is available in the following config file:

- Single stage (linear warmup with cosine decay): [training_config.yaml](training_config.yaml)
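For readers unfamiliar with this schedule, the sketch below shows the general shape of a linear warmup followed by cosine decay. Only the total of 16,000 steps comes from this card; the peak learning rate, warmup length, and minimum learning rate are hypothetical placeholders, and the values actually used should be read from `training_config.yaml`.

```python
import math


def lr_at_step(step, peak_lr, warmup_steps, total_steps, min_lr=0.0):
    """Linear warmup from 0 to peak_lr, then cosine decay down to min_lr."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return min_lr + (peak_lr - min_lr) * cosine


TOTAL_STEPS = 16_000  # from the Details list
PEAK_LR = 3e-4        # hypothetical value, for illustration only
WARMUP_STEPS = 1_000  # hypothetical value, for illustration only
MIN_LR = 3e-5         # hypothetical value, for illustration only

for step in (0, 500, 1_000, 8_500, 16_000):
    lr = lr_at_step(step, PEAK_LR, WARMUP_STEPS, TOTAL_STEPS, MIN_LR)
    print(f"step {step:>6}: lr = {lr:.2e}")
```

The learning rate rises linearly during warmup, reaches its peak at the last warmup step, and then follows half a cosine period down to the minimum at the final step.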

The main branch of this repository contains the final checkpoint saved at step 16,000. All other checkpoints are available as separate branches. To load a specific checkpoint, you can use the following code snippet:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Polygl0t/GigaKriya-ablation-NonEDU-1.5B"
revision = "step-2000"  # Change this to the desired checkpoint branch
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, revision=revision)
```

Alternatively, you can list all available revisions of the model with the following code snippet:

```python
from huggingface_hub import list_repo_refs
out = list_repo_refs("Polygl0t/GigaKriya-ablation-NonEDU-1.5B")
branches = [b.name for b in out.branches]
print(branches)
```
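Branch names come back in no particular order, so it can help to sort them by training step before iterating over checkpoints. The helper below is a small sketch that assumes checkpoint branches follow the `step-<N>` pattern seen in `step-2000`; the function name and the example list are illustrative, not part of the repository.

```python
import re


def sort_checkpoint_branches(branch_names):
    """Return (step, name) pairs for branches named like 'step-2000', sorted by step."""
    pattern = re.compile(r"^step-(\d+)$")
    checkpoints = []
    for name in branch_names:
        match = pattern.match(name)
        if match:  # skips 'main' and any branch that does not match the pattern
            checkpoints.append((int(match.group(1)), name))
    return sorted(checkpoints)


# Hypothetical branch list; in practice, pass the `branches` list built above.
example_branches = ["main", "step-10000", "step-2000", "step-4000"]
sorted_checkpoints = sort_checkpoint_branches(example_branches)
print(sorted_checkpoints)
```

Each returned branch name can then be passed as the `revision` argument of `from_pretrained` to load the corresponding checkpoint.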

## Intended Uses

The primary intended use of this model is to serve as a baseline for evaluating the impact of data quality and filtering on Bengali language model performance. Researchers and practitioners can use this model as a reference point for further ablation studies or for comparison with other models trained on different data mixtures.

## Basic usage

```python
from transformers import GenerationConfig, TextGenerationPipeline, AutoTokenizer, AutoModelForCausalLM
import torch

# Specify the model and tokenizer
model_id = "Polygl0t/GigaKriya-ablation-NonEDU-1.5B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# Specify the generation parameters as you like
generation_config = GenerationConfig(
    do_sample=True,
    max_new_tokens=150,
    renormalize_logits=True,
    repetition_penalty=1.2,
    temperature=0.1,
    top_k=50,
    top_p=1.0,
    use_cache=True,
)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
generator = TextGenerationPipeline(model=model, task="text-generation", tokenizer=tokenizer, device=device)

# Generate text
prompt = "ভারতের রাজধানী কী ?"
completion = generator(prompt, generation_config=generation_config)
print(completion[0]['generated_text'])
```
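To build intuition for the `temperature` and `top_k` settings used above, the following toy example applies both to a handful of made-up logits. It is a simplified illustration in plain Python, not the `transformers` implementation, and the logits are hypothetical.

```python
import math


def sampling_distribution(logits, temperature=1.0, top_k=None):
    """Scale logits by 1/temperature, keep the top_k largest, and apply softmax."""
    scaled = [x / temperature for x in logits]
    if top_k is not None and top_k < len(scaled):
        # Mask out everything below the k-th largest scaled logit.
        threshold = sorted(scaled, reverse=True)[top_k - 1]
        scaled = [x if x >= threshold else float("-inf") for x in scaled]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


toy_logits = [2.0, 1.5, 0.5, -1.0]  # hypothetical next-token logits

for temperature in (1.0, 0.1):
    probs = sampling_distribution(toy_logits, temperature=temperature, top_k=3)
    print(f"temperature={temperature}: {[round(p, 4) for p in probs]}")
```

With a low temperature such as 0.1, almost all probability mass lands on the highest-scoring token, so sampling behaves close to greedy decoding. Raising the temperature produces more varied completions.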

## Evaluations

The figures below compare the per-benchmark performance of [GigaKriya-ablation-EDU-1.5B](https://huggingface.co/Polygl0t/GigaKriya-ablation-EDU-1.5B) (educational subset, Edu Score >= 3) with that of [GigaKriya-ablation-NonEDU-1.5B](https://huggingface.co/Polygl0t/GigaKriya-ablation-NonEDU-1.5B) (non-educational subset, Edu Score < 3). *GigaKriya-Edu* outperforms *GigaKriya-NonEdu* on 7 of 8 benchmarks and achieves a higher NPM score. These results suggest that training on educationally curated content yields stronger language understanding on most of the benchmarks evaluated.

<details>
<summary><b>🏆 HellaSwag</b></summary>
  
![hellaswag](./.plots/hellaswag.png)
</details>

<details>
<summary><b>🏆 ARC Challenge</b></summary>
  
![arc_challenge](./.plots/arc.png)
</details>

<details>
<summary><b>🏆 MMLU</b></summary>
  
![mmlu](./.plots/mmlu.png)
</details>

<details>
<summary><b>🏆 Bangla MMLU</b></summary>
  
![bangla_mmlu](./.plots/bangla_mmlu.png)
</details>

<details>
<summary><b>🏆 BoolQ</b></summary>
  
![boolq](./.plots/boolq.png)
</details>

<details>
<summary><b>🏆 PIQA</b></summary>
  
![piqa](./.plots/piqa.png)
</details>

<details>
<summary><b>🏆 CommonsenseQA</b></summary>
  
![commonsense_qa](./.plots/commonsens_qa.png)
</details>

<details>
<summary><b>🏆 OpenBookQA</b></summary>
  
![openbook_qa](./.plots/openbook_qa.png)
</details>


<details>
<summary><b>Aggregate NPM Across Benchmarks</b></summary>
  
![NPM](./.plots/npm.png)
</details>
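
This card does not define how the aggregate NPM is computed. The sketch below assumes it stands for the Normalized Preferred Metric, which rescales each benchmark score so that the random baseline maps to 0 and a perfect score to 100, and then averages across benchmarks. The scores are the 5-shot results reported in this card's metadata for the NonEDU model; the random baselines are assumptions based on the usual number of answer choices per task and may differ from those used for the plot above.

```python
# (reported score, assumed random-guess baseline), both in percent.
results = {
    "ARC Challenge": (24.29, 25.0),
    "HellaSwag": (29.13, 25.0),
    "MMLU": (24.49, 25.0),
    "BoolQ": (51.85, 50.0),
    "PIQA": (48.96, 50.0),
    "OpenBookQA": (20.72, 25.0),
    "CommonsenseQA": (28.09, 20.0),
    "Bangla MMLU": (24.74, 25.0),
}


def normalized_preferred_metric(results, max_score=100.0):
    """Average of per-task scores rescaled so that random = 0 and perfect = 100."""
    normalized = [
        (score - baseline) / (max_score - baseline)
        for score, baseline in results.values()
    ]
    return 100.0 * sum(normalized) / len(normalized)


npm = normalized_preferred_metric(results)
print(f"NPM (under the assumed baselines): {npm:.2f}")
```

A value near zero indicates performance close to random guessing on average, and negative per-task contributions are possible when a score falls below its baseline.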

## Cite as 🤗

```latex
@misc{fatimah2026liltii,
  title={{LilTii: A 0.6B Bengali Language Model that Outperforms Qwen}},
  author={Shiza Fatimah and Aniket Sen and Sophia Falk and Florian Mai and Lucie Flek and Nicholas Kluge Corr{\^e}a},
  year={2026},
  howpublished={\url{https://hf.co/blog/Polygl0t/liltii}}
}
```

## Acknowledgments

Polyglot is a project funded by the Federal Ministry of Education and Research (BMBF) and the Ministry of Culture and Science of the State of North Rhine-Westphalia (MWK) as part of TRA Sustainable Futures (University of Bonn) and the Excellence Strategy of the federal and state governments.

We also gratefully acknowledge access to the [Marvin cluster](https://www.hpc.uni-bonn.de/en/systems/marvin), hosted by the [University of Bonn](https://www.uni-bonn.de/en), along with the support provided by its High Performance Computing & Analytics Lab.

## License

This model is licensed under the Apache License, Version 2.0. For more details, see the [LICENSE](LICENSE) file.