--- language: - bn license: apache-2.0 library_name: transformers tags: - text-generation-inference datasets: - Polygl0t/gigakriya-v1 metrics: - perplexity pipeline_tag: text-generation model-index: - name: GigaKriya-ablation-NonEDU-1.5B results: - task: type: text-generation name: Text Generation dataset: name: ARC Challenge (Bengali) type: Polygl0t/ARC-poly split: test args: num_few_shot: 5 metrics: - type: acc_norm value: 24.29 name: accuracy (normalized) source: url: https://github.com/Polygl0t/lm-evaluation-harness name: Language Model Evaluation Harness (branch=polyglot_harness_bengali) - task: type: text-generation name: Text Generation dataset: name: HellaSwag (Bengali) type: Polygl0t/HellaSwag-poly split: validation args: num_few_shot: 5 metrics: - type: acc_norm value: 29.13 name: accuracy (normalized) source: url: https://github.com/Polygl0t/lm-evaluation-harness name: Language Model Evaluation Harness (branch=polyglot_harness_bengali) - task: type: text-generation name: Text Generation dataset: name: MMLU (Bengali) type: Polygl0t/MMLU-poly split: test args: num_few_shot: 5 metrics: - type: acc value: 24.49 name: accuracy source: url: https://github.com/Polygl0t/lm-evaluation-harness name: Language Model Evaluation Harness (branch=polyglot_harness_bengali) - task: type: text-generation name: Text Generation dataset: name: BOOLQ (Bengali) type: Polygl0t/BOOLQ split: test args: num_few_shot: 5 metrics: - type: acc_norm value: 51.85 name: accuracy (normalized) source: url: https://github.com/Polygl0t/lm-evaluation-harness name: Language Model Evaluation Harness (branch=polyglot_harness_bengali) - task: type: text-generation name: Text Generation dataset: name: PIQA (Bengali) type: Polygl0t/PIQA split: test args: num_few_shot: 5 metrics: - type: acc value: 48.96 name: accuracy source: url: https://github.com/Polygl0t/lm-evaluation-harness name: Language Model Evaluation Harness (branch=polyglot_harness_bengali) - task: type: text-generation name: Text Generation dataset: name: OpenBookQA (Bengali) type: Polygl0t/OpenBookQA split: test args: num_few_shot: 5 metrics: - type: acc_norm value: 20.72 name: accuracy (normalized) source: url: https://github.com/Polygl0t/lm-evaluation-harness name: Language Model Evaluation Harness (branch=polyglot_harness_bengali) - task: type: text-generation name: Text Generation dataset: name: CommonsenseQA (Bengali) type: Polygl0t/CommonsenseQA split: test args: num_few_shot: 5 metrics: - type: acc_norm value: 28.09 name: accuracy (normalized) source: url: https://github.com/Polygl0t/lm-evaluation-harness name: Language Model Evaluation Harness (branch=polyglot_harness_bengali) - task: type: text-generation name: Text Generation dataset: name: Bangla MMLU type: Polygl0t/BanglaMMLU split: test args: num_few_shot: 5 metrics: - type: acc_norm value: 24.74 name: accuracy (normalized) source: url: https://github.com/Polygl0t/lm-evaluation-harness name: Language Model Evaluation Harness (branch=polyglot_harness_bengali) --- # GigaKriya-ablation-NonEDU-1.5B ## Model Summary **[GigaKriya-ablation-NonEDU-1.5B](https://huggingface.co/Polygl0t/GigaKriya-ablation-NonEDU-1.5B)** is a decoder-transformer natively pretrained in Bengali. This model is part of an ablation study to measure the impact of our educational data filtering/augmentation strategy on the downstream performance of models trained with [GigaKriya](https://huggingface.co/datasets/Polygl0t/GigaKriya-v1). GigaKriya-ablation-NonEDU-1.5B was trained with ~34 billion tokens, those being a mixture of the non-educational portion of GigaKriya (i.e., samples with an Edu Score < 3). This model has 1.5 billion parameters and a context length of 4096 tokens. ## Details - **Architecture:** a Transformer-based model ([`llama`](https://huggingface.co/docs/transformers/main/en/model_doc/llama)) - **Size:** 1,510,066,176 parameters - **Context length:** 4096 tokens - **Dataset(s):** - [GigaKriya](https://huggingface.co/datasets/Polygl0t/GigaKriya-v1) (non-educational subset, Edu Score < 3) - **Language(s):** Bengali - **Batch size:** 2,097,152 tokens - **Number of steps:** 16,000 - **GPU:** 16 NVIDIA A40 (48 GB) - **Training time**: ~60.49 hours - **Emissions:** 94.44 KgCO2 (Germany) - **Total energy consumption:** 247.90 kWh This repository has the [source code](https://github.com/Polygl0t/llm-foundry) used to train this model. The complete configuration used for training is available in the following config file: - Single stage (linear warmup with cosine decay): [training_config.yaml](training_config.yaml) The main branch of this repository contains the final checkpoint saved at step 16,000. All other checkpoints are available as separate branches. To load a specific checkpoint, you can use the following code snippet: ```python from transformers import AutoModelForCausalLM, AutoTokenizer model_id = "Polygl0t/GigaKriya-ablation-NonEDU-1.5B" revision = "step-2000" # Change this to the desired checkpoint branch tokenizer = AutoTokenizer.from_pretrained(model_id) model = AutoModelForCausalLM.from_pretrained(model_id, revision=revision) ``` Or, you can access all the revisions for the models via the following code snippet: ```python from huggingface_hub import list_repo_refs out = list_repo_refs("Polygl0t/GigaKriya-ablation-NonEDU-1.5B") branches = [b.name for b in out.branches] print(branches) ``` ## Intended Uses The primary intended use of this model is to serve as a baseline for evaluating the impact of data quality and filtering on Bengali language model performance. Researchers and practitioners can use this model as a reference point for further ablation studies or for comparison with other models trained on different data mixtures. ## Basic usage ```python from transformers import GenerationConfig, TextGenerationPipeline, AutoTokenizer, AutoModelForCausalLM import torch # Specify the model and tokenizer model_id = "Polygl0t/GigaKriya-ablation-NonEDU-1.5B" tokenizer = AutoTokenizer.from_pretrained(model_id) model = AutoModelForCausalLM.from_pretrained(model_id) # Specify the generation parameters as you like generation_config = GenerationConfig( **{ "do_sample": True, "max_new_tokens": 150, "renormalize_logits": True, "repetition_penalty": 1.2, "temperature": 0.1, "top_k": 50, "top_p": 1.0, "use_cache": True, } ) device = torch.device("cuda" if torch.cuda.is_available() else "cpu") generator = TextGenerationPipeline(model=model, task="text-generation", tokenizer=tokenizer, device=device) # Generate text prompt = "ভারতের রাজধানী কী ?" completion = generator(prompt, generation_config=generation_config) print(completion[0]['generated_text']) ``` ## Evaluations Figures below show the per-benchmark performance of [GigaKriya-ablation-EDU-1.5B](https://huggingface.co/Polygl0t/GigaKriya-ablation-EDU-1.5B) (educational subset, Edu Score >= 3) compared to [GigaKriya-ablation-NonEDU-1.5B](https://huggingface.co/Polygl0t/GigaKriya-ablation-NonEDU-1.5B) (non educational subset, Edu Score < 3). *GigaKriya-Edu* outperforms *GigaKriya-NonEdu* on 7 of 8 benchmarks and achieves a higher NPM score. These results suggest that training on educationally curated content consistently yields stronger language understanding.
🏆 HellaSwag ![hellaswag](./.plots/hellaswag.png)
🏆 ARC Challenge ![arc_challenge](./.plots/arc.png)
🏆 MMLU ![mmlu](./.plots/mmlu.png)
🏆 Bangla MMLU ![bangla_mmlu](./.plots/bangla_mmlu.png)
🏆 BoolQ ![boolq](./.plots/boolq.png)
🏆PIQA ![piqa](./.plots/piqa.png)
🏆CommonsenseQA ![commonsense_qa](./.plots/commonsens_qa.png)
🏆OpenbookQA ![openbook_qa](./.plots/openbook_qa.png)
Aggregate NPM Across Benchmarks ![NPM](./.plots/npm.png)
## Cite as 🤗 ```latex @misc{fatimah2026liltii, title={{LilTii: A 0.6B Bengali Language Model that Outperforms Qwen}}, author={Shiza Fatimah and Aniket Sen and Sophia Falk and Florian Mai and Lucie Flek and Nicholas Kluge Corr{\^e}a}, year={2026}, howpublished={\url{https://hf.co/blog/Polygl0t/liltii}} } ``` ## Aknowlegments Polyglot is a project funded by the Federal Ministry of Education and Research (BMBF) and the Ministry of Culture and Science of the State of North Rhine-Westphalia (MWK) as part of TRA Sustainable Futures (University of Bonn) and the Excellence Strategy of the federal and state governments. We also gratefully acknowledge the granted access to the [Marvin cluster](https://www.hpc.uni-bonn.de/en/systems/marvin) hosted by [University of Bonn](https://www.uni-bonn.de/en) along with the support provided by its High Performance Computing & Analytics Lab. ## License This model is licensed under the Apache License, Version 2.0. For more details, see the [LICENSE](LICENSE) file.