---
language:
- cs
license: apache-2.0
---

# Model Description

* **Language:** Czech
* **Developed by:** [HPLT](https://hplt-project.org/)
* **Paper:** [arxiv.org/abs/2511.01066](https://arxiv.org/abs/2511.01066)
* **Evaluation results:** [hf.co/datasets/HPLT/2508-datasets-evals](https://huggingface.co/datasets/HPLT/2508-datasets-evals) using [HPLT-E](https://github.com/hplt-project/hplt-e/tree/main)
* **License:** Apache 2.0

The HPLT Llama-2b collection comprises monolingual decoder-only language models pretrained by the [HPLT](https://hplt-project.org/) team as part of the third release. The models are released as artifacts of our ablation studies, which evaluate different corpora and sampling strategies across multiple languages:

* [**⚖️ HPLT Pre-3.0 Comparison**](https://github.com/hplt-project/hplt-e/tree/main/results/2505-deduplication): Comparison of data deduplication strategies on a pre-release version of HPLT 3.0 across nine selected languages (HPLT 3.0 pre-release).
* [**📚 Corpora Comparison**](https://github.com/hplt-project/hplt-e/tree/main/results/2508-datasets): Evaluation of HPLT 2.0, HPLT 3.0, FineWeb 2.1.0, and MADLAD-400 1.0 on nine selected languages (HPLT 3.0 release).
* [**🧰 Web Document Scorer (WDS) Comparison**](https://github.com/hplt-project/hplt-e/tree/main/results/2508-wds): Analysis of HPLT 3.0 corpora sampled using different WDS thresholds, focusing on Spanish and French (HPLT 3.0 release).

More details are available in [our GitHub repository](https://github.com/hplt-project/hplt-e/tree/main) and the [pre-print](https://arxiv.org/abs/2511.01066).

### Model Architecture

All models follow the Llama architecture with 24 layers, 32 attention heads, and a sequence length of 2048. The tokenizer is Gemma-3, with a vocabulary size of 262K tokens.

### Pretraining Corpus

This model is pretrained from scratch on 100B tokens from HPLT 3.0.
For lower-resource languages with fewer than 100B tokens of available data, datasets are uniformly upsampled (repeated) following [Muennighoff et al. (2023)](https://openreview.net/forum?id=j5BuTrEj35). Pretraining is run using the Megatron-LM framework on the LUMI supercomputer, employing 16 AMD MI250x nodes.

## Intended Use

**Intended Use Cases:** The model is intended for research use in Czech and for reproducibility purposes. Since this model is *only* pretrained, its performance on a variety of natural language understanding and generation tasks can potentially be improved through post-training.

**Out of Scope:** Use of the model in languages other than those explicitly listed as supported in this model card.

## How to use

Due to quota limits, this repository contains only the following checkpoints:

- `2B`
- `10B`
- `21B`
- `31B`
- `40B`
- `50B`
- `61B`
- `71B`
- `80B`
- `90B`
- `main`

The other checkpoints can be provided upon request.

### Use with Transformers

You can run inference using the Transformers `pipeline` abstraction or by using the `Auto` classes with the `generate()` function.

```python
import torch
from transformers import pipeline

# Load the final checkpoint (the `main` revision) in bfloat16.
pipe = pipeline(
    "text-generation",
    model="HPLT/hplt-3.0-ces_Latn-llama-2b-100bt",
    torch_dtype=torch.bfloat16,
    device_map="auto"
)

# Illustrative prompt; this is a pretrained-only model, so it continues the text.
print(pipe("Praha je", max_new_tokens=32)[0]["generated_text"])
```

A specific intermediate checkpoint can be accessed using the `revision` argument when loading the model.

```python
from transformers import AutoModelForCausalLM
import torch

# Name of one of the checkpoints listed above.
revision = "10B"

model = AutoModelForCausalLM.from_pretrained(
    "HPLT/hplt-3.0-ces_Latn-llama-2b-100bt",
    torch_dtype=torch.bfloat16,
    revision=revision,
    device_map="auto"
)
```

## Cite us

```
@article{oepen2025hplt,
  title={HPLT 3.0: Very Large-Scale Multilingual Resources for LLM and MT. Mono- and Bi-lingual Data, Multilingual Evaluation, and Pre-Trained Models},
  author={Oepen, Stephan and Arefev, Nikolay and Aulamo, Mikko and Ba{\~n}{\'o}n, Marta and Buljan, Maja and Burchell, Laurie and Charpentier, Lucas and Chen, Pinzhen and Fedorova, Mariya and de Gibert, Ona and others},
  journal={arXiv preprint arXiv:2511.01066},
  year={2025}
}
```
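
A note on the uniform upsampling mentioned under Pretraining Corpus: the number of passes over a corpus needed to fill the 100B-token budget can be estimated with simple arithmetic. The sketch below is not taken from the HPLT training pipeline. The function name `epochs_for_budget` and the example corpus sizes are hypothetical, and it assumes that "uniform" means every document is repeated equally often.

```python
TOKEN_BUDGET = 100_000_000_000  # 100B training tokens, as stated in this model card


def epochs_for_budget(available_tokens: int, budget_tokens: int = TOKEN_BUDGET) -> float:
    """Return how many passes over the corpus are needed to fill the budget.

    A value above 1.0 means the corpus must be repeated (upsampled);
    a value at or below 1.0 means a subset of the corpus is enough.
    """
    if available_tokens <= 0:
        raise ValueError("available_tokens must be positive")
    return budget_tokens / available_tokens


# Hypothetical corpus sizes, for illustration only.
print(epochs_for_budget(25_000_000_000))   # 4.0 (each document seen four times)
print(epochs_for_budget(40_000_000_000))   # 2.5
print(epochs_for_budget(200_000_000_000))  # 0.5 (no upsampling needed)
```

Under these assumptions, a language with 25B available tokens would be repeated four times, while a language with more than 100B tokens would not be upsampled at all.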