---
language:
- cs
license: apache-2.0
---

# Model Description

<img src="https://hplt-project.org/_next/static/media/logo-hplt.d5e16ca5.svg" width=12.5%>

* **Language:** Czech
* **Developed by:** [HPLT](https://hplt-project.org/)
* **Paper:** [arxiv.org/abs/2511.01066](https://arxiv.org/abs/2511.01066)
* **Evaluation results:** [hf.co/datasets/HPLT/2508-datasets-evals](https://huggingface.co/datasets/HPLT/2508-datasets-evals) using [HPLT-E](https://github.com/hplt-project/hplt-e/tree/main)
* **License:** Apache 2.0

The HPLT Llama-2b collection comprises monolingual decoder-only language models pretrained by the [HPLT](https://hplt-project.org/) team as part of its third release.

The models are released as artifacts of our ablation studies, which evaluate different corpora and sampling strategies across multiple languages:

* [**⚖️ HPLT Pre-3.0 Comparison**](https://github.com/hplt-project/hplt-e/tree/main/results/2505-deduplication): Comparison of data deduplication strategies on a pre-release version of HPLT 3.0 across nine selected languages (HPLT 3.0 pre-release).
* [**📚 Corpora Comparison**](https://github.com/hplt-project/hplt-e/tree/main/results/2508-datasets): Evaluation of HPLT 2.0, HPLT 3.0, FineWeb 2.1.0, and MADLAD-400 1.0 on nine selected languages (HPLT 3.0 release).
* [**🧰 Web Document Scorer (WDS) Comparison**](https://github.com/hplt-project/hplt-e/tree/main/results/2508-wds): Analysis of HPLT 3.0 corpora sampled using different WDS thresholds, focusing on Spanish and French (HPLT 3.0 release).

Please find more details in [our GitHub repository](https://github.com/hplt-project/hplt-e/tree/main) and our [preprint](https://arxiv.org/abs/2511.01066).

### Model Architecture

All models follow the Llama architecture with 24 layers, 32 attention heads, and a sequence length of 2048. The tokenizer is the Gemma-3 tokenizer, with a vocabulary of 262K tokens.

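
The figures above can be checked against the published configuration. The following is a minimal sketch, assuming the repository ships a Llama-style `config.json` that Transformers can read with `AutoConfig`; the helper names `compare_with_card` and `load_config` are hypothetical and only serve this illustration.

```python
# Values stated in this model card.
STATED = {"num_hidden_layers": 24, "num_attention_heads": 32}


def compare_with_card(config_dict, stated=STATED):
    """Return {field: (stated, actual)} for every field that differs from the card."""
    mismatches = {}
    for field, expected in stated.items():
        actual = config_dict.get(field)
        if actual != expected:
            mismatches[field] = (expected, actual)
    return mismatches


def load_config(repo_id="HPLT/hplt-3.0-ces_Latn-llama-2b-100bt"):
    """Download the model configuration (assumes a Llama-style config.json)."""
    # Imported lazily so the comparison helper also works without transformers.
    from transformers import AutoConfig

    return AutoConfig.from_pretrained(repo_id).to_dict()
```

Calling `compare_with_card(load_config())` should return an empty dictionary if the published configuration agrees with the card. The sequence length and vocabulary size can be read from the `max_position_embeddings` and `vocab_size` fields of the same dictionary, if the configuration exposes them.
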
### Pretraining Corpus

This model is pretrained from scratch on 100B tokens from HPLT 3.0. For lower-resource languages with fewer than 100B tokens of available data, datasets are uniformly upsampled (repeated) following [Muennighoff et al. (2023)](https://openreview.net/forum?id=j5BuTrEj35). Pretraining is run using the Megatron-LM framework on the LUMI supercomputer, employing 16 AMD MI250x nodes.

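
To make the upsampling arithmetic concrete, the sketch below computes how many passes over a corpus are needed to fill a fixed token budget. It only illustrates the arithmetic: the function name `upsampling_plan` and the 40B-token corpus size are hypothetical, and the exact repetition scheme used in pretraining is not described here.

```python
TOKEN_BUDGET = 100_000_000_000  # 100B tokens, as stated in this model card


def upsampling_plan(available_tokens, budget=TOKEN_BUDGET):
    """Describe how often a corpus must be repeated to fill the token budget.

    Returns the number of complete passes, the tokens drawn from one further
    partial pass, and the resulting number of epochs as a float.
    """
    if available_tokens <= 0:
        raise ValueError("available_tokens must be positive")
    full_passes, remainder_tokens = divmod(budget, available_tokens)
    return {
        "full_passes": full_passes,
        "remainder_tokens": remainder_tokens,
        "epochs": budget / available_tokens,
    }


# Hypothetical corpus with 40B tokens of available data.
plan = upsampling_plan(40_000_000_000)
print(plan["full_passes"], plan["remainder_tokens"], plan["epochs"])
```

For the hypothetical 40B-token corpus this gives two complete passes plus 20B tokens from a third, or 2.5 epochs. A corpus larger than the budget yields fewer than one epoch, meaning it is subsampled, not repeated.
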
## Intended Use

**Intended Use Cases:** The model is intended for research use in Czech and for reproducibility purposes. Since this model is *only* pretrained, its performance on natural language understanding and generation tasks can potentially be improved through post-training.

**Out of Scope:** Use in languages other than those explicitly listed as supported in this model card.

## How to use

This repository contains all our intermediate checkpoints.

### Use with Transformers

You can run inference with the Transformers `pipeline` abstraction or with the `Auto` classes and the `generate()` function.

```python
import torch
from transformers import pipeline

# Load the final checkpoint as a text-generation pipeline.
pipe = pipeline(
    "text-generation",
    model="HPLT/hplt-3.0-ces_Latn-llama-2b-100bt",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
```

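
Once the pipeline is built, generating text is a single call. The sketch below wraps that call in two hypothetical helpers, `generation_kwargs` and `complete`. The sampling values are illustrative and not recommended settings, and the prompt should be a plain text prefix to continue, since the model is only pretrained.

```python
def generation_kwargs(max_new_tokens=64, temperature=0.7, top_p=0.9):
    """Sampling settings passed to the pipeline call (illustrative values)."""
    return {
        "max_new_tokens": max_new_tokens,
        "do_sample": True,
        "temperature": temperature,
        "top_p": top_p,
    }


def complete(prompt, **overrides):
    """Continue `prompt` with the model and return the generated text."""
    # Imported lazily so generation_kwargs() can be inspected without a GPU stack.
    import torch
    from transformers import pipeline

    pipe = pipeline(
        "text-generation",
        model="HPLT/hplt-3.0-ces_Latn-llama-2b-100bt",
        torch_dtype=torch.bfloat16,
        device_map="auto",
    )
    outputs = pipe(prompt, **{**generation_kwargs(), **overrides})
    # The text-generation pipeline returns a list of dicts, one per sequence.
    return outputs[0]["generated_text"]
```

For example, `complete("Praha je hlavní město", max_new_tokens=32)` would return the prompt followed by a sampled Czech continuation.
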
A specific intermediate checkpoint can be loaded by passing the `revision` argument when loading the model.

```python
import torch
from transformers import AutoModelForCausalLM

revision = "10B"  # intermediate checkpoint to load

model = AutoModelForCausalLM.from_pretrained(
    "HPLT/hplt-3.0-ces_Latn-llama-2b-100bt",
    torch_dtype=torch.bfloat16,
    revision=revision,
    device_map="auto",
)
```

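
To find out which revisions exist, the repository's refs can be listed with `huggingface_hub`. This is a sketch under one assumption: that intermediate checkpoints are published as branches named after their token count, as the `"10B"` example suggests. The helpers `token_count`, `sort_checkpoints`, and `list_checkpoints` are hypothetical names.

```python
import re


def token_count(revision):
    """Parse a name such as '10B' into a token count; return None otherwise."""
    match = re.fullmatch(r"(\d+(?:\.\d+)?)B", revision)
    return float(match.group(1)) * 1e9 if match else None


def sort_checkpoints(revisions):
    """Keep only token-count revisions and order them from earliest to latest."""
    named = [r for r in revisions if token_count(r) is not None]
    return sorted(named, key=token_count)


def list_checkpoints(repo_id="HPLT/hplt-3.0-ces_Latn-llama-2b-100bt"):
    """List checkpoint branches of the repository (assumes branch-per-checkpoint)."""
    # Imported lazily so the parsing helpers work without huggingface_hub.
    from huggingface_hub import list_repo_refs

    refs = list_repo_refs(repo_id)
    return sort_checkpoints([branch.name for branch in refs.branches])
```

Any name returned by `list_checkpoints()` can be passed as `revision` to `from_pretrained`. If the checkpoints turn out to be stored as tags, `refs.tags` would be the place to look instead.
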
## Cite us

```
@article{oepen2025hplt,
  title={HPLT 3.0: Very Large-Scale Multilingual Resources for LLM and MT. Mono- and Bi-lingual Data, Multilingual Evaluation, and Pre-Trained Models},
  author={Oepen, Stephan and Arefev, Nikolay and Aulamo, Mikko and Ba{\~n}{\'o}n, Marta and Buljan, Maja and Burchell, Laurie and Charpentier, Lucas and Chen, Pinzhen and Fedorova, Mariya and de Gibert, Ona and others},
  journal={arXiv preprint arXiv:2511.01066},
  year={2025}
}
```