---
language:
- cs
license: apache-2.0
---


# Model Description

<img src="https://hplt-project.org/_next/static/media/logo-hplt.d5e16ca5.svg" width=12.5%>


* **Language:** Czech
* **Developed by:** [HPLT](https://hplt-project.org/)
* **Paper:** [arxiv.org/abs/2511.01066](https://arxiv.org/abs/2511.01066)
* **Evaluation results:** [hf.co/datasets/HPLT/2508-datasets-evals](https://huggingface.co/datasets/HPLT/2508-datasets-evals) using [HPLT-E](https://github.com/hplt-project/hplt-e/tree/main)
* **License:** Apache 2.0

The HPLT Llama-2b collection comprises monolingual decoder-only language models pretrained by the [HPLT](https://hplt-project.org/) team as part of its third release.

The models are released as artifacts of our ablation studies, which evaluate different corpora and sampling strategies across multiple languages:

* [**⚖️ HPLT Pre-3.0 Comparison**](https://github.com/hplt-project/hplt-e/tree/main/results/2505-deduplication): Comparison of data deduplication strategies on a pre-release version of HPLT 3.0 across nine selected languages (HPLT 3.0 pre-release).
* [**📚 Corpora Comparison**](https://github.com/hplt-project/hplt-e/tree/main/results/2508-datasets): Evaluation of HPLT 2.0, HPLT 3.0, FineWeb 2.1.0, and MADLAD-400 1.0 on nine selected languages (HPLT 3.0 release).
* [**🧰 Web Document Scorer (WDS) Comparison**](https://github.com/hplt-project/hplt-e/tree/main/results/2508-wds): Analysis of HPLT 3.0 corpora sampled using different WDS thresholds, focusing on Spanish and French (HPLT 3.0 release).

Please find more details in [our GitHub repository](https://github.com/hplt-project/hplt-e/tree/main) and [pre-print](https://arxiv.org/abs/2511.01066).

### Model Architecture

All models follow the Llama architecture with 24 layers, 32 attention heads, and a sequence length of 2048. The tokenizer is that of Gemma-3, with a vocabulary of 262K tokens.

### Pretraining Corpus

This model is pretrained from scratch on 100B tokens from HPLT 3.0. For lower-resource languages with fewer than 100B tokens of available data, the datasets are uniformly upsampled (repeated), following [Muennighoff et al. (2023)](https://openreview.net/forum?id=j5BuTrEj35). Pretraining was run with the Megatron-LM framework on the LUMI supercomputer, using 16 AMD MI250X nodes.
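
The upsampling rule described here amounts to simple arithmetic. The sketch below is illustrative only: `epochs_for_budget` is a hypothetical helper, not part of the HPLT training code, and the corpus sizes in the example calls are made up rather than figures for any HPLT language.

```python
TOKEN_BUDGET = 100_000_000_000  # 100B-token pretraining budget stated in this card


def epochs_for_budget(available_tokens: int, budget: int = TOKEN_BUDGET) -> float:
    """Return how many passes over a corpus are needed to fill the token budget.

    A value of 1.0 means the corpus already covers the budget and is not
    repeated; values above 1.0 mean the whole corpus is repeated uniformly.
    """
    if available_tokens <= 0:
        raise ValueError("available_tokens must be positive")
    return max(1.0, budget / available_tokens)


# Hypothetical corpus with 25B tokens: it would be seen four times.
print(epochs_for_budget(25_000_000_000))  # 4.0
# A corpus with at least 100B tokens is sampled without repetition.
print(epochs_for_budget(300_000_000_000))  # 1.0
```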

## Intended Use

**Intended Use Cases:** The model is intended for research use in Czech and for reproducibility purposes. Because it is *only* pretrained, its performance on natural language understanding and generation tasks can potentially be improved through post-training.

**Out of Scope:** Use of the model in languages other than those explicitly listed as supported in this model card.

## How to use

Due to quota limits, this repository contains only the following checkpoints:

- `2B`
- `10B`
- `21B`
- `31B`
- `40B`
- `50B`
- `61B`
- `71B`
- `80B`
- `90B`
- `main`

The other checkpoints can be provided upon request.
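
The branch names listed here appear to encode the approximate number of training tokens seen at each checkpoint. Assuming that reading, and assuming `main` holds the final 100B-token model (neither is stated explicitly in this card), a small helper can pick the closest available revision for a given token count. `closest_revision` is a hypothetical name used only for illustration.

```python
# Revisions listed in this model card. Token counts (in billions) are read off
# the branch names; mapping `main` to 100B is an assumption.
REVISION_TOKENS_B = {
    "2B": 2, "10B": 10, "21B": 21, "31B": 31, "40B": 40,
    "50B": 50, "61B": 61, "71B": 71, "80B": 80, "90B": 90,
    "main": 100,
}


def closest_revision(target_billion_tokens: float) -> str:
    """Return the listed revision whose token count is nearest to the target."""
    return min(
        REVISION_TOKENS_B,
        key=lambda name: abs(REVISION_TOKENS_B[name] - target_billion_tokens),
    )


print(closest_revision(25))  # 21B
print(closest_revision(97))  # main
```

The returned string can be passed as the `revision` argument when loading the model.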

### Use with Transformers

You can run inference either with the Transformers `pipeline` abstraction or with the `Auto` classes and the `generate()` function.

```python
import torch
from transformers import pipeline

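# Load the model from the default (`main`) revision in bfloat16 and let
# Transformers place it on the available device(s).
pipe = pipeline(
    "text-generation",
    model="HPLT/hplt-3.0-ces_Latn-llama-2b-100bt",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)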
```
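
Once the pipeline is created, generating text is a single call. The helper below is a minimal sketch: `continue_text` is a hypothetical wrapper, and the greedy-decoding settings are illustrative choices, not recommendations from the HPLT team. Because the model is only pretrained, it is prompted with plain text to continue rather than with instructions.

```python
def continue_text(pipe, prompt: str, max_new_tokens: int = 50) -> str:
    """Return the model's continuation of `prompt`, without the prompt itself.

    `pipe` is expected to be a Transformers text-generation pipeline such as
    the one created above.
    """
    outputs = pipe(
        prompt,
        max_new_tokens=max_new_tokens,
        do_sample=False,         # greedy decoding for reproducible output
        return_full_text=False,  # drop the prompt from the returned text
    )
    # The pipeline returns a list with one dict per generated sequence.
    return outputs[0]["generated_text"]
```

For example, `continue_text(pipe, "Hlavní město České republiky je")` returns the generated continuation as a string.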

A specific intermediate checkpoint can be accessed by passing the `revision` argument when loading the model.

```python
from transformers import AutoModelForCausalLM
import torch

revision = "10B"  # any branch name from the checkpoint list above

model = AutoModelForCausalLM.from_pretrained(
    "HPLT/hplt-3.0-ces_Latn-llama-2b-100bt",
    torch_dtype=torch.bfloat16,
    revision=revision,
    device_map="auto"
)
```
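
The loaded model can then be used with `generate()`. The following is a sketch under assumptions: `generate_text` is a hypothetical helper, the tokenizer is assumed to be loadable from the same repository with `AutoTokenizer.from_pretrained` (passing the same `revision` keeps the two in sync), and the decoding settings are illustrative.

```python
def generate_text(model, tokenizer, prompt: str, max_new_tokens: int = 50) -> str:
    """Greedily continue `prompt` and return the decoded prompt plus continuation."""
    # Tokenize the prompt and move the tensors to the device holding the model.
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output_ids = model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,
        do_sample=False,  # greedy decoding for reproducible output
    )
    # generate() returns one sequence per input; decode the first one.
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

With `tokenizer = AutoTokenizer.from_pretrained("HPLT/hplt-3.0-ces_Latn-llama-2b-100bt", revision=revision)`, calling `generate_text(model, tokenizer, "Hlavní město České republiky je")` returns the prompt followed by the model's continuation.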

## Cite us

```bibtex
@article{oepen2025hplt,
  title={HPLT 3.0: Very Large-Scale Multilingual Resources for LLM and MT. Mono- and Bi-lingual Data, Multilingual Evaluation, and Pre-Trained Models},
  author={Oepen, Stephan and Arefev, Nikolay and Aulamo, Mikko and Ba{\~n}{\'o}n, Marta and Buljan, Maja and Burchell, Laurie and Charpentier, Lucas and Chen, Pinzhen and Fedorova, Mariya and de Gibert, Ona and others},
  journal={arXiv preprint arXiv:2511.01066},
  year={2025}
}
```