---
language:
- cs
license: apache-2.0
---
# Model Description
<img src="https://hplt-project.org/_next/static/media/logo-hplt.d5e16ca5.svg" width=12.5%>
* **Language:** Czech
* **Developed by:** [HPLT](https://hplt-project.org/)
* **Paper:** [arxiv.org/abs/2511.01066](https://arxiv.org/abs/2511.01066)
* **Evaluation results:** [hf.co/datasets/HPLT/2508-datasets-evals](https://huggingface.co/datasets/HPLT/2508-datasets-evals) using [HPLT-E](https://github.com/hplt-project/hplt-e/tree/main)
* **License:** Apache 2.0
The HPLT Llama-2b collection comprises monolingual decoder-only language models pretrained by the [HPLT](https://hplt-project.org/) team as part of the third release.
The models are released as artifacts of our ablation studies on evaluating different corpora and sampling strategies across multiple languages:
* [**⚖️ HPLT Pre-3.0 Comparison**](https://github.com/hplt-project/hplt-e/tree/main/results/2505-deduplication): Comparison of data deduplication strategies on a pre-release version of HPLT 3.0 across nine selected languages (HPLT 3.0 pre-release).
* [**📚 Corpora Comparison**](https://github.com/hplt-project/hplt-e/tree/main/results/2508-datasets): Evaluation of HPLT 2.0, HPLT 3.0, FineWeb 2.1.0, and MADLAD-400 1.0 on nine selected languages (HPLT 3.0 release).
* [**🧰 Web Document Scorer (WDS) Comparison**](https://github.com/hplt-project/hplt-e/tree/main/results/2508-wds): Analysis of HPLT 3.0 corpora sampled using different WDS thresholds, focusing on Spanish and French (HPLT 3.0 release).
Please find more details in [our GitHub repository](https://github.com/hplt-project/hplt-e/tree/main) and [pre-print](https://arxiv.org/abs/2511.01066).
### Model Architecture
All models follow the Llama architecture with 24 layers, 32 attention heads, and a sequence length of 2048. The tokenizer is Gemma-3 with a vocabulary size of 262K tokens.
### Pretraining Corpus
This model is pretrained from scratch on 100B tokens from HPLT 3.0. For lower-resource languages with fewer than 100B tokens of available data, datasets are uniformly upsampled (repeated) following [Muennighoff et al. (2023)](https://openreview.net/forum?id=j5BuTrEj35). Pretraining is run using the Megatron-LM framework on the LUMI supercomputer, employing 16 AMD MI250x nodes.
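The upsampling rule above comes down to simple arithmetic. The sketch below is only an illustration and not the HPLT data pipeline: the helper `repetition_factor` and the 40B-token corpus size are hypothetical, and the only figure taken from this card is the 100B-token budget.

```python
TOKEN_BUDGET = 100_000_000_000  # 100B-token pretraining budget stated in this card


def repetition_factor(available_tokens: int, budget: int = TOKEN_BUDGET) -> float:
    """Return how many passes over a corpus are needed to fill the budget.

    Corpora at or above the budget are not repeated (the factor is 1.0);
    smaller corpora are repeated uniformly until the budget is reached.
    """
    if available_tokens <= 0:
        raise ValueError("available_tokens must be positive")
    return max(1.0, budget / available_tokens)


# Hypothetical corpus with 40B tokens: each document is seen 2.5 times on average.
print(repetition_factor(40_000_000_000))  # 2.5
```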
## Intended Use
**Intended Use Cases:** The model is intended for research use in Czech and for reproducibility purposes. Because this model is *only* pretrained, post-training may improve its performance on a variety of natural language understanding and generation tasks.
**Out of Scope:** Use of the model in languages other than those explicitly listed as supported in this model card.
## How to use
This repository contains all our intermediate checkpoints.
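To see which checkpoints are available, the repository's git references can be listed with `huggingface_hub`. This is a minimal sketch that assumes the checkpoints are exposed as branches or tags, which this card does not state explicitly; `merge_revision_names` and `list_checkpoints` are hypothetical helper names.

```python
from typing import Iterable, List

REPO_ID = "HPLT/hplt-3.0-ces_Latn-llama-2b-100bt"


def merge_revision_names(branches: Iterable[str], tags: Iterable[str]) -> List[str]:
    """Combine branch and tag names into one sorted list without duplicates."""
    return sorted(set(branches) | set(tags))


def list_checkpoints(repo_id: str = REPO_ID) -> List[str]:
    """Query the Hub (network access required) and return all revision names."""
    # Imported lazily so the pure helper above works without huggingface_hub.
    from huggingface_hub import list_repo_refs

    refs = list_repo_refs(repo_id)
    return merge_revision_names(
        (ref.name for ref in refs.branches),
        (ref.name for ref in refs.tags),
    )


# Example call (requires network access):
# print(list_checkpoints())
```

Any name returned this way can be passed as the `revision` argument when loading the model.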
### Use with Transformers
You can run inference using the Transformers pipeline abstraction or by using the `Auto` classes with the `generate()` function.
```python
import torch
from transformers import pipeline
pipe = pipeline(
    "text-generation",
    model="HPLT/hplt-3.0-ces_Latn-llama-2b-100bt",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
```
A specific intermediate checkpoint can be accessed using the `revision` argument when loading the model.
```python
from transformers import AutoModelForCausalLM
import torch
revision = "10B"
model = AutoModelForCausalLM.from_pretrained(
    "HPLT/hplt-3.0-ces_Latn-llama-2b-100bt",
    torch_dtype=torch.bfloat16,
    revision=revision,
    device_map="auto",
)
```
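Once a checkpoint is loaded, text can be produced with `generate()`. The following is a sketch rather than an official recipe: the helper name `complete`, the Czech prompt, and the generation settings (`max_new_tokens=50`, greedy decoding) are illustrative assumptions. It also assumes that the default branch is named `main` and that tokenizer files are available under the same revision as the weights.

```python
MODEL_ID = "HPLT/hplt-3.0-ces_Latn-llama-2b-100bt"


def complete(prompt: str, revision: str = "main", max_new_tokens: int = 50) -> str:
    """Continue `prompt` with the checkpoint stored under `revision`."""
    # Imported lazily: these are heavy dependencies and the call downloads weights.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    # Assumption: the tokenizer is stored under the same revision as the weights.
    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, revision=revision)
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID,
        torch_dtype=torch.bfloat16,
        revision=revision,
        device_map="auto",
    )

    # Tokenize the prompt and move the tensors to the model's device.
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        output_ids = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            do_sample=False,  # greedy decoding, chosen here for reproducibility
        )
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)


# Example call (downloads the model; the prompt is an arbitrary illustration):
# print(complete("Praha je hlavní město", revision="10B"))
```

Because the model is only pretrained, it continues the prompt as plain text and is not expected to follow instructions.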
## Cite us
```
@article{oepen2025hplt,
  title={HPLT 3.0: Very Large-Scale Multilingual Resources for LLM and MT. Mono- and Bi-lingual Data, Multilingual Evaluation, and Pre-Trained Models},
author={Oepen, Stephan and Arefev, Nikolay and Aulamo, Mikko and Ba{\~n}{\'o}n, Marta and Buljan, Maja and Burchell, Laurie and Charpentier, Lucas and Chen, Pinzhen and Fedorova, Mariya and de Gibert, Ona and others},
journal={arXiv preprint arXiv:2511.01066},
year={2025}
}
```