	---
	language:
	- cs
	license: apache-2.0
	---


	# Model Description

	<img src="https://hplt-project.org/_next/static/media/logo-hplt.d5e16ca5.svg" width=12.5%>


	* Language: Czech
	* Developed by: [HPLT](https://hplt-project.org/)
	* Paper: [arxiv.org/abs/2511.01066](https://arxiv.org/abs/2511.01066)
	* Evaluation results: [hf.co/datasets/HPLT/2508-datasets-evals](https://huggingface.co/datasets/HPLT/2508-datasets-evals) using [HPLT-E](https://github.com/hplt-project/hplt-e/tree/main)
	* License: Apache 2.0

	The HPLT Llama-2b collection comprises monolingual decoder-only language models pretrained by the [HPLT](https://hplt-project.org/) team as part of the third release.

	The models are released as artifacts of our ablation studies on evaluating different corpora and sampling strategies across multiple languages:

	* [⚖️ HPLT Pre-3.0 Comparison](https://github.com/hplt-project/hplt-e/tree/main/results/2505-deduplication): Comparison of data deduplication strategies on a pre-release version of HPLT 3.0 across nine selected languages (HPLT 3.0 pre-release).
	* [📚 Corpora Comparison](https://github.com/hplt-project/hplt-e/tree/main/results/2508-datasets): Evaluation of HPLT 2.0, HPLT 3.0, FineWeb 2.1.0, and MADLAD-400 1.0 on nine selected languages (HPLT 3.0 release).
	* [🧰 Web Document Scorer (WDS) Comparison](https://github.com/hplt-project/hplt-e/tree/main/results/2508-wds): Analysis of HPLT 3.0 corpora sampled using different WDS thresholds, focusing on Spanish and French (HPLT 3.0 release).

	Please find more details in [our GitHub repository](https://github.com/hplt-project/hplt-e/tree/main) and [pre-print](https://arxiv.org/abs/2511.01066).

	### Model Architecture

	All models follow the Llama architecture with 24 layers, 32 attention heads, and a sequence length of 2048. The tokenizer is Gemma-3, with a vocabulary size of 262K tokens.

	### Pretraining Corpus

	This model is pretrained from scratch on 100B tokens from HPLT 3.0. For lower-resource languages with fewer than 100B tokens of available data, the datasets are uniformly upsampled (repeated) following [Muennighoff et al. (2023)](https://openreview.net/forum?id=j5BuTrEj35). Pretraining is run with the Megatron-LM framework on the LUMI supercomputer, using 16 AMD MI250x nodes.
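As a rough illustration of what uniform upsampling means for the token budget, the sketch below computes how often a corpus would have to be repeated to reach 100B tokens. The function names and the 40B-token corpus size are hypothetical and chosen only for the example; this is not the project's actual data pipeline.

```python
from typing import Tuple

# 100B-token pretraining budget, as stated in this model card.
TOKEN_BUDGET = 100_000_000_000


def effective_epochs(available_tokens: int, budget: int = TOKEN_BUDGET) -> float:
    """Average number of passes over the corpus needed to fill the budget."""
    if available_tokens <= 0:
        raise ValueError("available_tokens must be positive")
    return budget / available_tokens


def repetition_plan(available_tokens: int, budget: int = TOKEN_BUDGET) -> Tuple[int, int]:
    """Split the budget into full corpus passes plus a partial final pass (in tokens)."""
    if available_tokens <= 0:
        raise ValueError("available_tokens must be positive")
    return divmod(budget, available_tokens)


# Hypothetical corpus with 40B tokens of available data (illustrative figure only).
corpus_tokens = 40_000_000_000
print(effective_epochs(corpus_tokens))
print(repetition_plan(corpus_tokens))
```

A value of `effective_epochs` above 1 means the corpus is repeated; a value at or below 1 means the budget can be filled without repetition.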

	## Intended Use

	Intended Use Cases: The model is intended for research use in Czech and for reproducibility purposes. Because the model is only pretrained, its performance on a variety of natural language understanding and generation tasks can potentially be improved using post-training data.

	Out of Scope: Use of the model in languages other than those explicitly listed as supported in this model card.

	## How to use

	This repository contains all our intermediate checkpoints.

	### Use with Transformers

	You can run inference either with the Transformers `pipeline` abstraction or with the `Auto` classes and the `generate()` function.

	```python
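	import torch
	from transformers import pipeline

	# Load the final checkpoint in bfloat16 and place it on the available device(s).
	pipe = pipeline(
	    "text-generation",
	    model="HPLT/hplt-3.0-ces_Latn-llama-2b-100bt",
	    torch_dtype=torch.bfloat16,
	    device_map="auto",
	)

	# Illustrative Czech prompt ("Prague is"); a pretrained-only model simply continues the text.
	output = pipe("Praha je", max_new_tokens=50)
	print(output[0]["generated_text"])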
	```

	A specific intermediate checkpoint can be accessed using the `revision` argument when loading the model.

	```python
	import torch
	from transformers import AutoModelForCausalLM

	# Name of the intermediate checkpoint revision to load.
	revision = "10B"

	model = AutoModelForCausalLM.from_pretrained(
	    "HPLT/hplt-3.0-ces_Latn-llama-2b-100bt",
	    torch_dtype=torch.bfloat16,
	    revision=revision,
	    device_map="auto",
	)
	```
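To find out which intermediate checkpoints exist, one option is to query the repository's Git references with `huggingface_hub.list_repo_refs`. The sketch below rests on two assumptions that this card does not confirm: that checkpoints are stored as branches, and that their names follow the `<number>B` pattern suggested by `"10B"`. The helper names `sort_checkpoints` and `list_checkpoints` are hypothetical.

```python
import re
from typing import List

# Assumption: checkpoint revisions are named like "10B", "20B", ... (token count in billions).
CHECKPOINT_PATTERN = re.compile(r"^(\d+)B$")


def sort_checkpoints(names: List[str]) -> List[str]:
    """Keep names shaped like '10B' and sort them by token count, not alphabetically."""
    matched = []
    for name in names:
        match = CHECKPOINT_PATTERN.match(name)
        if match:
            matched.append((int(match.group(1)), name))
    return [name for _, name in sorted(matched)]


def list_checkpoints(repo_id: str) -> List[str]:
    """List checkpoint-like branch names of a Hub repository (requires network access)."""
    # Imported lazily so sort_checkpoints can be used without the Hub client installed.
    from huggingface_hub import list_repo_refs

    # Assumption: intermediate checkpoints are published as branches of the repository.
    refs = list_repo_refs(repo_id)
    return sort_checkpoints([branch.name for branch in refs.branches])


# Example call (not executed here because it contacts the Hub):
# list_checkpoints("HPLT/hplt-3.0-ces_Latn-llama-2b-100bt")
```

Any name returned this way can be passed as the `revision` argument shown above. If the repository uses tags rather than branches, `refs.tags` would be the place to look instead.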

	## Cite us

	```
	@article{oepen2025hplt,
	title={HPLT~3.0: Very Large-Scale Multilingual Resources for LLM and MT. Mono- and Bi-lingual Data, Multilingual Evaluation, and Pre-Trained Models},
	author={Oepen, Stephan and Arefev, Nikolay and Aulamo, Mikko and Ba{\~n}{\'o}n, Marta and Buljan, Maja and Burchell, Laurie and Charpentier, Lucas and Chen, Pinzhen and Fedorova, Mariya and de Gibert, Ona and others},
	journal={arXiv preprint arXiv:2511.01066},
	year={2025}
	}
	```