---
language:
  - kk
license: apache-2.0
library_name: transformers
tags:
  - llama
  - kazakh
  - causal-lm
  - from-scratch
  - soz
pipeline_tag: text-generation
---

# Kazakh LLaMA 50M (Soz)

A LLaMA-architecture causal language model with ~50M parameters, trained from scratch on the multidomain Kazakh dataset (23.6M samples).

## Overview

| Property | Value |
|----------|-------|
| **Parameters** | ~50M |
| **Architecture** | LLaMA (RoPE, SwiGLU, RMSNorm) |
| **Hidden dim** | 576 |
| **Layers** | 8 |
| **Attention heads** | 8 |
| **Training data** | [kz-transformers/multidomain-kazakh-dataset](https://huggingface.co/datasets/kz-transformers/multidomain-kazakh-dataset) |
| **Tokenizer** | [kazakh-bpe-32k](https://huggingface.co/stukenov/kazakh-bpe-32k) |
| **License** | Apache 2.0 |
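
As a rough sanity check on the ~50M figure, the sketch below counts weights for a LLaMA-style decoder from the dimensions in the table. Several inputs are assumptions rather than published values: a vocabulary of 32,000 (suggested only by the tokenizer name), a hypothetical SwiGLU intermediate size of 1536, tied input/output embeddings, and standard multi-head attention with no grouped-query sharing. The `llama_param_count` helper is illustrative and not part of any library.

```python
def llama_param_count(vocab_size, hidden, layers, intermediate, tie_embeddings=True):
    """Approximate weight count for a bias-free LLaMA-style decoder."""
    embed = vocab_size * hidden          # token embedding matrix
    attn = 4 * hidden * hidden           # q, k, v, o projections (assumes no GQA)
    mlp = 3 * hidden * intermediate      # SwiGLU: gate, up, and down projections
    norms = 2 * hidden                   # two RMSNorm weight vectors per layer
    total = embed + layers * (attn + mlp + norms) + hidden  # + final RMSNorm
    if not tie_embeddings:
        total += vocab_size * hidden     # separate output (lm_head) matrix
    return total


# Values from the table above; vocab_size and intermediate are ASSUMED.
HIDDEN, LAYERS, HEADS = 576, 8, 8
ASSUMED_VOCAB = 32_000
ASSUMED_INTERMEDIATE = 1536

head_dim = HIDDEN // HEADS
total = llama_param_count(ASSUMED_VOCAB, HIDDEN, LAYERS, ASSUMED_INTERMEDIATE)
print(f"head_dim={head_dim}, params={total / 1e6:.1f}M")
```

Under these assumptions the estimate lands close to 50M, with a little over a third of the weights in the embedding matrix. The actual values are in the model's `config.json`, which takes precedence over this estimate.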

## Usage

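```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# The tokenizer is published in its own repository, separate from the model weights.
tokenizer = AutoTokenizer.from_pretrained("stukenov/kazakh-bpe-32k")
model = AutoModelForCausalLM.from_pretrained("stukenov/kazakh-llama-50m")

# Kazakh prompt: "Kazakhstan" followed by a dash, inviting the model to continue.
input_ids = tokenizer("Қазақстан — ", return_tensors="pt").input_ids

# Sample up to 50 new tokens; a temperature below 1.0 makes sampling less random.
output = model.generate(input_ids, max_new_tokens=50, do_sample=True, temperature=0.8)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```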
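
The `generate` call above samples with `temperature=0.8`. As a minimal illustration of what that setting does, the sketch below divides a made-up logit vector by the temperature before applying softmax. The helper names and logit values are hypothetical; this is plain Python for intuition, not the `transformers` implementation.

```python
import math
import random


def softmax_with_temperature(logits, temperature=1.0):
    """Turn raw scores into probabilities after dividing by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)                       # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_token(logits, temperature=1.0, rng=random):
    """Draw one token index according to the temperature-scaled distribution."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


logits = [2.0, 1.0, 0.1]  # made-up scores for three candidate tokens
for t in (1.0, 0.8):
    print(t, [round(p, 3) for p in softmax_with_temperature(logits, t)])
print("sampled index:", sample_token(logits, temperature=0.8))
```

A temperature below 1.0 sharpens the distribution toward the highest-scoring token, so outputs become more conservative; values above 1.0 flatten it and increase variety.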

## Training Details

- Trained from scratch on the multidomain Kazakh dataset (23.6M samples)
- Optimizer: AdamW
- Precision: bfloat16
- Hardware: NVIDIA A10 GPUs

## Project

Part of the [Soz — Kazakh Language Models](https://github.com/stukenov) project, a research effort to build open-source language models for Kazakh.

## Citation

```bibtex
@misc{tukenov2026soz,
  title={Soz: Small Language Models for Kazakh},
  author={Tukenov, Saken},
  year={2026},
  url={https://huggingface.co/stukenov/kazakh-llama-50m}
}
```

## License

Released under the Apache 2.0 license.