---
language:
- kk
license: apache-2.0
library_name: transformers
tags:
- llama
- kazakh
- causal-lm
- from-scratch
- soz
pipeline_tag: text-generation
---

# Kazakh LLaMA 50M (Soz)

A LLaMA-architecture language model with ~50M parameters trained from scratch on a large-scale Kazakh corpus.

## Overview

| Property | Value |
|----------|-------|
| **Parameters** | ~50M |
| **Architecture** | LLaMA (RoPE, SwiGLU, RMSNorm) |
| **Hidden dim** | 576 |
| **Layers** | 8 |
| **Attention heads** | 8 |
| **Training data** | [kz-transformers/multidomain-kazakh-dataset](https://huggingface.co/datasets/kz-transformers/multidomain-kazakh-dataset) |
| **Tokenizer** | [kazakh-bpe-32k](https://huggingface.co/stukenov/kazakh-bpe-32k) |
| **License** | Apache 2.0 |

## Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("stukenov/kazakh-bpe-32k")
model = AutoModelForCausalLM.from_pretrained("stukenov/kazakh-llama-50m")

input_ids = tokenizer("Қазақстан — ", return_tensors="pt").input_ids
output = model.generate(input_ids, max_new_tokens=50, do_sample=True, temperature=0.8)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```

## Training Details

- Trained from scratch on the multidomain Kazakh dataset (23.6M samples)
- Optimizer: AdamW
- Precision: bfloat16
- Hardware: NVIDIA A10 GPUs

## Project

Part of the [Soz — Kazakh Language Models](https://github.com/stukenov) project, a research effort to build open-source language models for Kazakh.

## Citation

```bibtex
@misc{tukenov2026soz,
  title={Soz: Small Language Models for Kazakh},
  author={Tukenov, Saken},
  year={2026},
  url={https://huggingface.co/stukenov/kazakh-llama-50m}
}
```

## License

Apache 2.0
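
## Parameter Count Estimate

As a rough sanity check on the ~50M figure in the Overview table, the parameter count can be estimated from the listed dimensions (hidden dim 576, 8 layers). This is a minimal sketch, not the model's actual configuration. It assumes a vocabulary of 32,000 tokens (inferred from the tokenizer name), an MLP intermediate size of 1536, tied input and output embeddings, full multi-head attention without grouped-query sharing, and no bias terms. None of these values are stated in this card, and `estimate_llama_params` is a hypothetical helper.

```python
def estimate_llama_params(vocab_size, hidden, layers, intermediate, tie_embeddings=True):
    """Estimate the parameter count of a LLaMA-style decoder (no bias terms)."""
    embed = vocab_size * hidden          # token embedding matrix
    attn = 4 * hidden * hidden           # q, k, v, and output projections
    mlp = 3 * hidden * intermediate      # SwiGLU: gate, up, and down projections
    norms = 2 * hidden                   # two RMSNorm weight vectors per layer
    per_layer = attn + mlp + norms
    final_norm = hidden                  # RMSNorm before the output head
    # With tied embeddings the output head reuses the embedding matrix.
    lm_head = 0 if tie_embeddings else vocab_size * hidden
    return embed + layers * per_layer + final_norm + lm_head


# Assumed values: vocab_size=32000 and intermediate=1536 are NOT from this card.
total = estimate_llama_params(vocab_size=32000, hidden=576, layers=8, intermediate=1536)
print(f"{total / 1e6:.1f}M parameters")
```

Under these assumptions the estimate comes to about 50.3M parameters, which is consistent with the ~50M figure. With untied embeddings the same dimensions would give about 68.7M. The check only shows that the stated dimensions are plausible; the published `config.json` remains the authoritative source.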