Kazakh LLaMA 50M — Balanced (Soz)

A LLaMA-architecture language model with ~50M parameters trained on a domain-balanced Kazakh corpus.

Overview

| Property | Value |
|---|---|
| Parameters | ~50M |
| Architecture | LLaMA (RoPE, SwiGLU, RMSNorm) |
| Training data | Domain-balanced Kazakh corpus |
| License | Apache 2.0 |
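
For readers unfamiliar with the three components named in the Architecture row, the following is a minimal pure-Python sketch of each one. It is illustrative only and is not this model's actual implementation: the function names, the `eps` value, the RoPE `base` of 10000, and the interleaved pairing of dimensions are assumptions (libraries differ in how they pair dimensions for RoPE), and the model's real hyperparameters are not stated in this card.

```python
import math

def rms_norm(x, weight, eps=1e-6):
    """RMSNorm: divide x by its root mean square, then apply a learned per-dimension gain."""
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [v * inv_rms * w for v, w in zip(x, weight)]

def silu(v):
    """SiLU (swish) activation: v * sigmoid(v)."""
    return v / (1.0 + math.exp(-v))

def matvec(matrix, vec):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vec)) for row in matrix]

def swiglu_ffn(x, w_gate, w_up, w_down):
    """SwiGLU feed-forward: w_down @ (silu(w_gate @ x) * (w_up @ x))."""
    gate = [silu(g) for g in matvec(w_gate, x)]
    up = matvec(w_up, x)
    hidden = [g * u for g, u in zip(gate, up)]
    return matvec(w_down, hidden)

def rope(x, position, base=10000.0):
    """Rotary position embedding: rotate each (even, odd) pair by a position-dependent angle."""
    d = len(x)
    out = list(x)
    for i in range(0, d, 2):
        theta = position * base ** (-i / d)  # lower dimensions rotate faster
        c, s = math.cos(theta), math.sin(theta)
        out[i] = x[i] * c - x[i + 1] * s
        out[i + 1] = x[i] * s + x[i + 1] * c
    return out
```

Because RoPE only rotates pairs of dimensions, it leaves the vector's length unchanged, and position 0 is the identity; RMSNorm with unit gain produces a vector whose root mean square is approximately 1.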

Usage

from transformers import AutoModelForCausalLM, AutoTokenizer

# The tokenizer and the model weights are loaded from two separate repositories.
tokenizer = AutoTokenizer.from_pretrained("stukenov/kazakh-gpt2-50k")
model = AutoModelForCausalLM.from_pretrained("stukenov/kazakh-llama-50m-balanced")

# Tokenize the prompt, then sample up to 50 new tokens at temperature 0.8.
inputs = tokenizer("Қазақстан — ", return_tensors="pt")
output = model.generate(inputs.input_ids, attention_mask=inputs.attention_mask, max_new_tokens=50, do_sample=True, temperature=0.8)
print(tokenizer.decode(output[0], skip_special_tokens=True))
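
The usage snippet samples with `do_sample=True` and `temperature=0.8`. As a rough illustration of what the temperature does, here is a minimal pure-Python sketch of temperature-scaled sampling over a list of logits. The helper names `softmax_with_temperature` and `sample_token` are hypothetical, and the sketch ignores any additional filtering (such as top-k or top-p) that the library may apply during generation.

```python
import math
import random

def softmax_with_temperature(logits, temperature=0.8):
    """Turn raw logits into probabilities after dividing them by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [logit / temperature for logit in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample_token(logits, temperature=0.8, rng=random):
    """Draw one token index at random according to the temperature-scaled probabilities."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

# Example: three candidate tokens with made-up logits.
example_logits = [2.0, 1.0, 0.1]
token_index = sample_token(example_logits, temperature=0.8, rng=random.Random(0))
```

Temperatures below 1 sharpen the distribution toward the highest-scoring token, while temperatures above 1 flatten it and make the output more varied.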

Training Details

  • Trained from scratch on a balanced multi-domain Kazakh corpus
  • Precision: bfloat16
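
To give a feel for what bfloat16 precision means numerically, the sketch below converts a Python float to the nearest-below bfloat16 value by keeping only the top 16 bits of its float32 encoding (1 sign bit, 8 exponent bits, 7 mantissa bits). This is an illustration under simplifying assumptions: the helper name `to_bfloat16` is hypothetical, and it truncates toward zero, whereas real hardware and libraries typically round to nearest.

```python
import struct

def to_bfloat16(x):
    """Approximate x in bfloat16 by truncating its float32 bit pattern to the top 16 bits."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]  # float32 bits as an unsigned int
    bits &= 0xFFFF0000  # drop the low 16 mantissa bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]

# Values exactly representable with 7 mantissa bits survive unchanged;
# others lose precision after roughly 2 to 3 significant decimal digits.
approx_tenth = to_bfloat16(0.1)
```

bfloat16 keeps the same exponent range as float32, so the trade-off is reduced precision rather than reduced dynamic range.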

Project

Part of the Soz — Kazakh Language Models project, a research effort to build open-source language models for Kazakh.

Citation

@misc{tukenov2026soz,
  title={Soz: Small Language Models for Kazakh},
  author={Tukenov, Saken},
  year={2026},
  url={https://huggingface.co/stukenov/kazakh-llama-50m-balanced}
}

License

Apache 2.0

Safetensors: model size 50.3M params, tensor type F32