---
language:
  - tg
  - ru
  - en
license: other
license_name: tajikgpt-proprietary
license_link: https://tajikgpt.com/license
tags:
  - text-generation
  - conversational
  - multilingual
  - tajik
  - central-asia
  - instruction-tuned
pipeline_tag: text-generation
base_model: []
extra_gated_prompt: >-
  This model is proprietary and available via API only.
---

# TJ-1.0

**TJ-1.0** is the flagship instruction-tuned language model of the [TajikGPT](https://tajikgpt.com) platform, developed by [SoulLab](https://soullab.space). It is the first commercially deployed large language model with native support for the **Tajik language**, offering a balanced combination of quality, speed, and multilingual capability.

> **Note:** TJ-1.0 is available via API only and is not available for download or local deployment.

---

## Model Details

| Property | Value |
|---|---|
| **Developer** | SoulLab |
| **Model type** | Instruction-tuned Causal Language Model |
| **Architecture** | Decoder-only Transformer with Grouped Query Attention (GQA) |
| **Positional Encoding** | Rotary Position Embedding (RoPE) |
| **Tokenizer** | Byte-Pair Encoding (BPE), extended vocabulary for Tajik Cyrillic & Latin |
| **Fine-tuning** | Supervised Fine-Tuning (SFT) + RLHF (Reinforcement Learning from Human Feedback) |
| **Context window** | 128,000 tokens |
| **Max output tokens** | 8,192 tokens |
| **Knowledge cutoff** | Q3 2024 |
| **Languages** | Tajik (tg), Russian (ru), English (en), and 50+ other languages |
| **License** | Proprietary — [TajikGPT Terms](https://tajikgpt.com/license) |
| **Training hardware** | NVIDIA A100 80GB, bf16 precision, PyTorch |

---

## Training Data

TJ-1.0 was trained on a curated multilingual corpus with a strong emphasis on Tajik-language content — the first dataset of this scale built specifically for Tajik NLP.

| Source Category | Description | Approx. Share |
|---|---|---|
| **Tajik Web Corpus** | News, blogs, forums, government portals in Tajik (Cyrillic & Latin) | 28% |
| **Tajik Literature & Culture** | Books, poetry, historical texts, folklore | 12% |
| **Tajik Legislation** | Laws, decrees, official government documents | 8% |
| **Multilingual Web** | High-quality filtered web data (Russian, English, and others) | 32% |
| **Instruction & Dialogue** | Human-written and synthetic instruction-following data | 14% |
| **Code** | Source code across major programming languages | 6% |

**Total corpus size:** ~2 trillion tokens  
**Data freshness:** Content up to Q3 2024  
**Processing:** Deduplication, quality filtering, language identification, PII removal applied to all sources.
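
To make the shares concrete, the percentages in the table can be converted into rough token counts. The sketch below uses only the approximate figures stated above (the ~2 trillion total and the per-category shares); the function name is illustrative and the results are estimates, not measured counts.

```python
# Approximate figures from the table above; the total is the card's own estimate.
TOTAL_TOKENS = 2_000_000_000_000

SHARES_PERCENT = {
    "Tajik Web Corpus": 28,
    "Tajik Literature & Culture": 12,
    "Tajik Legislation": 8,
    "Multilingual Web": 32,
    "Instruction & Dialogue": 14,
    "Code": 6,
}


def tokens_per_category(total, shares_percent):
    """Convert percentage shares into approximate token counts."""
    return {name: total * pct // 100 for name, pct in shares_percent.items()}


counts = tokens_per_category(TOTAL_TOKENS, SHARES_PERCENT)

# Only the three categories explicitly labeled as Tajik are summed here;
# other categories may also contain Tajik text.
tajik_total = sum(n for name, n in counts.items() if name.startswith("Tajik"))

for name, n in counts.items():
    print(f"{name}: ~{n / 1e9:.0f}B tokens")
print(f"Tajik-labeled sources combined: ~{tajik_total / 1e9:.0f}B tokens")
```

Under these figures, the three Tajik-labeled categories account for roughly 48% of the corpus.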

---

## Intended Use

### Recommended Use Cases
- Multilingual chat and Q&A in Tajik, Russian, English, and 50+ other languages
- Document summarization and translation
- Creative writing, content creation and copywriting
- Education, tutoring and homework help
- Business communication and professional correspondence
- Data analysis, extraction and summarization
- Code generation and debugging

### Out-of-Scope Use Cases
- Generation of illegal, harmful, or deceptive content
- Medical diagnosis or legal advice without professional oversight
- Surveillance or targeting of individuals
- Automated high-stakes decision-making without human review
- Any use violating the [TajikGPT Terms of Service](https://tajikgpt.com/terms)

---

## Evaluation / Benchmarks

Shot counts for the general benchmarks are listed in the table below; MT-Bench and HumanEval were run zero-shot.

### General Benchmarks

| Benchmark | Score | # Shots | Metric |
|---|---|---|---|
| **MMLU** (Massive Multitask Language Understanding) | 72.1% | 5-shot | Accuracy |
| **MT-Bench** (Multi-turn instruction following) | 7.1 / 10 | 0-shot | GPT-4 Judge |
| **HumanEval** (Code generation) | 58.3% | 0-shot | pass@1 |
| **HellaSwag** (Commonsense reasoning) | 81.4% | 10-shot | Accuracy |
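
This card does not state how many samples per problem were drawn for HumanEval. For readers reproducing a pass@1 figure, the sketch below shows the commonly used unbiased pass@k estimator, 1 − C(n−c, k) / C(n, k), where n is the number of samples per problem and c the number that pass the tests. The example counts are made up for illustration and are not TJ-1.0 results.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k for one problem: n samples drawn, c of them correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Every size-k subset of the samples contains at least one correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results, k):
    """Average pass@k over problems; results is a list of (n, c) pairs."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


# Made-up counts for three problems with 10 samples each.
example = [(10, 10), (10, 5), (10, 0)]
print(mean_pass_at_k(example, k=1))
```

With k = 1 the estimator reduces to the fraction of correct samples per problem, averaged over problems.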

### Tajik Language Benchmarks

> These are the first published benchmarks for Tajik-language LLM evaluation.

| Benchmark | Score | Description |
|---|---|---|
| **TajikQA** | 78.4% | Open-domain Q&A in Tajik language |
| **TajikTranslate** | 81.2 BLEU | Tajik ↔ Russian translation |
| **TajikInstruct** | 74.6% | Instruction following in Tajik |

---

## How to Use

TJ-1.0 is available via the TajikGPT API. Install the SDK or use the REST API directly.

```bash
pip install tajikgpt
```

### Python SDK

```python
from tajikgpt import TajikGPT

client = TajikGPT(api_key="sk-tj-your-key")

response = client.chat.completions.create(
    model="tj-1.0",
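    messages=[
        # System prompt (Russian): "You are a helpful assistant."
        {"role": "system", "content": "Ты полезный помощник."},
        # User prompt (Tajik): "Explain in Tajik: what is a neural network?"
        {"role": "user", "content": "Ба забони тоҷикӣ шарҳ деҳ: шабакаи нейронӣ чист?"}
    ]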
)
print(response.choices[0].message.content)
```
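
For multi-turn chat, the full message history is typically resent on each call. The helper below is a minimal sketch built only on the `client.chat.completions.create(...)` call and the `response.choices[0].message.content` field shown above. The `Conversation` class is hypothetical, and the `"assistant"` role name is an assumption based on the chat format used in the example; check the API docs before relying on it.

```python
class Conversation:
    """Keeps the running message history for a multi-turn chat."""

    def __init__(self, client, model="tj-1.0", system_prompt=None):
        self.client = client
        self.model = model
        self.messages = []
        if system_prompt:
            self.messages.append({"role": "system", "content": system_prompt})

    def ask(self, user_text):
        """Send one user turn and record the reply in the history."""
        self.messages.append({"role": "user", "content": user_text})
        response = self.client.chat.completions.create(
            model=self.model,
            messages=self.messages,
        )
        reply = response.choices[0].message.content
        # Assumed role name for model replies; not confirmed by this card.
        self.messages.append({"role": "assistant", "content": reply})
        return reply
```

A conversation would then be created with `Conversation(client, system_prompt="...")` and continued by calling `ask(...)` once per user turn. Since the history grows with every turn, long conversations will eventually approach the context window.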

### REST API

```bash
curl -X POST https://tajikgpt.com/api/tj/chat \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer sk-tj-your-key" \
  -d '{
    "model": "tj-1.0",
    "messages": [
      {"role": "user", "content": "Hello! What can you do?"}
    ],
    "max_tokens": 1024,
    "temperature": 0.7
  }'
```
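
The same request can be issued from Python without the SDK. The sketch below mirrors the curl example using only the standard library; the endpoint, headers, and payload fields are copied from that example, while the helper names and the assumption that the response body is JSON are illustrative.

```python
import json
import urllib.request

API_URL = "https://tajikgpt.com/api/tj/chat"  # endpoint from the curl example


def build_chat_request(api_key, user_text, max_tokens=1024, temperature=0.7):
    """Build (but do not send) the same POST request as the curl example."""
    payload = {
        "model": "tj-1.0",
        "messages": [{"role": "user", "content": user_text}],
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


def send(request, timeout=30):
    """Send the request and decode the body, assuming the API returns JSON."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

Separating request construction from sending makes the payload easy to inspect or log before any network call is made.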

---

## Limitations

1. **Dialectal Tajik:** The model performs best on standard literary Tajik (Cyrillic). Regional dialects and Latin-script Tajik may show reduced quality.
2. **Hallucinations:** Like all LLMs, TJ-1.0 may generate plausible-sounding but factually incorrect information. Always verify critical facts.
3. **Knowledge cutoff:** The model has no knowledge of events after Q3 2024.
4. **Mathematical reasoning:** Complex multi-step calculations may produce errors. Use dedicated tools for precise math.
5. **Low-resource languages:** While 50+ languages are supported, quality varies significantly for lower-resource languages.
6. **Long context degradation:** Performance on tasks requiring reasoning over very long documents (>64K tokens) may degrade.
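
One way to work around the long-context limitation is to split a long document into smaller pieces and process them separately. The sketch below packs paragraphs greedily under a token budget. It is an illustration only: the default token counter is a crude word count (the actual tokenizer is not exposed in this card), so pass your own `count_tokens` function or leave generous headroom.

```python
def split_into_chunks(text, max_tokens, count_tokens=None):
    """Greedily pack paragraphs into chunks of at most max_tokens each.

    A single paragraph larger than max_tokens is kept whole as its own
    oversized chunk rather than being cut mid-paragraph.
    """
    if count_tokens is None:
        # Crude stand-in: whitespace-separated words, not real tokens.
        def count_tokens(s):
            return len(s.split())

    chunks, current, current_size = [], [], 0
    for paragraph in text.split("\n\n"):
        if not paragraph.strip():
            continue  # skip empty paragraphs
        size = count_tokens(paragraph)
        if current and current_size + size > max_tokens:
            chunks.append("\n\n".join(current))
            current, current_size = [], 0
        current.append(paragraph)
        current_size += size
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

Each chunk can then be summarized or queried on its own, with the partial results combined in a final call.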

---

## Responsible AI & Safety

- **RLHF:** The model was fine-tuned using human preference data to align with helpful, harmless, and honest behavior.
- **Red Teaming:** Internal adversarial testing was conducted to identify failure modes in Tajik, Russian, and English.
- **Content Filtering:** The TajikGPT API includes a multi-layer content filtering system that operates independently of the model.
- **Bias:** Training data reflects the diversity of web content and may contain societal biases. Users should apply critical judgment when using outputs for sensitive decisions.
- **Privacy:** The training data was processed with PII (personally identifiable information) removal pipelines.

---

## Model Family

| Model | Context | Max Output | Specialty | Tier |
|---|---|---|---|---|
| **TJ-1.0 Mini** | 128K | 4,096 | Fast & lightweight | Free |
| **TJ-1.0** | 128K | 8,192 | Balanced — general purpose | Free |
| **TJ-1.0 Pro** | 128K | 16,384 | Advanced + Vision | Plus |
| **TJ-1.0 Ultra** | 128K | 32,768 | Top performance | Plus |
| **TJ-Coder** | 131K | 32,768 | Code specialist | Free |
| **TJ-Image 1.0** | — | — | Text-to-Image | Free |
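
The table can also be read as a simple lookup when choosing a model for a given response length. The sketch below copies the max-output and tier columns for the text models; the keys are the display names from the table, and the API model IDs (other than `tj-1.0`, shown earlier) are not given in this card and may differ.

```python
# Max output tokens per text model, copied from the table above.
MAX_OUTPUT_TOKENS = {
    "TJ-1.0 Mini": 4_096,
    "TJ-1.0": 8_192,
    "TJ-1.0 Pro": 16_384,
    "TJ-1.0 Ultra": 32_768,
    "TJ-Coder": 32_768,
}
FREE_TIER = {"TJ-1.0 Mini", "TJ-1.0", "TJ-Coder"}


def models_that_fit(needed_output_tokens, free_only=False):
    """List models whose max output covers the request, smallest limit first."""
    candidates = [
        (limit, name)
        for name, limit in MAX_OUTPUT_TOKENS.items()
        if limit >= needed_output_tokens and (not free_only or name in FREE_TIER)
    ]
    return [name for _, name in sorted(candidates)]


print(models_that_fit(10_000))
print(models_that_fit(10_000, free_only=True))
```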

---

## Links

- **Platform:** [tajikgpt.com](https://tajikgpt.com)
- **API Docs:** [tajikgpt.com/docs](https://tajikgpt.com/docs)
- **Python SDK:** [pypi.org/project/tajikgpt](https://pypi.org/project/tajikgpt/)
- **Live Demo:** [Hugging Face Space](https://huggingface.co/spaces/TajikGPT-Team/tajikgpt)
- **Developer:** [SoulLab](https://soullab.space)

---

## Citation

If you use TJ-1.0 in research or build products on top of it, please cite:

```bibtex
@misc{tajikgpt2024tj10,
  title        = {TJ-1.0: A Multilingual Large Language Model with Native Tajik Language Support},
  author       = {SoulLab},
  year         = {2024},
  howpublished = {\url{https://tajikgpt.com}},
  note         = {Proprietary model, available via API at https://tajikgpt.com}
}
```

---

*Built with care for Tajikistan and Central Asia. Developed by [SoulLab](https://soullab.space).*