---
language:
- tg
- ru
- en
license: other
license_name: tajikgpt-proprietary
license_link: https://tajikgpt.com/license
tags:
- text-generation
- conversational
- multilingual
- tajik
- central-asia
- instruction-tuned
pipeline_tag: text-generation
base_model: []
extra_gated_prompt: >-
  This model is proprietary and available via API only.
---

# TJ-1.0

**TJ-1.0** is the flagship instruction-tuned language model of the [TajikGPT](https://tajikgpt.com) platform, developed by [SoulLab](https://soullab.space). It is the first commercially deployed large language model with native support for the **Tajik language**, and it is the general-purpose model of its family, balancing quality, speed, and multilingual capability.

> **Note:** TJ-1.0 is served through the API only. The weights cannot be downloaded or deployed locally.

---

## Model Details

| Property | Value |
|---|---|
| **Developer** | SoulLab |
| **Model type** | Instruction-tuned causal language model |
| **Architecture** | Decoder-only Transformer with Grouped Query Attention (GQA) |
| **Positional encoding** | Rotary Position Embedding (RoPE) |
| **Tokenizer** | Byte-Pair Encoding (BPE) with a vocabulary extended for Tajik in Cyrillic and Latin scripts |
| **Fine-tuning** | Supervised Fine-Tuning (SFT) and Reinforcement Learning from Human Feedback (RLHF) |
| **Context window** | 128,000 tokens |
| **Max output tokens** | 8,192 tokens |
| **Knowledge cutoff** | Q3 2024 |
| **Languages** | Tajik (tg), Russian (ru), English (en), and 50+ other languages |
| **License** | Proprietary, see [TajikGPT Terms](https://tajikgpt.com/license) |
| **Training setup** | NVIDIA A100 80GB GPUs, bf16 precision, PyTorch |

---

## Training Data

TJ-1.0 was trained on a curated multilingual corpus with a strong emphasis on Tajik-language content. This is the first dataset of this scale built specifically for Tajik NLP. The three Tajik-specific categories below together account for roughly 48% of the corpus.

| Source Category | Description | Approx. Share |
|---|---|---|
| **Tajik Web Corpus** | News, blogs, forums, government portals in Tajik (Cyrillic & Latin) | 28% |
| **Tajik Literature & Culture** | Books, poetry, historical texts, folklore | 12% |
| **Tajik Legislation** | Laws, decrees, official government documents | 8% |
| **Multilingual Web** | High-quality filtered web data (Russian, English, and others) | 32% |
| **Instruction & Dialogue** | Human-written and synthetic instruction-following data | 14% |
| **Code** | Source code across major programming languages | 6% |

- **Total corpus size:** ~2 trillion tokens
- **Data freshness:** Content up to Q3 2024
- **Processing:** Deduplication, quality filtering, language identification, and PII removal were applied to all sources.

---

## Intended Use

### Recommended Use Cases
- Multilingual chat and Q&A in Tajik, Russian, English, and 50+ other languages
- Document summarization and translation
- Creative writing, content creation, and copywriting
- Education, tutoring, and homework help
- Business communication and professional correspondence
- Data analysis and information extraction
- Code generation and debugging

### Out-of-Scope Use Cases
- Generation of illegal, harmful, or deceptive content
- Medical diagnosis or legal advice without professional oversight
- Surveillance or targeting of individuals
- Automated high-stakes decision-making without human review
- Any use that violates the [TajikGPT Terms of Service](https://tajikgpt.com/terms)

---

## Evaluation / Benchmarks

Benchmarks were run in standard settings. The shot count used for each general benchmark is listed in its table row; MT-Bench and HumanEval are zero-shot.

### General Benchmarks

| Benchmark | Score | Shots | Metric |
|---|---|---|---|
| **MMLU** (Massive Multitask Language Understanding) | 72.1% | 5-shot | Accuracy |
| **MT-Bench** (Multi-turn instruction following) | 7.1 / 10 | 0-shot | GPT-4 judge score |
| **HumanEval** (Code generation) | 58.3% | 0-shot | pass@1 |
| **HellaSwag** (Commonsense reasoning) | 81.4% | 10-shot | Accuracy |
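
The card does not say how the HumanEval pass@1 figure was computed. For reference, a common choice is the unbiased pass@k estimator that was introduced alongside HumanEval, `1 - C(n - c, k) / C(n, k)`, where `n` samples are drawn per problem and `c` of them pass the unit tests. The sketch below implements that estimator under this assumption; the sample counts in the usage line are invented for illustration and are not taken from the TJ-1.0 evaluation.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k estimate for one problem: n samples drawn, c of them correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k samples are incorrect.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results, k):
    """Average pass@k over problems; results is a list of (n, c) pairs."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


# Illustrative counts only: three problems, 10 samples each.
print(round(mean_pass_at_k([(10, 10), (10, 5), (10, 0)], k=1), 3))  # 0.5
```

With `k = 1` the estimator reduces to `c / n`, the fraction of samples that pass.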

### Tajik Language Benchmarks

> These are the first published benchmarks for Tajik-language LLM evaluation.

| Benchmark | Score | Description |
|---|---|---|
| **TajikQA** | 78.4% | Open-domain Q&A in Tajik |
| **TajikTranslate** | 81.2 BLEU | Tajik ↔ Russian translation |
| **TajikInstruct** | 74.6% | Instruction following in Tajik |

---

## How to Use

TJ-1.0 is available via the TajikGPT API. Install the Python SDK, or call the REST API directly.

```bash
pip install tajikgpt
```

### Python SDK

```python
from tajikgpt import TajikGPT

# Replace the placeholder with your own API key.
client = TajikGPT(api_key="sk-tj-your-key")

response = client.chat.completions.create(
    model="tj-1.0",
    messages=[
        # System prompt (Russian): "You are a helpful assistant."
        {"role": "system", "content": "Ты полезный помощник."},
        # User prompt (Tajik): "Explain in Tajik: what is a neural network?"
        {"role": "user", "content": "Ба забони тоҷикӣ шарҳ деҳ: нейронӣ шабака чист?"},
    ],
)
# Print the text of the first returned choice.
print(response.choices[0].message.content)
```
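
For multi-turn chat, the example above can be extended by keeping the message history on the caller's side. The following is a minimal sketch, assuming the API is stateless and expects the full history on every call (the card does not confirm this). `Conversation` and `StubClient` are hypothetical helpers, not part of the SDK; the stub only mimics the call shape shown above so the sketch runs offline.

```python
from types import SimpleNamespace


class Conversation:
    """Keeps the running message list for a multi-turn chat (hypothetical helper)."""

    def __init__(self, client, model="tj-1.0", system_prompt=None):
        self.client = client
        self.model = model
        self.messages = []
        if system_prompt:
            self.messages.append({"role": "system", "content": system_prompt})

    def ask(self, user_text):
        # Send the full history so the model sees earlier turns.
        self.messages.append({"role": "user", "content": user_text})
        response = self.client.chat.completions.create(
            model=self.model,
            messages=self.messages,
        )
        reply = response.choices[0].message.content
        # Store the reply so the next call includes it as context.
        self.messages.append({"role": "assistant", "content": reply})
        return reply


class StubClient:
    """Offline stand-in that mimics the call shape used above (not the real SDK)."""

    def __init__(self):
        self.chat = SimpleNamespace(completions=SimpleNamespace(create=self._create))

    def _create(self, model, messages):
        # Echo the last user message back inside an SDK-shaped response object.
        message = SimpleNamespace(content="echo: " + messages[-1]["content"])
        return SimpleNamespace(choices=[SimpleNamespace(message=message)])


chat = Conversation(StubClient(), system_prompt="You are a helpful assistant.")
print(chat.ask("Salom!"))   # echo: Salom!
print(len(chat.messages))   # 3 (system, user, assistant)
```

To talk to the real service, pass a `TajikGPT(api_key=...)` client in place of `StubClient()`.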

### REST API

```bash
curl -X POST https://tajikgpt.com/api/tj/chat \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer sk-tj-your-key" \
  -d '{
    "model": "tj-1.0",
    "messages": [
      {"role": "user", "content": "Hello! What can you do?"}
    ],
    "max_tokens": 1024,
    "temperature": 0.7
  }'
```
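
The same request can be issued from Python without the SDK. This is a minimal sketch using only the standard library; it reuses the endpoint, headers, and fields from the curl example. The card does not document the response schema, so `send` simply returns the decoded JSON for inspection rather than assuming any field names. `build_chat_request` and `send` are illustrative helper names, and the network call is left commented out.

```python
import json
import urllib.request

API_URL = "https://tajikgpt.com/api/tj/chat"  # endpoint from the curl example


def build_chat_request(api_key, messages, model="tj-1.0", max_tokens=1024, temperature=0.7):
    """Build (but do not send) the POST request shown in the curl example."""
    payload = {
        "model": model,
        "messages": messages,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


def send(request, timeout=30):
    """Send the request and return the decoded JSON body."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


request = build_chat_request(
    "sk-tj-your-key",
    [{"role": "user", "content": "Hello! What can you do?"}],
)
print(request.get_method(), request.full_url)
# To call the API, uncomment the next line and inspect the raw response:
# print(send(request))
```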

---

## Limitations

1. **Dialectal Tajik:** The model performs best on standard literary Tajik written in Cyrillic. Regional dialects and Latin-script Tajik may show reduced quality.
2. **Hallucinations:** Like all LLMs, TJ-1.0 may generate plausible-sounding but factually incorrect information. Always verify critical facts.
3. **Knowledge cutoff:** The model has no knowledge of events after Q3 2024.
4. **Mathematical reasoning:** Complex multi-step calculations may contain errors. Use a calculator or another dedicated tool when exact results matter.
5. **Low-resource languages:** Although 50+ languages are supported, quality varies significantly for lower-resource languages.
6. **Long context degradation:** Quality may degrade on tasks that require reasoning over very long inputs (more than 64K tokens), even though the context window is 128,000 tokens.
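
One way to work around the long-context limitation is to split long inputs into chunks that stay below the 64K-token mark and process them separately. The sketch below is a rough illustration, not a recommended pipeline: the TJ-1.0 tokenizer is not public, so token counts are estimated from character counts, and the default `chars_per_token=2.0` is a hypothetical, deliberately conservative placeholder. Chunks break on paragraph boundaries where possible; a single paragraph that is too long on its own is cut at a fixed character offset.

```python
def split_for_context(text, max_tokens=64_000, chars_per_token=2.0):
    """Split text into chunks whose estimated token count stays under max_tokens.

    chars_per_token is a rough, hypothetical ratio; the real tokenizer is not public.
    """
    max_chars = int(max_tokens * chars_per_token)
    chunks, current, current_len = [], [], 0
    for paragraph in text.split("\n\n"):
        # Hard-split any single paragraph that is itself too long.
        pieces = [paragraph[i:i + max_chars]
                  for i in range(0, len(paragraph), max_chars)] or [""]
        for piece in pieces:
            added = len(piece) + (2 if current else 0)  # 2 chars for the "\n\n" joiner
            if current and current_len + added > max_chars:
                # Current chunk is full: emit it and start a new one.
                chunks.append("\n\n".join(current))
                current, current_len = [], 0
                added = len(piece)
            current.append(piece)
            current_len += added
    if current:
        chunks.append("\n\n".join(current))
    return chunks


# Tiny limit (10 tokens * 2 chars = 20 chars) so the split is visible.
print(split_for_context("first paragraph\n\nsecond paragraph",
                        max_tokens=10, chars_per_token=2.0))
# ['first paragraph', 'second paragraph']
```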

---

## Responsible AI & Safety

- **RLHF:** The model was fine-tuned on human preference data to align it with helpful, harmless, and honest behavior.
- **Red Teaming:** Internal adversarial testing was conducted to identify failure modes in Tajik, Russian, and English.
- **Content Filtering:** The TajikGPT API includes a multi-layer content filtering system that operates independently of the model.
- **Bias:** The training data reflects the diversity of web content and may contain societal biases. Users should apply critical judgment when using outputs for sensitive decisions.
- **Privacy:** The training data was processed with pipelines that remove personally identifiable information (PII).

---

## Model Family

| Model | Context (tokens) | Max output (tokens) | Specialty | Tier |
|---|---|---|---|---|
| **TJ-1.0 Mini** | 128K | 4,096 | Fast & lightweight | Free |
| **TJ-1.0** | 128K | 8,192 | Balanced, general purpose | Free |
| **TJ-1.0 Pro** | 128K | 16,384 | Advanced + Vision | Plus |
| **TJ-1.0 Ultra** | 128K | 32,768 | Top performance | Plus |
| **TJ-Coder** | 131K | 32,768 | Code specialist | Free |
| **TJ-Image 1.0** | N/A | N/A | Text-to-Image | Free |
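
When the same client code targets several models in the family, the output caps in the table can be applied before sending a request. The following is a small sketch of that idea. It is keyed by the display names from the table because the card gives an API identifier only for `tj-1.0`; `clamp_max_tokens` is a hypothetical helper, and how the API itself reacts to an oversized `max_tokens` value is not documented here.

```python
# Max output tokens per text model, copied from the Model Family table.
MAX_OUTPUT_TOKENS = {
    "TJ-1.0 Mini": 4_096,
    "TJ-1.0": 8_192,
    "TJ-1.0 Pro": 16_384,
    "TJ-1.0 Ultra": 32_768,
    "TJ-Coder": 32_768,
}


def clamp_max_tokens(model_name, requested):
    """Return a max_tokens value that does not exceed the model's listed output cap."""
    if requested < 1:
        raise ValueError("requested must be a positive integer")
    try:
        cap = MAX_OUTPUT_TOKENS[model_name]
    except KeyError:
        raise ValueError(f"no output cap listed for {model_name!r}") from None
    return min(requested, cap)


print(clamp_max_tokens("TJ-1.0", 20_000))      # 8192
print(clamp_max_tokens("TJ-1.0 Mini", 1_024))  # 1024
```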

---

## Links

- **Platform:** [tajikgpt.com](https://tajikgpt.com)
- **API Docs:** [tajikgpt.com/docs](https://tajikgpt.com/docs)
- **Python SDK:** [pypi.org/project/tajikgpt](https://pypi.org/project/tajikgpt/)
- **Live Demo:** [Hugging Face Space](https://huggingface.co/spaces/TajikGPT-Team/tajikgpt)
- **Developer:** [SoulLab](https://soullab.space)

---

## Citation

If you use TJ-1.0 in research or build products on top of it, please cite:

```bibtex
@misc{tajikgpt2024tj10,
  title        = {TJ-1.0: A Multilingual Large Language Model with Native Tajik Language Support},
  author       = {SoulLab},
  year         = {2024},
  howpublished = {\url{https://tajikgpt.com}},
  note         = {Proprietary model, available via API at https://tajikgpt.com}
}
```

---

*Built with care for Tajikistan and Central Asia. Developed by [SoulLab](https://soullab.space).*