---
language:
- tg
- ru
- en
license: other
license_name: tajikgpt-proprietary
license_link: https://tajikgpt.com/license
tags:
- text-generation
- conversational
- multilingual
- tajik
- central-asia
- instruction-tuned
pipeline_tag: text-generation
base_model: []
extra_gated_prompt: >-
  This model is proprietary and available via API only.
---

# TJ-1.0

**TJ-1.0** is the flagship instruction-tuned language model of the [TajikGPT](https://tajikgpt.com) platform, developed by [SoulLab](https://soullab.space). It is the first commercially deployed large language model with native support for the **Tajik language**, offering a balanced combination of quality, speed, and multilingual capability.

> **Note:** TJ-1.0 is available via API only. It cannot be downloaded or deployed locally.

---

## Model Details

| Property | Value |
|---|---|
| **Developer** | SoulLab |
| **Model type** | Instruction-tuned Causal Language Model |
| **Architecture** | Decoder-only Transformer with Grouped Query Attention (GQA) |
| **Positional Encoding** | Rotary Position Embedding (RoPE) |
| **Tokenizer** | Byte-Pair Encoding (BPE), extended vocabulary for Tajik Cyrillic & Latin |
| **Fine-tuning** | Supervised Fine-Tuning (SFT) + RLHF (Reinforcement Learning from Human Feedback) |
| **Context window** | 128,000 tokens |
| **Max output tokens** | 8,192 tokens |
| **Knowledge cutoff** | Q3 2024 |
| **Languages** | Tajik (tg), Russian (ru), English (en), and 50+ other languages |
| **License** | Proprietary ([TajikGPT Terms](https://tajikgpt.com/license)) |
| **Training hardware** | NVIDIA A100 80GB, bf16 precision, PyTorch |

---

## Training Data

TJ-1.0 was trained on a curated multilingual corpus with a strong emphasis on Tajik-language content, the first dataset of this scale built specifically for Tajik NLP.

| Source Category | Description | Approx. Share |
|---|---|---|
| **Tajik Web Corpus** | News, blogs, forums, government portals in Tajik (Cyrillic & Latin) | 28% |
| **Tajik Literature & Culture** | Books, poetry, historical texts, folklore | 12% |
| **Tajik Legislation** | Laws, decrees, official government documents | 8% |
| **Multilingual Web** | High-quality filtered web data (Russian, English, and others) | 32% |
| **Instruction & Dialogue** | Human-written and synthetic instruction-following data | 14% |
| **Code** | Source code across major programming languages | 6% |

- **Total corpus size:** ~2 trillion tokens
- **Data freshness:** Content up to Q3 2024
- **Processing:** Deduplication, quality filtering, language identification, and PII removal were applied to all sources.

---

## Intended Use

### Recommended Use Cases

- Multilingual chat and Q&A in Tajik, Russian, English, and 50+ other languages
- Document summarization and translation
- Creative writing, content creation, and copywriting
- Education, tutoring, and homework help
- Business communication and professional correspondence
- Data analysis, extraction, and summarization
- Code generation and debugging

### Out-of-Scope Use Cases

- Generation of illegal, harmful, or deceptive content
- Medical diagnosis or legal advice without professional oversight
- Surveillance or targeting of individuals
- Automated high-stakes decision-making without human review
- Any use violating the [TajikGPT Terms of Service](https://tajikgpt.com/terms)

---

## Evaluation / Benchmarks

All benchmarks were evaluated using standard few-shot settings unless otherwise noted.
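
For readers unfamiliar with the term, a k-shot evaluation prepends k solved examples to each test question before the model answers it. The sketch below illustrates the idea in Python. The `build_few_shot_prompt` helper, the `Question:`/`Answer:` template, and the sample items are hypothetical and are not taken from SoulLab's evaluation harness, whose exact prompts are not published here.

```python
from typing import Sequence, Tuple


def build_few_shot_prompt(
    examples: Sequence[Tuple[str, str]],
    question: str,
    k: int,
) -> str:
    """Build a k-shot prompt: k solved examples followed by the open question.

    Hypothetical helper for illustration only; the template is an assumption.
    """
    if k < 0 or k > len(examples):
        raise ValueError("k must be between 0 and the number of examples")
    # Each demonstration shows a question together with its reference answer.
    blocks = [f"Question: {q}\nAnswer: {a}" for q, a in examples[:k]]
    # The final block leaves the answer empty for the model to complete.
    blocks.append(f"Question: {question}\nAnswer:")
    return "\n\n".join(blocks)


# Made-up demonstration items, not drawn from any benchmark.
demos = [
    ("What is the capital of Tajikistan?", "Dushanbe"),
    ("What is 2 + 3?", "5"),
]

zero_shot = build_few_shot_prompt(demos, "What is the capital of France?", k=0)
two_shot = build_few_shot_prompt(demos, "What is the capital of France?", k=2)
print(two_shot)
```

With `k=0` the prompt contains only the open question. Larger values of `k` add more demonstrations at the cost of a longer prompt.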

### General Benchmarks

| Benchmark | Score | # Shots | Metric |
|---|---|---|---|
| **MMLU** (Massive Multitask Language Understanding) | 72.1% | 5-shot | Accuracy |
| **MT-Bench** (Multi-turn instruction following) | 7.1 / 10 | 0-shot | GPT-4 Judge |
| **HumanEval** (Code generation) | 58.3% | 0-shot | pass@1 |
| **HellaSwag** (Commonsense reasoning) | 81.4% | 10-shot | Accuracy |

### Tajik Language Benchmarks

> These are the first published benchmarks for Tajik-language LLM evaluation.

| Benchmark | Score | Description |
|---|---|---|
| **TajikQA** | 78.4% | Open-domain Q&A in Tajik |
| **TajikTranslate** | 81.2 BLEU | Tajik ↔ Russian translation |
| **TajikInstruct** | 74.6% | Instruction following in Tajik |

---

## How to Use

TJ-1.0 is available via the TajikGPT API. Install the SDK or call the REST API directly.

```bash
pip install tajikgpt
```

### Python SDK

```python
from tajikgpt import TajikGPT

# Replace the placeholder with your own API key.
client = TajikGPT(api_key="sk-tj-your-key")

response = client.chat.completions.create(
    model="tj-1.0",
    messages=[
        # System prompt (Russian): "You are a helpful assistant."
        {"role": "system", "content": "Ты полезный помощник."},
        # User prompt (Tajik): "Explain in Tajik: what is a neural network?"
        {"role": "user", "content": "Ба забони тоҷикӣ шарҳ деҳ: нейронӣ шабака чист?"}
    ]
)

print(response.choices[0].message.content)
```

### REST API

```bash
curl -X POST https://tajikgpt.com/api/tj/chat \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer sk-tj-your-key" \
  -d '{
    "model": "tj-1.0",
    "messages": [
      {"role": "user", "content": "Hello! What can you do?"}
    ],
    "max_tokens": 1024,
    "temperature": 0.7
  }'
```

---

## Limitations

1. **Dialectal Tajik:** The model performs best on standard literary Tajik (Cyrillic). Regional dialects and Latin-script Tajik may show reduced quality.
2. **Hallucinations:** Like all LLMs, TJ-1.0 may generate plausible-sounding but factually incorrect information. Always verify critical facts.
3. **Knowledge cutoff:** The model has no knowledge of events after Q3 2024.
4. **Mathematical reasoning:** Complex multi-step calculations may produce errors. Use dedicated tools for precise math.
5. **Low-resource languages:** While 50+ languages are supported, quality varies significantly for lower-resource languages.
6. **Long context degradation:** Performance on tasks requiring reasoning over very long documents (>64K tokens) may degrade.

---

## Responsible AI & Safety

- **RLHF:** The model was fine-tuned using human preference data to align with helpful, harmless, and honest behavior.
- **Red Teaming:** Internal adversarial testing was conducted to identify failure modes in Tajik, Russian, and English.
- **Content Filtering:** The TajikGPT API includes a multi-layer content filtering system that operates independently of the model.
- **Bias:** Training data reflects the diversity of web content and may contain societal biases. Users should apply critical judgment when using outputs for sensitive decisions.
- **Privacy:** The training data was processed with PII (personally identifiable information) removal pipelines.

---

## Model Family

| Model | Context | Max Output | Specialty | Tier |
|---|---|---|---|---|
| **TJ-1.0 Mini** | 128K | 4,096 | Fast & lightweight | Free |
| **TJ-1.0** | 128K | 8,192 | Balanced, general purpose | Free |
| **TJ-1.0 Pro** | 128K | 16,384 | Advanced + Vision | Plus |
| **TJ-1.0 Ultra** | 128K | 32,768 | Top performance | Plus |
| **TJ-Coder** | 131K | 32,768 | Code specialist | Free |
| **TJ-Image 1.0** | N/A | N/A | Text-to-Image | Free |

---

## Links

- **Platform:** [tajikgpt.com](https://tajikgpt.com)
- **API Docs:** [tajikgpt.com/docs](https://tajikgpt.com/docs)
- **Python SDK:** [pypi.org/project/tajikgpt](https://pypi.org/project/tajikgpt/)
- **Live Demo:** [HuggingFace Space](https://huggingface.co/spaces/TajikGPT-Team/tajikgpt)
- **Developer:** [SoulLab](https://soullab.space)

---

## Citation

If you use TJ-1.0 in research or build products on top of it, please cite:

```bibtex
@misc{tajikgpt2024tj10,
  title        = {TJ-1.0: A Multilingual Large Language Model with Native Tajik Language Support},
  author       = {SoulLab},
  year         = {2024},
  howpublished = {\url{https://tajikgpt.com}},
  note         = {Proprietary model, available via API at https://tajikgpt.com}
}
```

---

*Built with care for Tajikistan and Central Asia. Developed by [SoulLab](https://soullab.space).*