---
language:
- tg
- ru
- en
license: other
license_name: tajikgpt-proprietary
license_link: https://tajikgpt.com/license
tags:
- text-generation
- conversational
- multilingual
- tajik
- central-asia
- instruction-tuned
pipeline_tag: text-generation
base_model: []
extra_gated_prompt: >-
  This model is proprietary and available via API only.
---
# TJ-1.0
**TJ-1.0** is the flagship instruction-tuned language model of the [TajikGPT](https://tajikgpt.com) platform, developed by [SoulLab](https://soullab.space). It is the first commercially deployed large language model with native support for the **Tajik language**, offering a balanced combination of quality, speed, and multilingual capability.
> **Note:** TJ-1.0 is accessible only through the API; it cannot be downloaded or deployed locally.
---
## Model Details
| Property | Value |
|---|---|
| **Developer** | SoulLab |
| **Model type** | Instruction-tuned Causal Language Model |
| **Architecture** | Decoder-only Transformer with Grouped Query Attention (GQA) |
| **Positional Encoding** | Rotary Position Embedding (RoPE) |
| **Tokenizer** | Byte-Pair Encoding (BPE), extended vocabulary for Tajik Cyrillic & Latin |
| **Fine-tuning** | Supervised Fine-Tuning (SFT) + RLHF (Reinforcement Learning from Human Feedback) |
| **Context window** | 128,000 tokens |
| **Max output tokens** | 8,192 tokens |
| **Knowledge cutoff** | Q3 2024 |
| **Languages** | Tajik (tg), Russian (ru), English (en), and 50+ other languages |
| **License** | Proprietary — [TajikGPT Terms](https://tajikgpt.com/license) |
| **Training hardware** | NVIDIA A100 80GB, bf16 precision, PyTorch |
---
## Training Data
TJ-1.0 was trained on a curated multilingual corpus with a strong emphasis on Tajik-language content — the first dataset of this scale built specifically for Tajik NLP.
| Source Category | Description | Approx. Share |
|---|---|---|
| **Tajik Web Corpus** | News, blogs, forums, government portals in Tajik (Cyrillic & Latin) | 28% |
| **Tajik Literature & Culture** | Books, poetry, historical texts, folklore | 12% |
| **Tajik Legislation** | Laws, decrees, official government documents | 8% |
| **Multilingual Web** | High-quality filtered web data (Russian, English, and others) | 32% |
| **Instruction & Dialogue** | Human-written and synthetic instruction-following data | 14% |
| **Code** | Source code across major programming languages | 6% |
- **Total corpus size:** ~2 trillion tokens
- **Data freshness:** Content up to Q3 2024
- **Processing:** Deduplication, quality filtering, language identification, and PII removal were applied to all sources.
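
To make the share column concrete, the following sketch converts the approximate percentages in the table into rough token counts, using the stated total of about 2 trillion tokens. Both inputs are approximations from this card, so the results are order-of-magnitude figures only. The "Tajik-specific" total counts just the three categories named as Tajik; any Tajik text inside the instruction or multilingual data is not included.

```python
# Approximate shares from the Training Data table (percent of the corpus).
SHARES = {
    "Tajik Web Corpus": 28,
    "Tajik Literature & Culture": 12,
    "Tajik Legislation": 8,
    "Multilingual Web": 32,
    "Instruction & Dialogue": 14,
    "Code": 6,
}

# "~2 trillion tokens" from the card; treated here as an exact number
# purely for the sake of the arithmetic.
TOTAL_TOKENS = 2_000_000_000_000


def tokens_per_category(shares=SHARES, total=TOTAL_TOKENS):
    """Return the approximate token count for each source category."""
    return {name: total * pct // 100 for name, pct in shares.items()}


def tajik_share(shares=SHARES):
    """Sum the shares of the categories whose names start with 'Tajik'."""
    return sum(pct for name, pct in shares.items() if name.startswith("Tajik"))


if __name__ == "__main__":
    for name, count in tokens_per_category().items():
        print(f"{name}: ~{count / 1e9:.0f}B tokens")
    print(f"Tajik-specific share: {tajik_share()}%")
```

Under these assumptions, the three Tajik categories together account for roughly 48% of the corpus, or a little under 1 trillion tokens.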
---
## Intended Use
### Recommended Use Cases
- Multilingual chat and Q&A in Tajik, Russian, English, and 50+ other languages
- Document summarization and translation
- Creative writing, content creation and copywriting
- Education, tutoring and homework help
- Business communication and professional correspondence
- Data analysis, extraction and summarization
- Code generation and debugging
### Out-of-Scope Use Cases
- Generation of illegal, harmful, or deceptive content
- Medical diagnosis or legal advice without professional oversight
- Surveillance or targeting of individuals
- Automated high-stakes decision-making without human review
- Any use violating the [TajikGPT Terms of Service](https://tajikgpt.com/terms)
---
## Evaluation / Benchmarks
Benchmarks use standard few-shot settings unless otherwise noted; the General Benchmarks table lists the shot count for each task, including the zero-shot MT-Bench and HumanEval runs.
### General Benchmarks
| Benchmark | Score | # Shots | Metric |
|---|---|---|---|
| **MMLU** (Massive Multitask Language Understanding) | 72.1% | 5-shot | Accuracy |
| **MT-Bench** (Multi-turn instruction following) | 7.1 / 10 | 0-shot | GPT-4 Judge |
| **HumanEval** (Code generation) | 58.3% | 0-shot | pass@1 |
| **HellaSwag** (Commonsense reasoning) | 81.4% | 10-shot | Accuracy |
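
This card does not say how the HumanEval pass@1 figure was computed. For readers unfamiliar with the metric, here is a minimal sketch of the unbiased pass@k estimator commonly used with HumanEval: for each problem, draw `n` samples, count the `c` that pass the unit tests, and estimate the chance that at least one of `k` samples passes. The function names are illustrative and are not part of any TajikGPT tooling.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k for one problem: n samples drawn, c of them correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect samples: every draw of k contains a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results, k=1):
    """Average pass@k over problems; results is a list of (n, c) pairs."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

With `k = 1` the estimator reduces to the fraction of correct samples per problem, averaged over all problems.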
### Tajik Language Benchmarks
> These are the first published benchmarks for Tajik-language LLM evaluation.
| Benchmark | Score | Description |
|---|---|---|
| **TajikQA** | 78.4% | Open-domain Q&A in Tajik language |
| **TajikTranslate** | 81.2 BLEU | Tajik ↔ Russian translation |
| **TajikInstruct** | 74.6% | Instruction following in Tajik |
---
## How to Use
TJ-1.0 is available via the TajikGPT API. Install the SDK or use the REST API directly.
```bash
pip install tajikgpt
```
### Python SDK
```python
from tajikgpt import TajikGPT

client = TajikGPT(api_key="sk-tj-your-key")

response = client.chat.completions.create(
    model="tj-1.0",
    messages=[
        # System prompt (Russian): "You are a helpful assistant."
        {"role": "system", "content": "Ты полезный помощник."},
        # User prompt (Tajik): "Explain in Tajik: what is a neural network?"
        {"role": "user", "content": "Ба забони тоҷикӣ шарҳ деҳ: нейронӣ шабака чист?"},
    ],
)

print(response.choices[0].message.content)
```
### REST API
```bash
curl -X POST https://tajikgpt.com/api/tj/chat \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer sk-tj-your-key" \
  -d '{
    "model": "tj-1.0",
    "messages": [
      {"role": "user", "content": "Hello! What can you do?"}
    ],
    "max_tokens": 1024,
    "temperature": 0.7
  }'
```
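
For environments where installing the SDK is not an option, the same REST call can be made with Python's standard library. This sketch mirrors the curl example above (same endpoint, headers, and JSON fields). The response schema is not documented in this card, so `send` simply returns the decoded JSON; the helper names `build_request` and `send` are illustrative.

```python
import json
import urllib.request

API_URL = "https://tajikgpt.com/api/tj/chat"  # endpoint from the curl example


def build_request(api_key, messages, max_tokens=1024, temperature=0.7):
    """Build (but do not send) a POST request mirroring the curl example."""
    payload = {
        "model": "tj-1.0",
        "messages": messages,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


def send(request, timeout=30):
    """Send the request and return the decoded JSON body.

    The response layout is an unknown here; inspect the returned dict
    rather than relying on specific keys.
    """
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

Typical usage would be `send(build_request("sk-tj-your-key", [{"role": "user", "content": "Hello!"}]))`. Keeping request construction separate from sending makes the payload easy to inspect or log before any network call.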
---
## Limitations
1. **Dialectal Tajik:** The model performs best on standard literary Tajik (Cyrillic). Regional dialects and Latin-script Tajik may show reduced quality.
2. **Hallucinations:** Like all LLMs, TJ-1.0 may generate plausible-sounding but factually incorrect information. Always verify critical facts.
3. **Knowledge cutoff:** The model has no knowledge of events after Q3 2024.
4. **Mathematical reasoning:** Complex multi-step calculations may produce errors. Use dedicated tools for precise math.
5. **Low-resource languages:** While 50+ languages are supported, quality varies significantly for lower-resource languages.
6. **Long context degradation:** Performance on tasks requiring reasoning over very long documents (>64K tokens) may degrade.
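
One possible workaround for the long-context limitation is to split a large document into pieces that each stay under roughly 64K tokens and process them separately (for example, summarizing each piece and then combining the summaries). The sketch below estimates tokens from word counts; the `tokens_per_word` ratio is a guess, not a measured property of the TJ-1.0 tokenizer, and should be tuned against real usage numbers.

```python
def chunk_words(text, max_tokens=64_000, tokens_per_word=2.0):
    """Split text into word-aligned chunks under an estimated token budget.

    tokens_per_word is a rough assumption; Tajik Cyrillic text may
    tokenize quite differently from English.
    """
    if max_tokens <= 0 or tokens_per_word <= 0:
        raise ValueError("max_tokens and tokens_per_word must be positive")
    words_per_chunk = max(1, int(max_tokens / tokens_per_word))
    words = text.split()
    return [
        " ".join(words[i:i + words_per_chunk])
        for i in range(0, len(words), words_per_chunk)
    ]
```

Splitting on whitespace collapses line breaks and other layout, which is acceptable for summarization but not for tasks that depend on the original formatting.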
---
## Responsible AI & Safety
- **RLHF:** The model was fine-tuned using human preference data to align with helpful, harmless, and honest behavior.
- **Red Teaming:** Internal adversarial testing was conducted to identify failure modes in Tajik, Russian, and English.
- **Content Filtering:** The TajikGPT API includes a multi-layer content filtering system that operates independently of the model.
- **Bias:** Training data reflects the diversity of web content and may contain societal biases. Users should apply critical judgment when using outputs for sensitive decisions.
- **Privacy:** The training data was processed with PII (personally identifiable information) removal pipelines.
---
## Model Family
| Model | Context | Max Output | Specialty | Tier |
|---|---|---|---|---|
| **TJ-1.0 Mini** | 128K | 4,096 | Fast & lightweight | Free |
| **TJ-1.0** | 128K | 8,192 | Balanced — general purpose | Free |
| **TJ-1.0 Pro** | 128K | 16,384 | Advanced + Vision | Plus |
| **TJ-1.0 Ultra** | 128K | 32,768 | Top performance | Plus |
| **TJ-Coder** | 131K | 32,768 | Code specialist | Free |
| **TJ-Image 1.0** | — | — | Text-to-Image | Free |
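
As an illustration of how the table might guide model choice, this sketch picks the text model with the smallest output limit that still covers a required response length, optionally restricted to the Free tier. The names are the display names from the table; only `tj-1.0` appears as an API identifier in this card, so the identifiers for the other models are not assumed here.

```python
from typing import NamedTuple, Optional


class ModelSpec(NamedTuple):
    name: str        # display name from the Model Family table
    max_output: int  # maximum output tokens
    tier: str        # "Free" or "Plus"


# Text models from the table, in table order (TJ-Image 1.0 is omitted
# because it has no token limits).
TEXT_MODELS = [
    ModelSpec("TJ-1.0 Mini", 4_096, "Free"),
    ModelSpec("TJ-1.0", 8_192, "Free"),
    ModelSpec("TJ-1.0 Pro", 16_384, "Plus"),
    ModelSpec("TJ-1.0 Ultra", 32_768, "Plus"),
    ModelSpec("TJ-Coder", 32_768, "Free"),
]


def pick_model(required_output: int, free_only: bool = False) -> Optional[ModelSpec]:
    """Return the model with the smallest sufficient output limit, or None."""
    candidates = [
        m for m in TEXT_MODELS
        if m.max_output >= required_output and (not free_only or m.tier == "Free")
    ]
    # min() keeps the first candidate on ties, so table order breaks ties.
    return min(candidates, key=lambda m: m.max_output, default=None)
```

This selects on output length and tier only. In practice the Specialty column matters too: TJ-Coder is a code specialist, so it may not be the right fallback for general-purpose text even when it is the only Free model with a large enough output limit.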
---
## Links
- **Platform:** [tajikgpt.com](https://tajikgpt.com)
- **API Docs:** [tajikgpt.com/docs](https://tajikgpt.com/docs)
- **Python SDK:** [pypi.org/project/tajikgpt](https://pypi.org/project/tajikgpt/)
- **Live Demo:** [HuggingFace Space](https://huggingface.co/spaces/TajikGPT-Team/tajikgpt)
- **Developer:** [SoulLab](https://soullab.space)
---
## Citation
If you use TJ-1.0 in research or build products on top of it, please cite:
```bibtex
@misc{tajikgpt2024tj10,
  title        = {TJ-1.0: A Multilingual Large Language Model with Native Tajik Language Support},
  author       = {SoulLab},
  year         = {2024},
  howpublished = {\url{https://tajikgpt.com}},
  note         = {Proprietary model, available via API at https://tajikgpt.com}
}
```
---
*Built with care for Tajikistan and Central Asia. Developed by [SoulLab](https://soullab.space).*