	---
	language:
	- tg
	- ru
	- en
	license: other
	license_name: tajikgpt-proprietary
	license_link: https://tajikgpt.com/license
	tags:
	- text-generation
	- conversational
	- multilingual
	- tajik
	- central-asia
	- instruction-tuned
	pipeline_tag: text-generation
	base_model: []
	extra_gated_prompt: >-
	  This model is proprietary and available via API only.
	---

	# TJ-1.0

	TJ-1.0 is the flagship instruction-tuned language model of the [TajikGPT](https://tajikgpt.com) platform, developed by [SoulLab](https://soullab.space). It is the first commercially deployed large language model with native support for the Tajik language, offering a balanced combination of quality, speed, and multilingual capability.

	> Note: TJ-1.0 is available via API only; it cannot be downloaded or deployed locally.

	---

	## Model Details

	| Property | Value |
	|---|---|
	| Developer | SoulLab |
	| Model type | Instruction-tuned Causal Language Model |
	| Architecture | Decoder-only Transformer with Grouped Query Attention (GQA) |
	| Positional Encoding | Rotary Position Embedding (RoPE) |
	| Tokenizer | Byte-Pair Encoding (BPE), extended vocabulary for Tajik Cyrillic & Latin |
	| Fine-tuning | Supervised Fine-Tuning (SFT) + Reinforcement Learning from Human Feedback (RLHF) |
	| Context window | 128,000 tokens |
	| Max output tokens | 8,192 tokens |
	| Knowledge cutoff | Q3 2024 |
	| Languages | Tajik (tg), Russian (ru), English (en), and 50+ other languages |
	| License | Proprietary ([TajikGPT Terms](https://tajikgpt.com/license)) |
	| Training hardware | NVIDIA A100 80GB, bf16 precision, PyTorch |
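
The context window and output cap in the table imply a simple prompt budget. The sketch below assumes that prompt and completion tokens share the same 128,000-token window, which is common for chat APIs but is not stated in this card. The name `max_prompt_tokens` is illustrative and not part of any SDK.

```python
CONTEXT_WINDOW = 128_000   # "Context window" row of the Model Details table
MAX_OUTPUT_TOKENS = 8_192  # "Max output tokens" row of the same table


def max_prompt_tokens(reserved_output: int = MAX_OUTPUT_TOKENS) -> int:
    """Return the prompt budget left after reserving room for the reply.

    Assumption: prompt and completion share one context window.
    """
    if not 0 < reserved_output <= MAX_OUTPUT_TOKENS:
        raise ValueError("reserved_output must be between 1 and 8,192")
    return CONTEXT_WINDOW - reserved_output


print(max_prompt_tokens())      # 119808
print(max_prompt_tokens(1024))  # 126976
```

Reserving fewer output tokens leaves more room for the prompt, at the cost of shorter replies.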

	---

	## Training Data

	TJ-1.0 was trained on a curated multilingual corpus with a strong emphasis on Tajik-language content — the first dataset of this scale built specifically for Tajik NLP.

	| Source Category | Description | Approx. Share |
	|---|---|---|
	| Tajik Web Corpus | News, blogs, forums, government portals in Tajik (Cyrillic & Latin) | 28% |
	| Tajik Literature & Culture | Books, poetry, historical texts, folklore | 12% |
	| Tajik Legislation | Laws, decrees, official government documents | 8% |
	| Multilingual Web | High-quality filtered web data (Russian, English, and others) | 32% |
	| Instruction & Dialogue | Human-written and synthetic instruction-following data | 14% |
	| Code | Source code across major programming languages | 6% |

	- Total corpus size: ~2 trillion tokens
	- Data freshness: Content up to Q3 2024
	- Processing: Deduplication, quality filtering, language identification, and PII removal applied to all sources.
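
As a rough worked example, the approximate shares in the table can be converted into token counts. This sketch treats "~2 trillion" as exactly 2 × 10^12 and the rounded percentages as exact, so the results are order-of-magnitude figures only. The names `SHARES` and `tokens_for` are illustrative.

```python
TOTAL_TOKENS = 2_000_000_000_000  # "~2 trillion tokens", treated as exact here

SHARES = {  # approximate percentages from the Training Data table
    "Tajik Web Corpus": 28,
    "Tajik Literature & Culture": 12,
    "Tajik Legislation": 8,
    "Multilingual Web": 32,
    "Instruction & Dialogue": 14,
    "Code": 6,
}

# Sanity check: the listed categories account for the whole corpus.
assert sum(SHARES.values()) == 100


def tokens_for(category: str) -> int:
    """Estimated token count for one source category."""
    return TOTAL_TOKENS * SHARES[category] // 100


# Categories labelled as Tajik; other categories may also contain Tajik text.
tajik_share = sum(pct for name, pct in SHARES.items() if name.startswith("Tajik"))

print(tajik_share)                     # 48
print(tokens_for("Tajik Web Corpus"))  # 560000000000
```

Under these assumptions, the three Tajik-labelled categories make up about 48% of the corpus, or roughly 0.96 trillion tokens.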

	---

	## Intended Use

	### Recommended Use Cases
	- Multilingual chat and Q&A in Tajik, Russian, English, and 50+ other languages
	- Document summarization and translation
	- Creative writing, content creation and copywriting
	- Education, tutoring and homework help
	- Business communication and professional correspondence
	- Data analysis, extraction and summarization
	- Code generation and debugging

	### Out-of-Scope Use Cases
	- Generation of illegal, harmful, or deceptive content
	- Medical diagnosis or legal advice without professional oversight
	- Surveillance or targeting of individuals
	- Automated high-stakes decision-making without human review
	- Any use violating the [TajikGPT Terms of Service](https://tajikgpt.com/terms)

	---

	## Evaluation / Benchmarks

	Shot counts for the general benchmarks are listed per row in the table below; MT-Bench and HumanEval were evaluated zero-shot.

	### General Benchmarks

	| Benchmark | Score | # Shots | Metric |
	|---|---|---|---|
	| MMLU (Massive Multitask Language Understanding) | 72.1% | 5-shot | Accuracy |
	| MT-Bench (Multi-turn instruction following) | 7.1 / 10 | 0-shot | GPT-4 Judge |
	| HumanEval (Code generation) | 58.3% | 0-shot | pass@1 |
	| HellaSwag (Commonsense reasoning) | 81.4% | 10-shot | Accuracy |
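
This card does not say how the HumanEval pass@1 figure was computed. One widely used approach is the unbiased pass@k estimator introduced with HumanEval: for a problem with `n` generated samples of which `c` pass the tests, pass@k = 1 - C(n-c, k) / C(n, k), averaged over problems. The sketch below implements that formula for a single problem; it illustrates the metric and is not a description of SoulLab's evaluation pipeline.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem.

    n: samples generated, c: samples that passed, k: attempt budget.
    """
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # Probability that a random size-k subset contains no passing sample.
    all_fail = comb(n - c, k) / comb(n, k)
    return 1.0 - all_fail


# With k = 1 the estimator reduces to the pass rate c / n.
print(round(pass_at_k(n=10, c=3, k=1), 3))  # 0.3
```

A benchmark-level score is then the mean of this value across all problems.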

	### Tajik Language Benchmarks

	> These are the first published benchmarks for Tajik-language LLM evaluation.

	| Benchmark | Score | Description |
	|---|---|---|
	| TajikQA | 78.4% | Open-domain Q&A in Tajik language |
	| TajikTranslate | 81.2 BLEU | Tajik ↔ Russian translation |
	| TajikInstruct | 74.6% | Instruction following in Tajik |

	---

	## How to Use

	TJ-1.0 is available via the TajikGPT API. Install the SDK or use the REST API directly.

	```bash
	pip install tajikgpt
	```

	### Python SDK

	```python
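	from tajikgpt import TajikGPT

	# Replace the placeholder with your own API key.
	client = TajikGPT(api_key="sk-tj-your-key")

	response = client.chat.completions.create(
	    model="tj-1.0",
	    messages=[
	        # System prompt (Russian): "You are a helpful assistant."
	        {"role": "system", "content": "Ты полезный помощник."},
	        # User prompt (Tajik): "Explain in Tajik: what is a neural network?"
	        {"role": "user", "content": "Ба забони тоҷикӣ шарҳ деҳ: нейронӣ шабака чист?"},
	    ],
	)
	# Print the text of the first completion choice.
	print(response.choices[0].message.content)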
	```
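
For multi-turn chat, a small helper can keep the conversation history in sync with each call. This is a minimal sketch built only on the `client.chat.completions.create` call and the `response.choices[0].message.content` access shown above. The helper name `ask` is hypothetical, and the use of an `"assistant"` role for earlier replies is an assumption that this card does not confirm.

```python
def ask(client, history, user_text, model="tj-1.0"):
    """Send one user turn and record both sides of the exchange in `history`.

    `client` is expected to expose chat.completions.create(...) as in the
    SDK example; `history` is a list of {"role", "content"} dicts.
    """
    history.append({"role": "user", "content": user_text})
    response = client.chat.completions.create(model=model, messages=history)
    reply = response.choices[0].message.content
    # Assumption: earlier model replies are sent back with role "assistant".
    history.append({"role": "assistant", "content": reply})
    return reply


# Usage (requires the SDK and a valid key):
#   history = [{"role": "system", "content": "You are a helpful assistant."}]
#   ask(client, history, "What is a neural network?")
#   ask(client, history, "Give a shorter answer.")
```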

	### REST API

	```bash
	curl -X POST https://tajikgpt.com/api/tj/chat \
	  -H "Content-Type: application/json" \
	  -H "Authorization: Bearer sk-tj-your-key" \
	  -d '{
	    "model": "tj-1.0",
	    "messages": [
	      {"role": "user", "content": "Hello! What can you do?"}
	    ],
	    "max_tokens": 1024,
	    "temperature": 0.7
	  }'
	```
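
The same request can be issued from Python without the SDK. The sketch below mirrors the endpoint, headers, and JSON fields of the curl example using only the standard library. The function names `build_request` and `send` are illustrative, and the shape of the JSON response is not documented in this card, so `send` simply returns the decoded body.

```python
import json
import urllib.request

API_URL = "https://tajikgpt.com/api/tj/chat"  # endpoint from the curl example


def build_request(api_key, messages, max_tokens=1024, temperature=0.7):
    """Build a POST request with the same payload fields as the curl call."""
    payload = {
        "model": "tj-1.0",
        "messages": messages,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


def send(request, timeout=30):
    """Send the request and return the decoded JSON body (shape unverified)."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.load(resp)


# Usage (performs a network call and needs a valid key):
#   req = build_request("sk-tj-your-key", [{"role": "user", "content": "Hello!"}])
#   print(send(req))
```

Separating request construction from sending keeps the payload easy to inspect or log before anything goes over the network.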

	---

	## Limitations

	1. Dialectal Tajik: The model performs best on standard literary Tajik (Cyrillic). Regional dialects and Latin-script Tajik may show reduced quality.
	2. Hallucinations: Like all LLMs, TJ-1.0 may generate plausible-sounding but factually incorrect information. Always verify critical facts.
	3. Knowledge cutoff: The model has no knowledge of events after Q3 2024.
	4. Mathematical reasoning: Complex multi-step calculations may produce errors. Use dedicated tools for precise math.
	5. Low-resource languages: While 50+ languages are supported, quality varies significantly for lower-resource languages.
	6. Long context degradation: Performance on tasks requiring reasoning over very long documents (>64K tokens) may degrade.
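
One way to work around limitation 6 is to split long documents into chunks that stay below roughly 64K tokens and process them separately. The sketch below is a heuristic, not a tokenizer: the TJ-1.0 tokenizer is not public, so tokens are estimated from character counts using a `chars_per_token` ratio that is purely an assumption and should be tuned per language. The name `chunk_text` is illustrative.

```python
def chunk_text(text: str, max_tokens: int = 64_000,
               chars_per_token: float = 3.0) -> list[str]:
    """Split text on blank lines into chunks under an estimated token budget."""
    max_chars = int(max_tokens * chars_per_token)
    if max_chars < 1:
        raise ValueError("budget must allow at least one character")

    chunks, current, size = [], [], 0
    for para in text.split("\n\n"):
        if not para.strip():
            continue  # skip empty paragraphs
        # Hard-split any single paragraph that alone exceeds the budget.
        pieces = [para[i:i + max_chars] for i in range(0, len(para), max_chars)]
        for piece in pieces:
            added = len(piece) + (2 if current else 0)  # 2 for the "\n\n" joiner
            if current and size + added > max_chars:
                chunks.append("\n\n".join(current))
                current, size = [], 0
                added = len(piece)
            current.append(piece)
            size += added
    if current:
        chunks.append("\n\n".join(current))
    return chunks


print(chunk_text("aaaa\n\nbbbb\n\ncccc", max_tokens=5, chars_per_token=2))
# ['aaaa\n\nbbbb', 'cccc']
```

Each chunk can then be summarized on its own and the partial results combined in a final request.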

	---

	## Responsible AI & Safety

	- RLHF: The model was fine-tuned using human preference data to align with helpful, harmless, and honest behavior.
	- Red Teaming: Internal adversarial testing was conducted to identify failure modes in Tajik, Russian, and English.
	- Content Filtering: The TajikGPT API includes a multi-layer content filtering system that operates independently of the model.
	- Bias: Training data reflects the diversity of web content and may contain societal biases. Users should apply critical judgment when using outputs for sensitive decisions.
	- Privacy: The training data was processed with PII (personally identifiable information) removal pipelines.

	---

	## Model Family

	| Model | Context (tokens) | Max Output (tokens) | Specialty | Tier |
	|---|---|---|---|---|
	| TJ-1.0 Mini | 128K | 4,096 | Fast & lightweight | Free |
	| TJ-1.0 | 128K | 8,192 | Balanced, general purpose | Free |
	| TJ-1.0 Pro | 128K | 16,384 | Advanced + Vision | Plus |
	| TJ-1.0 Ultra | 128K | 32,768 | Top performance | Plus |
	| TJ-Coder | 131K | 32,768 | Code specialist | Free |
	| TJ-Image 1.0 | n/a | n/a | Text-to-Image | Free |

	---

	## Links

	- Platform: [tajikgpt.com](https://tajikgpt.com)
	- API Docs: [tajikgpt.com/docs](https://tajikgpt.com/docs)
	- Python SDK: [pypi.org/project/tajikgpt](https://pypi.org/project/tajikgpt/)
	- Live Demo: [HuggingFace Space](https://huggingface.co/spaces/TajikGPT-Team/tajikgpt)
	- Developer: [SoulLab](https://soullab.space)

	---

	## Citation

	If you use TJ-1.0 in research or build products on top of it, please cite:

	```bibtex
	@misc{tajikgpt2024tj10,
	  title        = {TJ-1.0: A Multilingual Large Language Model with Native Tajik Language Support},
	  author       = {SoulLab},
	  year         = {2024},
	  howpublished = {\url{https://tajikgpt.com}},
	  note         = {Proprietary model, available via API at https://tajikgpt.com}
	}
	```

	---

	Built with care for Tajikistan and Central Asia. Developed by [SoulLab](https://soullab.space).