---
license: apache-2.0
language:
- tr
- en
base_model: Qwen/Qwen3.5-4B
tags:
- turkish
- experimental
- fine-tuned
- qwen
- conversational
- text-generation
pipeline_tag: text-generation
---

# Jazari-4B-SFT-TR (Experimental)

A 4B parameter Turkish-adapted LLM based on Qwen3.5-4B. This is an **experimental** model created as a low-budget Turkish language adaptation case study.

> **Note:** This is a research/experimental release. It is NOT a production-ready model. See limitations below.

## What is this?

Jazari-4B-SFT-TR is a Qwen3.5-4B model adapted for Turkish through:

1. **Continued Pre-Training (CPT):** 674 MB of Turkish text, 11,939 steps
2. **Supervised Fine-Tuning (SFT):** 73,182 examples across 13 categories, 9,600 steps (early stopping)

The entire training was done on a single RTX 5090 GPU (vast.ai) over approximately 1 week, using LoRA (r=128 for CPT, r=64 for SFT) with the Unsloth + TRL framework.

## Training Details

| Parameter | Value |
|-----------|-------|
| Base Model | Qwen/Qwen3.5-4B |
| CPT Data | 674 MB Turkish text (AkademikDerlem, BellaTurca, OzenliDerlem, dolphin-r1-turkish, gsm8k_tr) |
| SFT Data | 73,182 examples (chat, reasoning, tool-calling, code, knowledge, helpsteer, etc.) |
| CPT Steps | 11,939 (1 epoch, train loss: 0.48) |
| SFT Steps | 9,600 (early stopping at epoch 1.05, best eval loss: 0.992) |
| LoRA Rank | CPT: r=128, SFT: r=64 (rsLoRA) |
| Precision | bf16 |
| Framework | Unsloth 2026.3.x + TRL 0.29.1 |
| GPU | NVIDIA RTX 5090 (32 GB VRAM) |
| Total Training Time | ~7 days |
| Total Cost | ~$25 (vast.ai) |

## Evaluation Results

### Custom Turkish Benchmark (360 multiple-choice questions)

> **Important:** This benchmark uses handcrafted questions inspired by global benchmarks (MMLU, ARC, GSM8K, TruthfulQA, HellaSwag, Winogrande). It is NOT an official benchmark and scores are not directly comparable to published leaderboard results. Questions tend to be easier than official benchmarks.

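
Scoring a multiple-choice benchmark like this one depends on pulling a single option letter out of each free-text response. The sketch below is one way to do that, not the harness used for these results, which is not described here. It assumes options labeled A to D and think-mode reasoning wrapped in `<think>...</think>` tags; `extract_choice` and `accuracy` are hypothetical helper names.

```python
import re
from typing import Optional, Sequence

# Assumption: think-mode reasoning is wrapped in <think>...</think> tags.
THINK_BLOCK = re.compile(r"<think>.*?</think>", re.DOTALL)
# Assumption: options are labeled A-D; match a standalone capital letter.
OPTION = re.compile(r"\b([A-D])\b")


def extract_choice(response: str) -> Optional[str]:
    """Return the first standalone option letter outside any think block."""
    visible = THINK_BLOCK.sub("", response)
    # An unclosed <think> block means the answer was never emitted; drop it.
    visible = visible.split("<think>", 1)[0]
    match = OPTION.search(visible)
    return match.group(1) if match else None


def accuracy(responses: Sequence[str], gold: Sequence[str]) -> float:
    """Fraction of responses whose extracted letter equals the gold letter."""
    correct = sum(extract_choice(r) == g for r, g in zip(responses, gold))
    return correct / len(gold)
```

In this sketch a response with no extractable letter counts as incorrect, so truncated or unclosed think-mode output lowers the score even when the underlying reasoning was sound.
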
| Benchmark | Description | Questions | jazari-4b-sft | qwen3.5:4b (base) |
|-----------|-------------|:---------:|:-------------:|:------------------:|
| TR-MMLU | Turkish knowledge (12 subjects) | 120 | 85.0% | 80.0% |
| TR-ARC | Turkish science (elementary) | 50 | 98.0% | 84.0% |
| TR-GSM8K | Turkish math word problems | 50 | 52.0% | 66.0% |
| TR-TruthfulQA | Misconception/truthfulness | 40 | 95.0% | 52.5% |
| TR-HellaSwag | Sentence completion | 40 | 97.5% | 47.5% |
| TR-Winogrande | Coreference resolution | 30 | 66.7% | 43.3% |
| TR-Cultural | Turkish culture & idioms | 30 | 76.7% | 33.3% |
| **Overall** | | **360** | **82.5%** | **65.0%** |

*Note: qwen3.5:4b scores may be affected by think-mode output parsing issues (121 of 360 answers, about one third, could not be extracted). These results should be interpreted with caution.*

### Real-World Evaluation (15 scenarios, scored by independent evaluator)

| Model | Accuracy | Turkish Quality | Helpfulness | Overall |
|-------|:--------:|:---------------:|:-----------:|:-------:|
| jazari-4b-sft | 5.20/10 | 7.33/10 | 5.00/10 | 5.84/10 |
| qwen3.5:4b | 5.00/10 | 7.00/10 | 4.53/10 | 5.51/10 |

*jazari won 6/15 scenarios, qwen won 3/15, 6 ties.*

## Strengths

- **Turkish fluency:** More natural Turkish responses than base Qwen3.5-4B
- **Cultural knowledge:** Better understanding of Turkish idioms, proverbs, and cultural references
- **Conversational ability:** Stable, helpful responses in casual Turkish conversation
- **Speed:** ~5x faster responses than the base model, because it does not generate think-mode output
- **Low cost:** Entire training cost ~$25 on vast.ai

## Known Limitations & Failure Modes

**Be aware of these issues before using this model:**

1. **Overconfidence:** The model rarely says "I don't know." It may generate plausible-sounding but incorrect answers.
2. **Format discipline:** Struggles with strict output formats. When asked for JSON, it may return plain text instead.
3. **Translation:** Cannot reliably translate Turkish to English (may echo the Turkish input).
4. **Grammar correction:** May not detect obvious spelling errors (e.g., "Herkezin" vs "Herkesin").
5. **Math word problems:** Weaker than the base model on TR-GSM8K (52% vs 66%).
6. **Instruction following:** May add extra text when asked for short answers (e.g., "Just say Yes or No").
7. **Hallucination:** May fabricate plausible-sounding facts, especially about specific places, people, or events.
8. **Limited CPT data:** Only 674 MB of Turkish text was used for continued pre-training, which limits deep domain knowledge.

### Example failures:

```
Q: "Bu cümledeki yazım yanlışını bul: 'Herkezin bir fikri var.'"
A: "Cümlede ciddi bir sorun yok."
# WRONG - should detect "Herkezin" → "Herkesin"

Q: "Bu cümleyi İngilizce'ye çevir: 'Türkiye güzel bir ülkedir.'"
A: "Türkiye güzel bir ülkedir."
# WRONG - echoed Turkish instead of translating

Q: "Şu bilgileri JSON formatında ver: isim=Ali, yaş=25"
A: "Ali, 25 yaşındadır."
# WRONG - should have returned JSON
```

## Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("mahsum/jazari-4b-sft-tr")
tokenizer = AutoTokenizer.from_pretrained("mahsum/jazari-4b-sft-tr")

# Prompt: "Which is the most beautiful city in Türkiye?"
messages = [
    {"role": "user", "content": "Türkiye'nin en güzel şehri neresidir?"}
]
# return_dict=True yields input_ids plus attention_mask for generate().
inputs = tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=True, return_dict=True, return_tensors="pt")
# do_sample=True is required for temperature and top_p to take effect.
output = model.generate(**inputs, max_new_tokens=256, do_sample=True, temperature=0.7, top_p=0.9, repetition_penalty=1.3)
# Decode only the newly generated tokens, skipping the echoed prompt.
prompt_len = inputs["input_ids"].shape[-1]
print(tokenizer.decode(output[0][prompt_len:], skip_special_tokens=True))
```

### GGUF / Ollama

```bash
# If GGUF version is available
ollama run mahsum/jazari-4b-sft-tr
```

## What this model is good for

- Turkish conversational AI experiments
- Studying low-budget Turkish LLM adaptation techniques
- Baseline for further Turkish fine-tuning
- Learning about CPT + SFT training pipelines

## What this model is NOT good for

- Production deployment without additional safety measures
- Reliable factual question answering
- Tasks requiring strict format compliance (JSON, structured output)
- Translation tasks
- Any high-stakes application

## Next Steps (v2 Roadmap)

- [ ] Full CPT with 26 GB cleaned Turkish text (currently only 674 MB used)
- [ ] Alignment training (SimPO/GRPO)
- [ ] Improved format discipline and instruction following
- [ ] Uncertainty calibration ("I don't know" training)
- [ ] Official Cetvel benchmark evaluation
- [ ] GGUF quantized versions for edge deployment

## Citation

```bibtex
@misc{aktas2026jazari,
  title={Jazari-4B-SFT-TR: Low-Budget Turkish Adaptation of Qwen3.5-4B},
  author={Aktas, Mahsum},
  year={2026},
  url={https://huggingface.co/mahsum/jazari-4b-sft-tr}
}
```

## Acknowledgments

- [Qwen Team](https://huggingface.co/Qwen) for the base model
- [Unsloth](https://github.com/unslothai/unsloth) for training optimizations
- [vast.ai](https://vast.ai) for affordable GPU compute
- Turkish NLP community for open datasets

---

*Jazari is named after Al-Jazari (1136-1206), the Turkish-Muslim polymath and engineer, often regarded as the father of robotics.*
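
Because the model may answer in plain text when asked for JSON (see the format-discipline limitation above), callers experimenting with structured output may want to validate each reply instead of trusting it. The following is a minimal sketch under stated assumptions: `generate` stands in for any hypothetical function that sends a prompt to the model and returns its text, and `parse_json_object` and `generate_json` are illustrative names, not part of this repository.

```python
import json
from typing import Callable, Optional


def parse_json_object(text: str) -> Optional[dict]:
    """Return the first JSON object embedded in text, or None if none parses."""
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            value, _ = decoder.raw_decode(text, start)
        except json.JSONDecodeError:
            value = None
        if isinstance(value, dict):
            return value
        # Try the next opening brace in case this one was not valid JSON.
        start = text.find("{", start + 1)
    return None


def generate_json(
    generate: Callable[[str], str],  # hypothetical: prompt in, model text out
    prompt: str,
    retries: int = 2,
) -> Optional[dict]:
    """Query the model up to retries + 1 times until a reply contains JSON."""
    for _ in range(retries + 1):
        parsed = parse_json_object(generate(prompt))
        if parsed is not None:
            return parsed
    return None
```

A `None` result signals that every attempt failed, which the caller should treat as an error instead of passing the raw text downstream. Retrying only helps when sampling is enabled, since greedy decoding returns the same reply each time.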