---
license: apache-2.0
base_model:
- Qwen/Qwen3.6-27B
tags:
- qwen3.6
- gguf
- tq3_4s
- turboquant
- vision
- multimodal
pipeline_tag: image-text-to-text
language:
- en
- zh
- multilingual
---

# Qwen3.6-27B-TQ3_4S

[![Qwen Chat](https://img.shields.io/badge/%F0%9F%92%9C%EF%B8%8F%20Qwen%20Chat%20-536af5)](https://chat.qwen.ai)

## TQ3_4S Release

This repository packages the model as a TurboQuant `TQ3_4S` GGUF for local deployment.

## Runtime Compatibility

This quant requires a TurboQuant-capable runtime. For llama.cpp, use the `turbo-tan/llama.cpp-tq3` fork rather than stock upstream llama.cpp if you want native `TQ3_4S` support.

- TurboQuant runtime fork: [turbo-tan/llama.cpp-tq3](https://github.com/turbo-tan/llama.cpp-tq3)
- LM Studio setup: [docs/backend/LMStudio.md](https://github.com/turbo-tan/llama.cpp-tq3/blob/main/docs/backend/LMStudio.md)

## Files

| File | Quant | Size |
| --- | --- | ---: |
| `Qwen3.6-27B-TQ3_4S.gguf` | TQ3_4S | ~13.0 GB |
| `chat_template.jinja` | chat template | text |
| `thumbnail.png` | model card image | png |

## Local Validation

Hardware:

- RTX 5060 Ti 16 GB

Prompt processing:

- `llama-perplexity --chunks 10 -c 2048`
- `PPL = 6.2452 +/- 0.16138`
- `prompt eval = 712.02 tok/s`

16 GB VRAM fit checks on RTX 5060 Ti with the recommended KV settings:

- `32k` context fits
- `64k` context fits
- `128k` context does not fit

## Runtime Notes

- Use a TurboQuant-capable llama.cpp build for best performance.
- For llama.cpp, the intended runtime is the `turbo-tan/llama.cpp-tq3` fork.
- The upstream family is multimodal-capable, but the public 27B repos used here do not currently expose a separate GGUF `mmproj` artifact.
- For llama.cpp chat usage, keep `--jinja` enabled so the bundled chat template is honored.
- Upstream guidance recommends keeping at least `128K` context when possible for reasoning-heavy workloads. On smaller local GPUs, reduce context as needed to fit memory.
- Upstream default sampling guidance differs between thinking and non-thinking mode; follow the official Qwen card if you are trying to reproduce base-model behavior.

## Recommended llama.cpp Settings

Default prompt-processing settings on 16 GB:

```bash
llama-bench \
  -m Qwen3.6-27B-TQ3_4S.gguf \
  -ngl 99 \
  -ctk q4_0 \
  -ctv tq3_0 \
  -fa 1 \
  -p 2048 -n 0 -r 3
```

Default chat/server settings:

```bash
llama-server \
  -m Qwen3.6-27B-TQ3_4S.gguf \
  --host 127.0.0.1 --port 8080 \
  -ngl 99 -c 4096 -np 1 \
  -ctk q4_0 -ctv tq3_0 -fa on \
  --jinja
```

## Example

```bash
llama-cli \
  -m Qwen3.6-27B-TQ3_4S.gguf \
  --jinja \
  -ngl 99 \
  -c 4096
```

Build/runtime:

```bash
git clone https://github.com/turbo-tan/llama.cpp-tq3
```

## Qwen3.6 Base Model

> [!Note]
> The upstream Qwen repository contains model weights and configuration files for the post-trained model in the Hugging Face Transformers format.
>
> Those upstream artifacts are compatible with Hugging Face Transformers, vLLM, SGLang, KTransformers, and related runtimes.

Following the February release of the Qwen3.5 series, Qwen describes Qwen3.6 as the first open-weight Qwen3.6 variant, built for stronger stability and real-world utility.

### Qwen3.6 Highlights

- **Agentic Coding:** the model handles frontend workflows and repository-level reasoning with greater fluency and precision.
- **Thinking Preservation:** the model family retains reasoning context across historical turns to reduce overhead during iterative work.

![Benchmark Results](https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3.6/Figures/qwen3.6_27b_score.png)

### Model Overview

- Type: Causal Language Model with Vision Encoder
- Training Stage: Pre-training and Post-training
- Architecture: `qwen35`
- Parameters: `27B`
- Layers: `64`
- Embedding dimension: `5120`
- FFN dimension: `17408`
- Hidden layout: `16 × (3 × (Gated DeltaNet -> FFN) -> 1 × (Gated Attention -> FFN))`
- Gated DeltaNet heads: `48` for `V`, `16` for `QK`, head dim `128`
- Gated Attention heads: `24` for `Q`, `4` for `KV`, head dim `256`
- RoPE dim: `64`
- Native context: `262,144`

### Selected Upstream Benchmark Highlights

- `SWE-bench Verified`: `77.2`
- `Terminal-Bench 2.0`: `59.3`
- `SkillsBench Avg5`: `48.2`
- `GPQA Diamond`: `87.8`
- `AIME26`: `94.1`
- `MMMU`: `82.9`
- `AndroidWorld`: `70.3`

## Sources

- Upstream base model: [Qwen/Qwen3.6-27B](https://huggingface.co/Qwen/Qwen3.6-27B)
- Upstream GGUF source used for conversion: [unsloth/Qwen3.6-27B-GGUF](https://huggingface.co/unsloth/Qwen3.6-27B-GGUF)
- Upstream blog and benchmark context: [Qwen3.6-27B model card](https://huggingface.co/Qwen/Qwen3.6-27B)
- TurboQuant runtime fork: [turbo-tan/llama.cpp-tq3](https://github.com/turbo-tan/llama.cpp-tq3)
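
For sizing context length against VRAM, the Model Overview figures allow a rough upper-bound estimate of the KV cache. The sketch below is an illustration and not a measurement. It assumes that only the 16 Gated Attention layers keep a per-token KV cache, that the cache is stored as unquantized f16, and that `32k`, `64k`, and `128k` mean 32768, 65536, and 131072 tokens. The variable names are illustrative.

```bash
# Rough f16 KV-cache estimate from the Model Overview figures.
# Assumption: only Gated Attention layers hold a per-token KV cache;
# the fixed-size Gated DeltaNet state is ignored here.
attn_layers=16      # 16 repeats of the hidden layout, 1 attention layer each
kv_heads=4          # Gated Attention KV heads
head_dim=256        # Gated Attention head dim
bytes_per_elem=2    # f16 baseline; quantized cache types use fewer bytes

# K and V each store kv_heads * head_dim elements per attention layer.
bytes_per_token=$(( 2 * attn_layers * kv_heads * head_dim * bytes_per_elem ))

for ctx in 32768 65536 131072; do
  mib=$(( bytes_per_token * ctx / 1024 / 1024 ))
  echo "ctx=${ctx} f16_kv_cache=${mib} MiB"
done
```

Under these assumptions the f16 baseline works out to 64 KiB per token, or 2048, 4096, and 8192 MiB at the three context sizes. The recommended `-ctk q4_0 -ctv tq3_0` settings store fewer bits per element, so the actual cache should be smaller than this baseline. The estimate excludes the ~13.0 GB of weights and any compute buffers.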