---
license: apache-2.0
base_model:
  - Qwen/Qwen3.6-27B
tags:
  - qwen3.6
  - gguf
  - tq3_4s
  - turboquant
  - vision
  - multimodal
pipeline_tag: image-text-to-text
language:
  - en
  - zh
  - multilingual
---

# Qwen3.6-27B-TQ3_4S

<img width="400px" src="https://qianwen-res.oss-accelerate.aliyuncs.com/Qwen3.6/logo.png">

[![Qwen Chat](https://img.shields.io/badge/%F0%9F%92%9C%EF%B8%8F%20Qwen%20Chat%20-536af5)](https://chat.qwen.ai)

## TQ3_4S Release

This repository packages the model as a TurboQuant `TQ3_4S` GGUF for local deployment.

## Runtime Compatibility

This quant requires a TurboQuant-capable runtime. For llama.cpp, use the `turbo-tan/llama.cpp-tq3` fork rather than stock upstream llama.cpp if you want native `TQ3_4S` support.

- TurboQuant runtime fork: [turbo-tan/llama.cpp-tq3](https://github.com/turbo-tan/llama.cpp-tq3)
- LM Studio setup: [docs/backend/LMStudio.md](https://github.com/turbo-tan/llama.cpp-tq3/blob/main/docs/backend/LMStudio.md)

## Files

| File | Quant | Size |
| --- | --- | ---: |
| `Qwen3.6-27B-TQ3_4S.gguf` | TQ3_4S | ~13.0 GB |
| `chat_template.jinja` | chat template | text |
| `thumbnail.png` | model card image | png |

## Local Validation

Hardware:

- RTX 5060 Ti 16 GB

Prompt processing:

- `llama-perplexity --chunks 10 -c 2048`
- `PPL = 6.2452 +/- 0.16138`
- `prompt eval = 712.02 tok/s`

16 GB VRAM fit checks on RTX 5060 Ti with the recommended KV settings:

- `32k` context fits
- `64k` context fits
- `128k` context does not fit

## Runtime Notes

- Use a TurboQuant-capable llama.cpp build for best performance.
- For llama.cpp, the intended runtime is the `turbo-tan/llama.cpp-tq3` fork.
- The upstream family is multimodal-capable, but the public 27B repos used here do not currently expose a separate GGUF `mmproj` artifact.
- For llama.cpp chat usage, keep `--jinja` enabled so the bundled chat template is honored.
- Upstream guidance recommends keeping at least `128K` context when possible for reasoning-heavy workloads. On smaller local GPUs, reduce context as needed to fit memory.
- Upstream default sampling guidance differs between thinking and non-thinking mode; follow the official Qwen card if you are trying to reproduce base-model behavior.

## Recommended llama.cpp Settings

Default prompt-processing settings on 16 GB:

```bash
llama-bench \
  -m Qwen3.6-27B-TQ3_4S.gguf \
  -ngl 99 \
  -ctk q4_0 \
  -ctv tq3_0 \
  -fa 1 \
  -p 2048 -n 0 -r 3
```

Default chat/server settings:

```bash
llama-server \
  -m Qwen3.6-27B-TQ3_4S.gguf \
  --host 127.0.0.1 --port 8080 \
  -ngl 99 -c 4096 -np 1 \
  -ctk q4_0 -ctv tq3_0 -fa on \
  --jinja
```

## Example

```bash
llama-cli \
  -m Qwen3.6-27B-TQ3_4S.gguf \
  --jinja \
  -ngl 99 \
  -c 4096
```

Build/runtime:

```bash
git clone https://github.com/turbo-tan/llama.cpp-tq3
```

## Qwen3.6 Base Model

> [!Note]
> The upstream Qwen repository contains model weights and configuration files for the post-trained model in the Hugging Face Transformers format.
>
> Those upstream artifacts are compatible with Hugging Face Transformers, vLLM, SGLang, KTransformers, and related runtimes.

Following the February release of the Qwen3.5 series, Qwen describes Qwen3.6 as the first open-weight Qwen3.6 variant, built for stronger stability and real-world utility.

### Qwen3.6 Highlights

- **Agentic Coding:** the model handles frontend workflows and repository-level reasoning with greater fluency and precision.
- **Thinking Preservation:** the model family retains reasoning context across historical turns to reduce overhead during iterative work.

![Benchmark Results](https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3.6/Figures/qwen3.6_27b_score.png)

### Model Overview

- Type: Causal Language Model with Vision Encoder
- Training Stage: Pre-training and Post-training
- Architecture: `qwen35`
- Parameters: `27B`
- Layers: `64`
- Embedding dimension: `5120`
- FFN dimension: `17408`
- Hidden layout: `16 × (3 × (Gated DeltaNet -> FFN) -> 1 × (Gated Attention -> FFN))`
- Gated DeltaNet heads: `48` for `V`, `16` for `QK`, head dim `128`
- Gated Attention heads: `24` for `Q`, `4` for `KV`, head dim `256`
- RoPE dim: `64`
- Native context: `262,144`

### Selected Upstream Benchmark Highlights

- `SWE-bench Verified`: `77.2`
- `Terminal-Bench 2.0`: `59.3`
- `SkillsBench Avg5`: `48.2`
- `GPQA Diamond`: `87.8`
- `AIME26`: `94.1`
- `MMMU`: `82.9`
- `AndroidWorld`: `70.3`

## Sources

- Upstream base model: [Qwen/Qwen3.6-27B](https://huggingface.co/Qwen/Qwen3.6-27B)
- Upstream GGUF source used for conversion: [unsloth/Qwen3.6-27B-GGUF](https://huggingface.co/unsloth/Qwen3.6-27B-GGUF)
- Upstream blog and benchmark context: [Qwen3.6-27B model card](https://huggingface.co/Qwen/Qwen3.6-27B)
- TurboQuant runtime fork: [turbo-tan/llama.cpp-tq3](https://github.com/turbo-tan/llama.cpp-tq3)