Qwen3.6-27B-GGUF-5.076bpw

This is a 5.076 BPW quantized model for the GPU poors with 24 GiB of VRAM. It uses the SOTA IQK quants, and thus works in ik_llama.cpp only.

From local testing with llama-perplexity (wiki.test.raw, 580 chunks), it has the best quality and speed in the same size class:

| quant | this | bartowski Q4_K_L | unsloth UD-Q4_K_XL | mradermacher i1.Q4_K_M | ubergarm IQ5_KS |
|---|---|---|---|---|---|
| Size (BPW) | 5.076 | 5.493 | 5.235 | 4.919 | 5.919 |
| Size (GiB) | 15.893 | 17.198 | 16.393 | 15.401 | 18.532 |
| VRAM usage (GiB) | 14.931 | 15.940 | 15.727 | 14.735 | 17.570 |
| Mean PPL(Q) | 6.982381 ± 0.046281 | 6.992025 ± 0.046419 | 6.970005 ± 0.046042 | 6.938483 ± 0.045668 | 6.931115 ± 0.045750 |
| Mean PPL(base) | 6.908506 ± 0.045543 | 6.908506 ± 0.045543 | 6.908506 ± 0.045543 | 6.908506 ± 0.045543 | 6.908506 ± 0.045543 |
| Cor(ln(PPL(Q)), ln(PPL(base))) | 99.47% | 99.40% | 99.45% | 99.32% | 99.79% |
| Mean KLD | 0.019613 ± 0.000587 | 0.020410 ± 0.000637 | 0.020399 ± 0.000643 | 0.024690 ± 0.000686 | 0.008430 ± 0.000466 |
| Maximum KLD | 22.204016 | 21.812332 | 20.409454 | 20.942572 | 21.321548 |
| 99.9% KLD | 1.703208 | 2.161336 | 2.067440 | 2.667204 | 0.698605 |
| RMS Δp | 3.843 ± 0.058 % | 3.812 ± 0.059 % | 3.778 ± 0.058 % | 4.264 ± 0.062 % | 2.467 ± 0.061 % |
| Same top p | 94.767 ± 0.058 % | 94.824 ± 0.058 % | 94.824 ± 0.058 % | 94.203 ± 0.061 % | 96.618 ± 0.047 % |
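
To make the perplexity numbers easier to compare, here is a minimal sketch that turns each Mean PPL(Q) into a relative increase over PPL(base). The values are copied from the table; the helper name `ppl_increase_pct` is made up for this illustration and is not part of llama-perplexity.

```python
import math

# Mean PPL values copied from the table above (wiki.test.raw, 580 chunks).
PPL_BASE = 6.908506
PPL_Q = {
    "this": 6.982381,
    "bartowski Q4_K_L": 6.992025,
    "unsloth UD-Q4_K_XL": 6.970005,
    "mradermacher i1.Q4_K_M": 6.938483,
    "ubergarm IQ5_KS": 6.931115,
}


def ppl_increase_pct(ppl_q, ppl_base=PPL_BASE):
    """Relative perplexity increase over the base model, in percent."""
    return (ppl_q / ppl_base - 1.0) * 100.0


for name, ppl in PPL_Q.items():
    log_ratio = math.log(ppl / PPL_BASE)
    print(f"{name:24s} +{ppl_increase_pct(ppl):.2f}%  ln(PPL(Q)/PPL(base)) = {log_ratio:.5f}")
```

Note that PPL and KLD do not rank the quants identically: mradermacher i1.Q4_K_M has a lower Mean PPL(Q) than this quant but a higher Mean KLD.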

With 24 GiB of VRAM, we can fit a context size of 128000 with F16 KV cache:

-c 128000 -wgt 1

or a context size of 262144 with quantized KV cache:

-c 262144 -wgt 1 -ctk q8_0 -khad -ctv q6_0 -vhad
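
For readers who want a complete invocation, the sketch below assembles the two flag sets into a full command line. It only uses flags that appear in this card, plus `-m` for the model path; the file name `Qwen3.6-27B-5.076bpw.gguf` and the helper `server_command` are hypothetical, and `-ngl 99` is borrowed from the benchmark flags further down on the assumption that llama-server accepts it the same way.

```python
import shlex

# Hypothetical local file name; substitute the path of the downloaded GGUF.
MODEL_PATH = "Qwen3.6-27B-5.076bpw.gguf"

# Flags shared by both setups (all taken from this card).
COMMON = ["-ngl", "99", "-wgt", "1"]

# The two KV cache configurations described above.
F16_KV = ["-c", "128000"]
QUANT_KV = ["-c", "262144", "-ctk", "q8_0", "-khad", "-ctv", "q6_0", "-vhad"]


def server_command(kv_flags, binary="llama-server", model=MODEL_PATH):
    """Build an argv list suitable for subprocess.run or for printing."""
    return [binary, "-m", model, *COMMON, *kv_flags]


print(shlex.join(server_command(F16_KV)))
print(shlex.join(server_command(QUANT_KV)))
```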

Size

Size from llama-server output:

llm_load_print_meta: model size       = 15.893 GiB (5.076 BPW)
llm_load_print_meta: repeating layers = 13.969 GiB (4.927 BPW, 24.353 B parameters)
...
llm_load_tensors:  CUDA_Host buffer size =   985.16 MiB
llm_load_tensors:      CUDA0 buffer size = 15288.91 MiB
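
The numbers in this log are consistent with each other, which the following sketch checks. It assumes only that BPW means bits per weight and that sizes are in binary units (GiB, MiB); the helper names are made up for this illustration.

```python
GIB = 1024 ** 3
MIB = 1024 ** 2


def params_from_size(size_gib, bpw):
    """Parameter count implied by a size in GiB and a bits-per-weight figure."""
    return size_gib * GIB * 8 / bpw


def buffers_to_gib(*buffers_mib):
    """Sum buffer sizes given in MiB and convert the total to GiB."""
    return sum(buffers_mib) * MIB / GIB


# Values copied from the llama-server output above.
total_params = params_from_size(15.893, 5.076)
repeating_params = params_from_size(13.969, 4.927)
loaded_gib = buffers_to_gib(985.16, 15288.91)
cuda0_gib = buffers_to_gib(15288.91)

print(f"total parameters     ~ {total_params / 1e9:.2f} B")
print(f"repeating parameters ~ {repeating_params / 1e9:.2f} B (log reports 24.353 B)")
print(f"host + CUDA0 buffers ~ {loaded_gib:.3f} GiB (log reports 15.893 GiB)")
print(f"CUDA0 buffer         ~ {cuda0_gib:.3f} GiB (table reports 14.931 GiB of VRAM)")
```

The CUDA0 buffer alone accounts for the "VRAM usage" row in the comparison table, and adding the CUDA_Host buffer recovers the full model size.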

This is slightly bigger than the Qwen3.5-27B-4.915bpw quant, due to these changes:

  • attention: mixture of IQ6_K and IQ5_K => Q6_0
  • token_embd: IQ4_K => Q6_0
  • output: IQ6_K => Q6_0

The recipe is almost identical to the IQ4_KS + Q6_0 recipe shared by IK in https://github.com/ikawrakow/ik_llama.cpp/discussions/1663, except that ssm_alpha and ssm_beta get a slight bump from Q6_0 to Q8_0.

IQ4_KS + Q6_0 form a good and fast combo, as noted by IK in https://github.com/ikawrakow/ik_llama.cpp/discussions/1663:

  • IQ4_KS is better than IQ4_XS, has the same size, has the same performance on CUDA and CPU
  • IQ5_K will wipe the floor with Q5_K in terms of quantization accuracy at the same bpw. One issue is that IQ5_K PP is lower on CUDA because of the block size of 16. It is about on par for TG on CUDA, about on par for TG on the CPU, and I think slightly faster PP on the CPU. If one does not want to take the CUDA performance penalty, one could replace Q5_K with IQ5_KS. This will be strictly faster, will use 0.25 bpw less than Q5_K, and will have about the same quantization accuracy.
  • I think in many cases one can replace Q6_K with Q6_0, which is quite a bit faster on CUDA while giving about the same quantization accuracy as Q6_K. IQ6_K is better, but slower.
  • The situation for Q4_K vs IQ4_KS and IQ4_K is similar to Q5_K vs IQ5_KS and IQ5_K

Recipe

blk\..*\.attn_q\.weight=q6_0
blk\..*\.attn_k\.weight=q6_0
blk\..*\.attn_v\.weight=q6_0
blk\..*\.attn_output\.weight=q6_0
blk\..*\.attn_gate\.weight=q6_0
blk\..*\.attn_qkv\.weight=q6_0

blk\..*\.ssm_alpha\.weight=q8_0
blk\..*\.ssm_beta\.weight=q8_0
blk\..*\.ssm_out\.weight=q6_0

blk\..*\.ffn_down\.weight=iq4_ks
blk\..*\.ffn_(gate|up)\.weight=iq4_ks

token_embd\.weight=q6_0
output\.weight=q6_0
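
Each recipe line is a `regex=type` pair matched against GGUF tensor names. The sketch below shows how the rules resolve for a few tensor names and how they could be joined into a single argument. It rests on two assumptions that should be checked against ik_llama.cpp itself: that rules are tried in order with the first match winning, and that llama-quantize accepts them as a comma-separated string via `--custom-q`. The function names are hypothetical, and Python's `re` is used as a stand-in for the regex engine in the real tool.

```python
import re

# The recipe from this card, one regex=type rule per line.
RECIPE = r"""
blk\..*\.attn_q\.weight=q6_0
blk\..*\.attn_k\.weight=q6_0
blk\..*\.attn_v\.weight=q6_0
blk\..*\.attn_output\.weight=q6_0
blk\..*\.attn_gate\.weight=q6_0
blk\..*\.attn_qkv\.weight=q6_0
blk\..*\.ssm_alpha\.weight=q8_0
blk\..*\.ssm_beta\.weight=q8_0
blk\..*\.ssm_out\.weight=q6_0
blk\..*\.ffn_down\.weight=iq4_ks
blk\..*\.ffn_(gate|up)\.weight=iq4_ks
token_embd\.weight=q6_0
output\.weight=q6_0
"""


def parse_recipe(text):
    """Turn the recipe text into an ordered list of (compiled regex, type)."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        pattern, qtype = line.rsplit("=", 1)
        rules.append((re.compile(pattern), qtype))
    return rules


def quant_type_for(tensor_name, rules, default=None):
    """Return the type of the first rule matching the name (assumed semantics)."""
    for pattern, qtype in rules:
        if pattern.search(tensor_name):
            return qtype
    return default


def custom_q_argument(text):
    """Join the rules with commas, the form assumed for --custom-q."""
    return ",".join(line.strip() for line in text.splitlines() if line.strip())


rules = parse_recipe(RECIPE)
for name in ["blk.12.ffn_up.weight", "blk.0.ssm_alpha.weight", "token_embd.weight"]:
    print(name, "->", quant_type_for(name, rules))
```

Tensors that match no rule (for example norm weights) return the default here; in a real run they would get whatever the quantizer's base type and built-in logic assign.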

Speed

llama-sweep-bench results on an RTX 3090, with flags -ngl 99 -mqkv -muge -cuda graphs=1 -c 128000 -wgt 1 -wb:

| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|---|---|---|---|---|---|---|
| 512 | 128 | 0 | 0.350 | 1461.47 | 3.088 | 41.45 |
| 512 | 128 | 10240 | 0.389 | 1315.93 | 3.230 | 39.62 |
| 512 | 128 | 20480 | 0.433 | 1183.72 | 3.311 | 38.65 |
| 512 | 128 | 30720 | 0.474 | 1080.64 | 3.413 | 37.50 |
| 512 | 128 | 40960 | 0.517 | 990.04 | 3.532 | 36.24 |
| 512 | 128 | 51200 | 0.560 | 914.01 | 3.636 | 35.20 |
| 512 | 128 | 61440 | 0.604 | 847.72 | 3.725 | 34.36 |
| 512 | 128 | 71680 | 0.647 | 791.61 | 3.826 | 33.46 |
| 512 | 128 | 81920 | 0.691 | 741.16 | 3.945 | 32.44 |
| 512 | 128 | 92160 | 0.735 | 696.16 | 4.053 | 31.58 |
| 512 | 128 | 102400 | 0.779 | 657.14 | 4.143 | 30.89 |
| 512 | 128 | 112640 | 0.823 | 622.27 | 4.261 | 30.04 |
| 512 | 128 | 122880 | 0.866 | 591.36 | 4.391 | 29.15 |
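
To summarize how throughput degrades with context depth, the sketch below computes how much of the empty-context speed is retained at the deepest measured point. The rows are copied from the benchmark above (N_KV, S_PP, S_TG); the helper `retention` is a hypothetical name used only here.

```python
# (N_KV, S_PP t/s, S_TG t/s) copied from the llama-sweep-bench results above.
ROWS = [
    (0, 1461.47, 41.45),
    (10240, 1315.93, 39.62),
    (20480, 1183.72, 38.65),
    (30720, 1080.64, 37.50),
    (40960, 990.04, 36.24),
    (51200, 914.01, 35.20),
    (61440, 847.72, 34.36),
    (71680, 791.61, 33.46),
    (81920, 741.16, 32.44),
    (92160, 696.16, 31.58),
    (102400, 657.14, 30.89),
    (112640, 622.27, 30.04),
    (122880, 591.36, 29.15),
]


def retention(rows, column):
    """Speed at the deepest N_KV divided by speed at N_KV = 0."""
    ordered = sorted(rows)  # sort by N_KV
    return ordered[-1][column] / ordered[0][column]


pp_ret = retention(ROWS, 1)  # prompt processing (S_PP)
tg_ret = retention(ROWS, 2)  # token generation (S_TG)
print(f"PP speed retained at N_KV={ROWS[-1][0]}: {pp_ret:.1%}")
print(f"TG speed retained at N_KV={ROWS[-1][0]}: {tg_ret:.1%}")
```

Token generation holds up considerably better than prompt processing as the KV cache fills.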

Performance

This quant uses the imatrix from mradermacher. On some long reasoning tasks, the full-precision model served from https://dashscope-intl.aliyuncs.com/compatible-mode/v1 succeeds about half the time, and this quant matches that rate when built with the mradermacher imatrix. Built without any imatrix, the same quant cannot solve these tasks at all. This finding vindicates the importance of the importance matrix.

On agentic tasks tested with pi agent, the tasks that can be reliably solved by the full precision model can also be reliably solved by this quant.

For mainline users with 24 GiB of VRAM, I would recommend i1-Q4_K_M from mradermacher, which also performs quite well from limited testing.

For IK users that need even higher quality, I would recommend IQ5_KS from ubergarm, which is near lossless.

Verdict

We get Sonnet 4.5 at home with a used RTX 3090.

