Qwen3.6-27B-GGUF-4.256bpw

This is a 4.256 BPW quantized model for the GPU poors with 16 GiB of VRAM. It works in both ik_llama.cpp and mainline llama.cpp.

It was quantized using the simplest ever recipe – Q8_0 for the tiny ssm_alpha and ssm_beta tensors, IQ4_XS for the rest.

In local testing with llama-perplexity (wiki.test.raw, 580 chunks), it has the best quality and speed in its size class:

| quant | this | bartowski Q3_K_M | unsloth UD-Q3_K_XL | mradermacher i1.IQ4_XS | bartowski IQ4_XS | unsloth IQ4_XS |
| --- | --- | --- | --- | --- | --- | --- |
| Size (BPW) | 4.256 | 4.270 | 4.302 | 4.483 | 4.556 | 4.589 |
| Size (GiB) | 13.327 | 13.370 | 13.469 | 14.036 | 14.266 | 14.369 |
| VRAM usage (GiB) | 12.698 | 12.861 | 12.803 | 13.407 | 13.637 | 13.703 |
| Mean PPL(Q) | 7.098696 ± 0.047344 | 6.993009 ± 0.046208 | 6.995519 ± 0.046227 | 7.020660 ± 0.046587 | 6.996323 ± 0.046332 | 6.950126 ± 0.045846 |
| Mean PPL(base) | 6.908506 ± 0.045543 | 6.908506 ± 0.045543 | 6.908506 ± 0.045543 | 6.908506 ± 0.045543 | 6.908506 ± 0.045543 | 6.908506 ± 0.045543 |
| Cor(ln(PPL(Q)), ln(PPL(base))) | 99.19% | 98.52% | 98.82% | 99.30% | 99.32% | 99.38% |
| Mean KLD | 0.033452 ± 0.000723 | 0.058818 ± 0.000881 | 0.046348 ± 0.000841 | 0.027289 ± 0.000660 | 0.026270 ± 0.000653 | 0.024728 ± 0.000603 |
| Maximum KLD | 23.255085 | 24.616274 | 24.175169 | 18.568180 | 22.992002 | 21.687405 |
| 99.9% KLD | 2.907350 | 3.986622 | 3.614290 | 2.667850 | 2.385293 | 2.201674 |
| RMS Δp | 4.936 ± 0.054 % | 6.690 ± 0.059 % | 5.867 ± 0.060 % | 4.449 ± 0.057 % | 4.352 ± 0.057 % | 4.264 ± 0.056 % |
| Same top p | 92.427 ± 0.069 % | 90.350 ± 0.077 % | 91.829 ± 0.071 % | 93.903 ± 0.062 % | 93.888 ± 0.062 % | 93.997 ± 0.062 % |
  • Compared to Q3_K_M from bartowski and UD-Q3_K_XL from unsloth, this IQ4_XS quant uses slightly less VRAM and scores better on mean KLD, RMS Δp, and same-top-p agreement, although its mean PPL is slightly higher.
  • The IQ4_XS quants from mradermacher, bartowski, and unsloth score better on every quality metric in the table, but they use more VRAM and are harder to fit into 16 GiB of VRAM.
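
To put the perplexity rows on a common scale, the relative PPL increase over the unquantized base can be derived from the table. This is a minimal Python sketch that uses only the Mean PPL values above; the names `PPL_Q` and `ppl_increase_pct` are illustrative and not part of llama-perplexity.

```python
# Mean PPL values copied from the table above (error bars omitted).
PPL_BASE = 6.908506
PPL_Q = {
    "this": 7.098696,
    "bartowski Q3_K_M": 6.993009,
    "unsloth UD-Q3_K_XL": 6.995519,
    "mradermacher i1.IQ4_XS": 7.020660,
    "bartowski IQ4_XS": 6.996323,
    "unsloth IQ4_XS": 6.950126,
}


def ppl_increase_pct(ppl_q: float, ppl_base: float = PPL_BASE) -> float:
    """Relative perplexity increase over the base model, in percent."""
    return (ppl_q / ppl_base - 1.0) * 100.0


for name, ppl in PPL_Q.items():
    print(f"{name:24s} +{ppl_increase_pct(ppl):.2f}%")
```

Ranked by perplexity alone, this quant comes last, while the KLD, RMS Δp, and same-top-p rows place it ahead of the two Q3 quants, so it is worth reading both kinds of metric together.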

With 16 GiB of VRAM, we can fit a context size of 65536 with a quantized KV cache:

# mainline llama.cpp
-c 65536 -ctk q8_0 -ctv q8_0 -np 1

For brave souls who seek the TurboQuant experience (see #21038), we can also fit a context size of 128000 with a more heavily quantized KV cache:

# mainline llama.cpp
-c 128000 -ctk q4_0 -ctv q4_0 -np 1
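
A rough way to see why the two settings land in a similar VRAM budget is to compare tokens times bits per cached value. The sketch below relies on the ggml block layouts (32 values plus one fp16 scale per block, so 34 bytes for q8_0 and 18 bytes for q4_0). It assumes the cache grows linearly with context and ignores anything that does not scale with context, such as compute buffers or recurrent state; the function names are illustrative.

```python
# ggml block layouts: 32 values per block plus one 2-byte fp16 scale.
BLOCK_BYTES = {"q8_0": 34, "q4_0": 18}
BLOCK_VALUES = 32


def bits_per_value(cache_type: str) -> float:
    """Average storage cost of one cached value, in bits."""
    return BLOCK_BYTES[cache_type] * 8 / BLOCK_VALUES


def relative_cache_cost(n_ctx: int, cache_type: str) -> float:
    """KV cache size in arbitrary units: tokens x bits per cached value."""
    return n_ctx * bits_per_value(cache_type)


q8 = relative_cache_cost(65536, "q8_0")
q4 = relative_cache_cost(128000, "q4_0")
print(bits_per_value("q8_0"), bits_per_value("q4_0"))
print(f"q4_0 @ 128000 / q8_0 @ 65536 = {q4 / q8:.3f}")
```

Under these assumptions the q4_0 cache at 128000 tokens is only a few percent larger than the q8_0 cache at 65536 tokens.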

Size

Size from llama-server output:

llm_load_print_meta: model size       = 13.327 GiB (4.256 BPW)
llm_load_print_meta: repeating layers = 12.069 GiB (4.257 BPW, 24.353 B parameters)
...
llm_load_tensors:  CUDA_Host buffer size =   644.14 MiB
llm_load_tensors:      CUDA0 buffer size = 13003.14 MiB
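
The figures in this log can be cross-checked against each other and against the comparison table. This is a small Python sketch using only numbers from the log above; the helper names are illustrative.

```python
GIB = 1024 ** 3
MIB = 1024 ** 2


def bits_per_weight(size_gib: float, n_params: float) -> float:
    """Average bits per weight for a tensor group of the given size."""
    return size_gib * GIB * 8 / n_params


def mib_to_gib(mib: float) -> float:
    return mib * MIB / GIB


# Repeating layers: 12.069 GiB over 24.353 B parameters.
print(round(bits_per_weight(12.069, 24.353e9), 3))

# Buffer sizes reported by llm_load_tensors.
cuda_host_mib, cuda0_mib = 644.14, 13003.14
print(round(mib_to_gib(cuda0_mib), 3))                  # GPU-resident weights
print(round(mib_to_gib(cuda_host_mib + cuda0_mib), 3))  # host + GPU buffers
```

The three results reproduce the 4.257 BPW of the repeating layers, the 12.698 GiB VRAM usage listed in the table, and the 13.327 GiB total model size.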

Recipe

blk\..*\.attn_q\.weight=iq4_xs
blk\..*\.attn_k\.weight=iq4_xs
blk\..*\.attn_v\.weight=iq4_xs
blk\..*\.attn_output\.weight=iq4_xs
blk\..*\.attn_gate\.weight=iq4_xs
blk\..*\.attn_qkv\.weight=iq4_xs

blk\..*\.ssm_alpha\.weight=q8_0
blk\..*\.ssm_beta\.weight=q8_0
blk\..*\.ssm_out\.weight=iq4_xs

blk\..*\.ffn_down\.weight=iq4_xs
blk\..*\.ffn_(gate|up)\.weight=iq4_xs

token_embd\.weight=iq4_xs
output\.weight=iq4_xs
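
To illustrate how such a recipe maps tensor names to quantization types, here is a minimal Python sketch. It assumes each rule is a `regex=type` pair and that the first matching rule wins, searched anywhere in the tensor name; the quantization tool may resolve rules differently. The tensor names in the usage lines follow common GGUF naming and are examples, not a dump of this model.

```python
import re

RECIPE = r"""
blk\..*\.attn_q\.weight=iq4_xs
blk\..*\.attn_k\.weight=iq4_xs
blk\..*\.attn_v\.weight=iq4_xs
blk\..*\.attn_output\.weight=iq4_xs
blk\..*\.attn_gate\.weight=iq4_xs
blk\..*\.attn_qkv\.weight=iq4_xs
blk\..*\.ssm_alpha\.weight=q8_0
blk\..*\.ssm_beta\.weight=q8_0
blk\..*\.ssm_out\.weight=iq4_xs
blk\..*\.ffn_down\.weight=iq4_xs
blk\..*\.ffn_(gate|up)\.weight=iq4_xs
token_embd\.weight=iq4_xs
output\.weight=iq4_xs
"""

# Parse "regex=type" lines; split on the last "=" so the regex stays intact.
RULES = [
    (re.compile(pattern), qtype)
    for pattern, qtype in (
        line.rsplit("=", 1) for line in RECIPE.splitlines() if line.strip()
    )
]


def resolve(tensor_name: str):
    """Return the type of the first matching rule (assumed), or None."""
    for pattern, qtype in RULES:
        if pattern.search(tensor_name):
            return qtype
    return None


for name in ("blk.0.ssm_alpha.weight", "blk.12.ffn_up.weight",
             "token_embd.weight", "blk.0.attn_norm.weight"):
    print(name, "->", resolve(name))
```

Tensors that match no rule, such as the norm weights in the last example, return `None` here; how the real tool treats them is outside this sketch.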

Speed

llama-sweep-bench results on an RTX 3090, with flags -ngl 99 -mqkv -muge -cuda graphs=1 -c 128000 -wgt 1 -wb:

| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
| --- | --- | --- | --- | --- | --- | --- |
| 512 | 128 | 0 | 0.335 | 1526.19 | 2.632 | 48.64 |
| 512 | 128 | 10240 | 0.376 | 1362.66 | 2.787 | 45.93 |
| 512 | 128 | 20480 | 0.416 | 1231.97 | 2.870 | 44.60 |
| 512 | 128 | 30720 | 0.457 | 1119.71 | 2.964 | 43.19 |
| 512 | 128 | 40960 | 0.500 | 1024.24 | 3.080 | 41.56 |
| 512 | 128 | 51200 | 0.545 | 940.27 | 3.183 | 40.21 |
| 512 | 128 | 61440 | 0.589 | 868.63 | 3.277 | 39.06 |
| 512 | 128 | 71680 | 0.630 | 812.78 | 3.378 | 37.89 |
| 512 | 128 | 81920 | 0.673 | 760.29 | 3.497 | 36.60 |
| 512 | 128 | 92160 | 0.716 | 715.36 | 3.605 | 35.51 |
| 512 | 128 | 102400 | 0.761 | 672.98 | 3.696 | 34.64 |
| 512 | 128 | 112640 | 0.802 | 638.68 | 3.798 | 33.70 |
| 512 | 128 | 122880 | 0.843 | 607.28 | 3.917 | 32.68 |
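
To summarize how throughput degrades as the KV cache fills, the fraction of empty-context speed retained at each depth can be computed from the table. This is a minimal Python sketch over the numbers above; the variable and function names are illustrative.

```python
# (N_KV, S_PP t/s, S_TG t/s) rows copied from the sweep-bench table above.
ROWS = [
    (0, 1526.19, 48.64),
    (10240, 1362.66, 45.93),
    (20480, 1231.97, 44.60),
    (30720, 1119.71, 43.19),
    (40960, 1024.24, 41.56),
    (51200, 940.27, 40.21),
    (61440, 868.63, 39.06),
    (71680, 812.78, 37.89),
    (81920, 760.29, 36.60),
    (92160, 715.36, 35.51),
    (102400, 672.98, 34.64),
    (112640, 638.68, 33.70),
    (122880, 607.28, 32.68),
]


def retained(rows, column: int):
    """Map N_KV to the throughput fraction relative to the first row."""
    baseline = rows[0][column]
    return {row[0]: row[column] / baseline for row in rows}


pp_retained = retained(ROWS, 1)  # prompt processing
tg_retained = retained(ROWS, 2)  # token generation

for n_kv in (0, 61440, 122880):
    print(f"N_KV={n_kv:6d}  PP {pp_retained[n_kv]:.1%}  TG {tg_retained[n_kv]:.1%}")
```

At a depth of 122880 tokens, token generation keeps roughly two thirds of its empty-context speed, while prompt processing drops to about 40%.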

Performance

This quant uses the imatrix from mradermacher. It performs well enough in long reasoning tasks and agentic tasks.

Model tree for sokann/Qwen3.6-27B-GGUF-4.256bpw

Base model: Qwen/Qwen3.6-27B