Qwen3.6-27B-GGUF-4.256bpw

This is a 4.256 BPW quant of Qwen3.6-27B for the GPU-poor with 16 GiB of VRAM. It works in both ik_llama.cpp and mainline llama.cpp.

It was quantized using the simplest ever recipe – Q8_0 for the tiny ssm_alpha and ssm_beta tensors, IQ4_XS for the rest.

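From local testing, it has the best quality and speed in its size class. The quality figures below come from llama-perplexity (wiki.test.raw, 580 chunks); the first three columns are the quants of roughly the same size, the last three are larger IQ4_XS quants: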

quant	this	bartowski Q3_K_M	unsloth UD-Q3_K_XL	mradermacher i1.IQ4_XS	bartowski IQ4_XS	unsloth IQ4_XS
Size (BPW)	4.256	4.270	4.302	4.483	4.556	4.589
Size (GiB)	13.327	13.370	13.469	14.036	14.266	14.369
VRAM usage (GiB)	12.698	12.861	12.803	13.407	13.637	13.703
Mean PPL(Q)	7.098696 ± 0.047344	6.993009 ± 0.046208	6.995519 ± 0.046227	7.020660 ± 0.046587	6.996323 ± 0.046332	6.950126 ± 0.045846
Mean PPL(base)	6.908506 ± 0.045543	6.908506 ± 0.045543	6.908506 ± 0.045543	6.908506 ± 0.045543	6.908506 ± 0.045543	6.908506 ± 0.045543
Cor(ln(PPL(Q)), ln(PPL(base)))	99.19%	98.52%	98.82%	99.30%	99.32%	99.38%
Mean KLD	0.033452 ± 0.000723	0.058818 ± 0.000881	0.046348 ± 0.000841	0.027289 ± 0.000660	0.026270 ± 0.000653	0.024728 ± 0.000603
Maximum KLD	23.255085	24.616274	24.175169	18.568180	22.992002	21.687405
99.9% KLD	2.907350	3.986622	3.614290	2.667850	2.385293	2.201674
RMS Δp	4.936 ± 0.054 %	6.690 ± 0.059 %	5.867 ± 0.060 %	4.449 ± 0.057 %	4.352 ± 0.057 %	4.264 ± 0.056 %
Same top p	92.427 ± 0.069 %	90.350 ± 0.077 %	91.829 ± 0.071 %	93.903 ± 0.062 %	93.888 ± 0.062 %	93.997 ± 0.062 %

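Compared to Q3_K_M from bartowski and UD-Q3_K_XL from unsloth, this IQ4_XS quant uses slightly less VRAM and diverges less from the base model (mean KLD 0.033 versus 0.059 and 0.046, with a higher same-top-token rate), although its mean perplexity on wiki.test.raw is slightly higher (7.099 versus about 6.99).
The IQ4_XS quants from mradermacher, bartowski, and unsloth have better quality still (mean KLD 0.025 to 0.027), but they use about 0.7 to 1.0 GiB more VRAM and are harder to fit into 16 GiB of VRAM.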
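
To make the comparison easier to check, here is a small Python sketch that ranks the six quants by mean KLD and expresses each perplexity as a percentage increase over the base model. All numbers are copied from the table above; the helper name `ppl_increase_pct` is only for illustration.

```python
# Figures copied from the comparison table: (BPW, mean PPL(Q), mean KLD).
BASE_PPL = 6.908506
QUANTS = {
    "this": (4.256, 7.098696, 0.033452),
    "bartowski Q3_K_M": (4.270, 6.993009, 0.058818),
    "unsloth UD-Q3_K_XL": (4.302, 6.995519, 0.046348),
    "mradermacher i1.IQ4_XS": (4.483, 7.020660, 0.027289),
    "bartowski IQ4_XS": (4.556, 6.996323, 0.026270),
    "unsloth IQ4_XS": (4.589, 6.950126, 0.024728),
}


def ppl_increase_pct(ppl_q, ppl_base=BASE_PPL):
    """Perplexity increase of a quant over the base model, in percent."""
    return (ppl_q / ppl_base - 1.0) * 100.0


# Lower mean KLD means the quant's token distribution is closer to the base model.
by_kld = sorted(QUANTS, key=lambda name: QUANTS[name][2])

for name in by_kld:
    bpw, ppl, kld = QUANTS[name]
    print(f"{name:24s} {bpw:.3f} BPW  KLD {kld:.6f}  PPL +{ppl_increase_pct(ppl):.2f}%")
```

Sorted this way, the three larger IQ4_XS quants come first, this quant is fourth, and the two Q3 quants come last, even though this quant has the largest perplexity increase of the six.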

With 16 GiB of VRAM, we can fit a context size of 65536 with quantized KV cache:

# mainline llama.cpp
-c 65536 -ctk q8_0 -ctv q8_0 -np 1

For brave souls who seek the TurboQuant experience (see #21038), we can also fit a context size of 128000 with a more heavily quantized KV cache:

# mainline llama.cpp
-c 128000 -ctk q4_0 -ctv q4_0 -np 1
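
A rough way to see why both settings fit in the same VRAM budget: the context-dependent part of the KV cache grows with context length times bits per cached element. The sketch below compares the two settings on that basis. It assumes K and V use the same cache type (as in both commands above) and ignores everything that does not scale with context, such as compute buffers and the recurrent state of the ssm layers, so treat it as an estimate rather than a measurement. The function name `relative_kv_cost` is made up for this example.

```python
# Storage cost per cached element for llama.cpp KV cache types.
# q8_0 packs 32 values into 34 bytes, q4_0 packs 32 values into 18 bytes.
BITS_PER_ELEMENT = {
    "f16": 16.0,
    "q8_0": 34 * 8 / 32,  # 8.5 bits
    "q4_0": 18 * 8 / 32,  # 4.5 bits
}


def relative_kv_cost(ctx, cache_type):
    """Context length times bits per element, proportional to KV cache size."""
    return ctx * BITS_PER_ELEMENT[cache_type]


q8_64k = relative_kv_cost(65536, "q8_0")
q4_128k = relative_kv_cost(128000, "q4_0")

# Ratio close to 1 means the two settings need about the same KV memory.
print(f"q4_0 @ 128000 vs q8_0 @ 65536: {q4_128k / q8_64k:.3f}x")
```

Because the model dimensions cancel out in the ratio, no architecture details are needed: the q4_0 cache at 128000 tokens comes out only a few percent larger than the q8_0 cache at 65536 tokens.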

Size

Size from llama-server output:

llm_load_print_meta: model size       = 13.327 GiB (4.256 BPW)
llm_load_print_meta: repeating layers = 12.069 GiB (4.257 BPW, 24.353 B parameters)
...
llm_load_tensors:  CUDA_Host buffer size =   644.14 MiB
llm_load_tensors:      CUDA0 buffer size = 13003.14 MiB
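
The numbers in this log line up with the table at the top, which the following Python sketch checks: the two buffers add up to the reported model size, the CUDA0 buffer alone matches the "VRAM usage" row, and size and BPW together give back the reported parameter count. The helper `params_billion` is only for illustration.

```python
MIB_PER_GIB = 1024
BYTES_PER_GIB = 1024 ** 3

# Buffer sizes from the llama-server log, in MiB.
cuda_host_mib = 644.14
cuda0_mib = 13003.14

total_gib = (cuda_host_mib + cuda0_mib) / MIB_PER_GIB  # whole model
vram_gib = cuda0_mib / MIB_PER_GIB                     # part held on the GPU


def params_billion(size_gib, bpw):
    """Parameter count in billions implied by a size in GiB and bits per weight."""
    return size_gib * BYTES_PER_GIB * 8 / bpw / 1e9


print(f"model size: {total_gib:.3f} GiB")
print(f"VRAM usage: {vram_gib:.3f} GiB")
print(f"repeating layers: {params_billion(12.069, 4.257):.3f} B parameters")
```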

Recipe

blk\..*\.attn_q\.weight=iq4_xs
blk\..*\.attn_k\.weight=iq4_xs
blk\..*\.attn_v\.weight=iq4_xs
blk\..*\.attn_output\.weight=iq4_xs
blk\..*\.attn_gate\.weight=iq4_xs
blk\..*\.attn_qkv\.weight=iq4_xs

blk\..*\.ssm_alpha\.weight=q8_0
blk\..*\.ssm_beta\.weight=q8_0
blk\..*\.ssm_out\.weight=iq4_xs

blk\..*\.ffn_down\.weight=iq4_xs
blk\..*\.ffn_(gate|up)\.weight=iq4_xs

token_embd\.weight=iq4_xs
output\.weight=iq4_xs
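
To see which rule a given tensor falls under, the recipe can be replayed in a few lines of Python. This is only a sketch: it assumes first-match regex search semantics, which may differ from how the quantize tool actually applies the rules, and the tensor names in the usage lines are illustrative rather than read from the GGUF file.

```python
import re

# The recipe above, one "regex=type" rule per line.
RECIPE = r"""
blk\..*\.attn_q\.weight=iq4_xs
blk\..*\.attn_k\.weight=iq4_xs
blk\..*\.attn_v\.weight=iq4_xs
blk\..*\.attn_output\.weight=iq4_xs
blk\..*\.attn_gate\.weight=iq4_xs
blk\..*\.attn_qkv\.weight=iq4_xs
blk\..*\.ssm_alpha\.weight=q8_0
blk\..*\.ssm_beta\.weight=q8_0
blk\..*\.ssm_out\.weight=iq4_xs
blk\..*\.ffn_down\.weight=iq4_xs
blk\..*\.ffn_(gate|up)\.weight=iq4_xs
token_embd\.weight=iq4_xs
output\.weight=iq4_xs
"""


def parse_recipe(text):
    """Turn the recipe text into a list of (compiled regex, quant type) rules."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        pattern, qtype = line.rsplit("=", 1)
        rules.append((re.compile(pattern), qtype))
    return rules


def quant_type_for(tensor_name, rules):
    """Return the type of the first matching rule, or None if no rule matches."""
    for pattern, qtype in rules:
        if pattern.search(tensor_name):
            return qtype
    return None


rules = parse_recipe(RECIPE)
for name in ["blk.0.ssm_alpha.weight", "blk.12.ffn_up.weight", "token_embd.weight"]:
    print(name, "->", quant_type_for(name, rules))
```

Tensors that match no rule (norm weights, for example) return None here; in a real run they would be left to the quantize tool's defaults.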

Speed

llama-sweep-bench results on an RTX 3090, with flags -ngl 99 -mqkv -muge -cuda graphs=1 -c 128000 -wgt 1 -wb:

PP	TG	N_KV	T_PP s	S_PP t/s	T_TG s	S_TG t/s
512	128	0	0.335	1526.19	2.632	48.64
512	128	10240	0.376	1362.66	2.787	45.93
512	128	20480	0.416	1231.97	2.870	44.60
512	128	30720	0.457	1119.71	2.964	43.19
512	128	40960	0.500	1024.24	3.080	41.56
512	128	51200	0.545	940.27	3.183	40.21
512	128	61440	0.589	868.63	3.277	39.06
512	128	71680	0.630	812.78	3.378	37.89
512	128	81920	0.673	760.29	3.497	36.60
512	128	92160	0.716	715.36	3.605	35.51
512	128	102400	0.761	672.98	3.696	34.64
512	128	112640	0.802	638.68	3.798	33.70
512	128	122880	0.843	607.28	3.917	32.68
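
For reading the table: the S_PP and S_TG columns are just tokens divided by seconds, so they can be recomputed from the other columns. The Python sketch below does that for three of the rows (first, middle, last) and also works out how much throughput is left at the deepest context. Small differences from the reported speeds are expected, since the times in the table are rounded to three decimals.

```python
# (PP, TG, N_KV, T_PP s, S_PP t/s, T_TG s, S_TG t/s), copied from the table.
ROWS = [
    (512, 128, 0, 0.335, 1526.19, 2.632, 48.64),
    (512, 128, 61440, 0.589, 868.63, 3.277, 39.06),
    (512, 128, 122880, 0.843, 607.28, 3.917, 32.68),
]


def throughput(tokens, seconds):
    """Tokens per second for one prompt-processing or generation batch."""
    return tokens / seconds


for pp, tg, n_kv, t_pp, s_pp, t_tg, s_tg in ROWS:
    print(f"N_KV {n_kv:6d}: PP {throughput(pp, t_pp):8.2f} t/s (reported {s_pp}), "
          f"TG {throughput(tg, t_tg):6.2f} t/s (reported {s_tg})")

# Fraction of the empty-context speed that remains at N_KV = 122880.
pp_retained = ROWS[-1][4] / ROWS[0][4]
tg_retained = ROWS[-1][6] / ROWS[0][6]
print(f"PP retained: {pp_retained:.1%}, TG retained: {tg_retained:.1%}")
```

In other words, at about 123k tokens of context, token generation still runs at roughly two thirds of its empty-context speed, while prompt processing drops to about 40%.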

Performance

This quant uses the imatrix from mradermacher. It performs well enough in long reasoning and agentic tasks.

Base model: Qwen/Qwen3.6-27B (this repository: sokann/Qwen3.6-27B-GGUF-4.256bpw)