How to use from
Lemonade
Pull the model
# Download Lemonade from https://lemonade-server.ai/
lemonade pull localweights/Qwen3.5-4B-MTP-Q4_K_M-GGUF:Q4_K_M
Run and chat with the model
lemonade run user.Qwen3.5-4B-MTP-Q4_K_M-GGUF-Q4_K_M
List all available models
lemonade list
Quick Links

Qwen3.5-4B-MTP-Q4_K_M-GGUF

Qwen3.5-4B (qwen35 dense+ssm hybrid arch) with NextN/MTP head preserved, quantized to Q4_K_M (5.25 BPW). Built via the patched convert_hf_to_gguf.py from patched llama.cpp build with Qwen3.5/3.6 MTP support.

Sibling: localweights/Qwen3.5-4B-MTP-IQ4_XS-GGUF (smaller, faster, 4.25 BPW).

Files

File Size Purpose
Qwen3.5-4B-MTP-Q4_K_M.gguf 2.7 GB Q4_K_M with 3 outlier tensors kept at bf16.

Build pipeline

python convert_hf_to_gguf.py /path/to/Qwen3.5-4B \
    --outfile Qwen3.5-4B-MTP-bf16.gguf

# 3 tensors have absurdly large values (1e19 to 1e36) that overflow Q4_K's
# fp16 scale-block storage. Keep them at bf16 to pass row-data validation.
llama-quantize \
  --tensor-type "blk\.9\.ssm_out\.weight=bf16" \
  --tensor-type "blk\.15\.attn_output\.weight=bf16" \
  --tensor-type "blk\.24\.ffn_gate\.weight=bf16" \
  Qwen3.5-4B-MTP-bf16.gguf \
  Qwen3.5-4B-MTP-Q4_K_M.gguf \
  Q4_K_M

Optimal serving config

llama-server -m Qwen3.5-4B-MTP-Q4_K_M.gguf \
  -ngl 999 -fa on \
  --spec-type mtp --spec-draft-n-max 2 \
  --no-mmap \
  --ctx-size 8192 -ctk q4_0 -ctv q4_0 \
  --parallel 1 --kv-unified \
  --metrics --jinja

Performance โ€” --spec-draft-n-max sweep

Measured 2026-05-06 on Crucible (9950X, 96 GB DDR5-4800 dual-channel, RTX 3090 Ti). Prompt: count 1โ†’50, 300-token decode.

GPU (RTX 3090 Ti, full offload)

n Decode tok/s Accept rate
2 271 โ† peak 100%
3 273 90%

CPU (16 threads, 2-channel DDR5-4800)

n Decode tok/s Accept rate
1 24.2 100%
2 30.4 โ† peak 100%
3 31.1 87%
4 30.9 75%
5 25.2 55%

Vs sibling IQ4_XS at peak: 271 t/s GPU (vs 289), 30.4 t/s CPU (vs 35.9). Q4_K_M trades ~6โ€“15% throughput for ~24% more bits per weight.

Tokenizer

qwen35 pre-tokenizer, 151,936 vocab. Standard chat template.

License

Apache 2.0.

Provenance

Built on Crucible: 9950X / 96 GB DDR5 / RTX 3090 Ti. Sibling repos: localweights/Qwen3.5-4B-MTP-IQ4_XS-GGUF, localweights/Qwen3.6-{27B,35B-A3B}-MTP-IQ4_XS-GGUF.

Downloads last month
793
GGUF
Model size
4B params
Architecture
qwen35
Hardware compatibility
Log In to add your hardware

4-bit

Inference Providers NEW
This model isn't deployed by any Inference Provider. ๐Ÿ™‹ Ask for provider support

Model tree for localweights/Qwen3.5-4B-MTP-Q4_K_M-GGUF

Finetuned
Qwen/Qwen3.5-4B
Quantized
(235)
this model