How to use from
Lemonade
Pull the model
# Download Lemonade from https://lemonade-server.ai/
lemonade pull localweights/Qwen3.5-4B-MTP-IQ4_XS-GGUF:IQ4_XS
Run and chat with the model
lemonade run user.Qwen3.5-4B-MTP-IQ4_XS-GGUF-IQ4_XS
List all available models
lemonade list
Quick Links

Qwen3.5-4B-MTP-GGUF

Qwen3.5-4B (qwen35 dense hybrid arch) with NextN/MTP head preserved. Built via the patched convert_hf_to_gguf.py from patched llama.cpp build with Qwen3.5/3.6 MTP support.

Files

File Size Purpose
Qwen3.5-4B-MTP-bf16.gguf 8.1 GB Source for further quantization. NextN tensors at blk.32.
Qwen3.5-4B-MTP-IQ4_XS.gguf 2.5 GB Production-ready quant. Fits in tiny-council slots.

Build pipeline

python convert_hf_to_gguf.py /path/to/Qwen3.5-4B \
    --outfile Qwen3.5-4B-MTP-bf16.gguf
llama-quantize Qwen3.5-4B-MTP-bf16.gguf \
    Qwen3.5-4B-MTP-IQ4_XS.gguf IQ4_XS

Required adding the qwen35 pre-tokenizer chkhsh entry to convert_hf_to_gguf.py:1531 (vendored in the fork).

Optimal serving config (RTX 3090 Ti)

Recommended --spec-draft-n-max 2 for this model size. Larger n drops accept rate faster than throughput grows; sweet spot is shallower than the 27B/35B (which peak at n=4).

llama-server -m Qwen3.5-4B-MTP-IQ4_XS.gguf \
  -ngl 999 -fa on \
  --spec-type mtp --spec-draft-n-max 2 \
  --no-mmap \
  --ctx-size 8192 -ctk q4_0 -ctv q4_0 \
  --parallel 1 --kv-unified \
  --metrics --jinja

Performance โ€” --spec-draft-n-max sweep

Measured 2026-05-06, IQ4_XS, 3090 Ti, no thinking, 200-token decode:

n Decode tok/s Accept rate
1 250 100%
2 290 โ† peak 98%
3 280 82%
4 264 70%
5 223 54%
6 204 48%
8 188 36%

Without spec-decode (baseline): 207 tok/s. So peak MTP gives +40% vs baseline.

Metric Value
Decode (best, n=2) 290 t/s
Speedup vs no-spec +40%
VRAM @ 8K ctx ~2.7 GB

Tokenizer

qwen35 pre-tokenizer, 151,936 vocab. Standard chat template.

License

Apache 2.0.

Provenance

Built on Crucible: 9950X / 96 GB DDR5 / RTX 3090 Ti. Sibling: localweights/Qwen3.6-{27B,35B-A3B}-MTP-IQ4_XS-GGUF.

Downloads last month
237
GGUF
Model size
4B params
Architecture
qwen35
Hardware compatibility
Log In to add your hardware

4-bit

Inference Providers NEW
This model isn't deployed by any Inference Provider. ๐Ÿ™‹ Ask for provider support

Model tree for localweights/Qwen3.5-4B-MTP-IQ4_XS-GGUF

Finetuned
Qwen/Qwen3.5-4B
Quantized
(235)
this model