Qwen3.5-27B-TQ8

TurboQuant-compressed version of Qwen/Qwen3.5-27B for near-lossless inference on Apple Silicon.

Compressed with turboquant-mlx-core using the TurboQuant algorithm (Zandieh et al., ICLR 2026).

Quality

Metric	Value
fp16 PPL	1.45
TQ8 PPL	1.46
PPL delta	0.18%
Compression	56% of original size
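
The two perplexity values in the table are rounded to two decimals, so recomputing the delta from them gives roughly 0.69% rather than the reported 0.18%. The sketch below assumes the reported delta was computed from unrounded perplexities (an assumption; the card does not say) and checks that 0.18% falls inside the range the rounding allows. The helper name and constants are illustrative only.

```python
def relative_delta_pct(baseline: float, quantized: float) -> float:
    """Relative change of `quantized` versus `baseline`, in percent."""
    return (quantized - baseline) / baseline * 100.0


FP16_PPL = 1.45            # as printed in the table (rounded)
TQ8_PPL = 1.46             # as printed in the table (rounded)
REPORTED_DELTA_PCT = 0.18  # as printed in the table
HALF_STEP = 0.005          # half of the last printed decimal place

# Delta recomputed from the rounded values.
naive = relative_delta_pct(FP16_PPL, TQ8_PPL)

# Extremes that are still compatible with two-decimal rounding.
smallest = relative_delta_pct(FP16_PPL + HALF_STEP, TQ8_PPL - HALF_STEP)
largest = relative_delta_pct(FP16_PPL - HALF_STEP, TQ8_PPL + HALF_STEP)

print(f"delta from rounded values: {naive:.2f}%")
print(f"range allowed by rounding: {smallest:.2f}% to {largest:.2f}%")
print("reported delta is consistent:", smallest <= REPORTED_DELTA_PCT <= largest)
```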

Qwen3.5-27B is a hybrid architecture (full attention + linear attention + Mamba SSM). It quantizes well, as the 0.18% perplexity delta above shows, which is attributed to its linear attention layers being inherently compressible.

Quantization Config

Property	Value
Method	TurboQuant 4+4 residual (8 effective bits)
Rotation	Walsh-Hadamard with hash-based signs
Codebooks	Per-layer Lloyd-Max fitted
Sensitive layers	First/last 4 at fp16
Block size	Adaptive (largest power-of-2 dividing in_features)
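
The configuration above can be illustrated with a small pure-Python sketch. It is not the turboquant-mlx-core implementation: the SHA-256 sign derivation, the per-block max-abs scaling, and the uniform 16-level quantizer (standing in for the fitted Lloyd-Max codebooks) are all assumptions made for illustration. It shows the adaptive block size, a sign-randomized Walsh-Hadamard rotation, and a 4+4 residual scheme in which a second 4-bit pass quantizes the error left by the first.

```python
import hashlib
import math
import random


def block_size(in_features: int) -> int:
    """Largest power of two that divides in_features."""
    return in_features & -in_features


def hash_signs(key: str, n: int) -> list:
    """Deterministic +/-1 signs from SHA-256 of `key` (hypothetical scheme)."""
    signs, counter = [], 0
    while len(signs) < n:
        digest = hashlib.sha256(f"{key}:{counter}".encode()).digest()
        for byte in digest:
            for bit in range(8):
                signs.append(1.0 if (byte >> bit) & 1 else -1.0)
        counter += 1
    return signs[:n]


def fwht(x: list) -> list:
    """Orthonormal fast Walsh-Hadamard transform; len(x) must be a power of two."""
    y, n, h = list(x), len(x), 1
    while h < n:
        for i in range(0, n, 2 * h):
            for j in range(i, i + h):
                a, b = y[j], y[j + h]
                y[j], y[j + h] = a + b, a - b
        h *= 2
    scale = 1.0 / math.sqrt(n)
    return [v * scale for v in y]


def rotate(x: list, signs: list) -> list:
    """Apply the random sign flips, then the Hadamard transform."""
    return fwht([s * v for s, v in zip(signs, x)])


def unrotate(y: list, signs: list) -> list:
    """Invert rotate(): the orthonormal transform is its own inverse."""
    return [s * v for s, v in zip(signs, fwht(y))]


def quantize_4bit(x: list, scale: float):
    """Uniform 16-level quantizer on [-scale, scale] (stand-in for Lloyd-Max)."""
    step = 2.0 * scale / 15.0
    codes = [min(15, max(0, round((v + scale) / step))) for v in x]
    return codes, [c * step - scale for c in codes]


def residual_4_plus_4(x: list):
    """Quantize x with 4 bits, then quantize the remaining error with 4 more."""
    s1 = max(abs(v) for v in x) or 1.0
    codes1, stage1 = quantize_4bit(x, s1)
    resid = [a - b for a, b in zip(x, stage1)]
    s2 = max(abs(v) for v in resid) or 1.0
    codes2, stage2 = quantize_4bit(resid, s2)
    return codes1, codes2, stage1, [a + b for a, b in zip(stage1, stage2)]


def l2(a: list, b: list) -> float:
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


random.seed(0)
n = block_size(5120)                      # 5120 = 5 * 1024, so the block is 1024
weights = [random.gauss(0.0, 0.02) for _ in range(n)]
signs = hash_signs("layer0.block0", n)    # hypothetical per-block key

rotated = rotate(weights, signs)
codes1, codes2, stage1, stage2 = residual_4_plus_4(rotated)
restored = unrotate(stage2, signs)

print("block size:", n)
print("L2 error after first 4 bits:", l2(rotated, stage1))
print("L2 error after 4+4 bits:   ", l2(rotated, stage2))
print("L2 error in weight space:  ", l2(weights, restored))
```

Because the rotation is orthonormal, the reconstruction error measured in the rotated domain carries over unchanged to the weights, which is why the last two printed values agree up to floating-point noise.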

Usage

# Serve via SwiftLM (dequantizes to BF16 on first load; the result is cached for subsequent runs)
SwiftLM --model ekovshilovsky/Qwen3.5-27B-TQ8 --port 5413

# Dequantize to fp16 for use with any MLX/HuggingFace loader
tq-dequant ./Qwen3.5-27B-TQ8 ./Qwen3.5-27B-fp16

Hardware Requirements

Apple Silicon Mac (M1 Pro+ recommended)
64 GB unified memory minimum (29 GB model + KV cache + overhead)
macOS 14+
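
As a rough sanity check on the memory figures, the sketch below derives the on-disk sizes from a nominal 27 billion parameters (an assumption taken from the model name; the exact count is not stated here) and the 56% size ratio from the Quality table. The result lands near, not exactly on, the 29 GB quoted above; the gap may come from the exact parameter count or from GB versus GiB, which this card does not specify.

```python
PARAMS = 27e9            # nominal count implied by the model name (assumption)
BYTES_PER_FP16 = 2
SIZE_RATIO = 0.56        # "56% of original size" from the Quality table
UNIFIED_MEMORY_GB = 64   # stated minimum
STATED_MODEL_GB = 29     # stated model footprint

fp16_gb = PARAMS * BYTES_PER_FP16 / 1e9
tq8_gb = fp16_gb * SIZE_RATIO
headroom_gb = UNIFIED_MEMORY_GB - STATED_MODEL_GB

print(f"fp16 weights:   {fp16_gb:.1f} GB")
print(f"TQ8 weights:    {tq8_gb:.1f} GB ({tq8_gb * 1e9 / 2**30:.1f} GiB)")
print(f"left for KV cache and overhead on a 64 GB machine: {headroom_gb} GB")
```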

Original Model

This is a quantized version of Qwen/Qwen3.5-27B by Alibaba Cloud. The original model is released under the Apache 2.0 License. All original model terms and conditions apply.

Quantization

Quantization performed by Eugene Kovshilovsky using turboquant-mlx-core (MIT License).
