Qwen3.5-27B-TQ8

TurboQuant-compressed version of Qwen/Qwen3.5-27B for near-lossless inference on Apple Silicon.

Compressed with turboquant-mlx-core using the TurboQuant algorithm (Zandieh et al., ICLR 2026).

Quality

Metric        Value
fp16 PPL      1.45
TQ8 PPL       1.46
PPL delta     0.18%
Compression   56% of original size
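
The figures above can be sanity-checked with a few lines of arithmetic. The sketch below assumes the PPL delta is the relative increase (TQ8 - fp16) / fp16, and that the original checkpoint holds a nominal 27 billion parameters at 2 bytes each; neither is stated in this card, and the helper names are hypothetical. Note that the rounded perplexities shown in the table give a delta of about 0.69%, so the reported 0.18% was presumably computed from unrounded values.

```python
def relative_delta_pct(baseline: float, quantized: float) -> float:
    """Relative perplexity increase, in percent (assumed definition)."""
    return 100.0 * (quantized - baseline) / baseline


def compressed_size_gb(params_billions: float, bytes_per_param: float, ratio: float) -> float:
    """Approximate on-disk size in GB after compression to `ratio` of the original."""
    return params_billions * bytes_per_param * ratio


# Rounded table values: 1.45 -> 1.46
print(round(relative_delta_pct(1.45, 1.46), 2))   # 0.69

# Assumed 27B params * 2 bytes (fp16) * 0.56
print(round(compressed_size_gb(27, 2, 0.56), 1))  # 30.2
```

The size estimate of roughly 30 GB is in line with the 29 GB model footprint listed under Hardware Requirements.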

Qwen3.5-27B is a hybrid architecture (full attention + linear attention + Mamba SSM). It quantizes exceptionally well (0.18% PPL increase at 8 effective bits), which is attributed to its inherently compressible linear attention layers.

Quantization Config

Property          Value
Method            TurboQuant 4+4 residual (8 effective bits)
Rotation          Walsh-Hadamard with hash-based signs
Codebooks         Per-layer Lloyd-Max fitted
Sensitive layers  First/last 4 at fp16
Block size        Adaptive (largest power-of-2 dividing in_features)
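
To make the terms in this table concrete, here is a minimal pure-Python sketch of the general technique: pick the block size, apply a sign-flipped Walsh-Hadamard rotation, then quantize with a 4-bit Lloyd-Max codebook followed by a second 4-bit codebook fitted to the residual. It is not the turboquant-mlx-core implementation. All function names are hypothetical, using SHA-256 for the signs is an assumption, and details such as per-block scaling, per-layer codebook sharing, and the fp16 sensitive layers are left out.

```python
import hashlib
import math
import random


def block_size(in_features: int) -> int:
    """Largest power of two that divides in_features (its lowest set bit)."""
    return in_features & -in_features


def hash_signs(n: int, seed: int = 0) -> list:
    """Deterministic +/-1 per index derived from a hash (SHA-256 is an assumption)."""
    return [1.0 if hashlib.sha256(f"{seed}:{i}".encode()).digest()[0] & 1 else -1.0
            for i in range(n)]


def fwht(x: list) -> list:
    """Orthonormal fast Walsh-Hadamard transform; len(x) must be a power of two."""
    y, n, h = list(x), len(x), 1
    while h < n:
        for i in range(0, n, 2 * h):
            for j in range(i, i + h):
                y[j], y[j + h] = y[j] + y[j + h], y[j] - y[j + h]
        h *= 2
    return [v / math.sqrt(n) for v in y]


def rotate(x: list, signs: list) -> list:
    return fwht([s * v for s, v in zip(signs, x)])


def unrotate(y: list, signs: list) -> list:
    # The orthonormal WHT is its own inverse; undo the sign flips afterwards.
    return [s * v for s, v in zip(signs, fwht(y))]


def nearest(centers: list, v: float) -> int:
    return min(range(len(centers)), key=lambda k: abs(centers[k] - v))


def lloyd_max(samples: list, levels: int = 16, iters: int = 25) -> list:
    """1-D Lloyd-Max codebook: alternate nearest-center assignment and mean update."""
    s = sorted(samples)
    centers = [s[(2 * i + 1) * len(s) // (2 * levels)] for i in range(levels)]
    for _ in range(iters):
        sums, counts = [0.0] * levels, [0] * levels
        for v in samples:
            k = nearest(centers, v)
            sums[k] += v
            counts[k] += 1
        centers = [sums[k] / counts[k] if counts[k] else centers[k]
                   for k in range(levels)]
    return centers


def quantize_4p4(y: list):
    """Stage 1: 4-bit codebook on y. Stage 2: 4-bit codebook on the residual."""
    cb1 = lloyd_max(y)
    idx1 = [nearest(cb1, v) for v in y]
    resid = [v - cb1[i] for v, i in zip(y, idx1)]
    cb2 = lloyd_max(resid)
    idx2 = [nearest(cb2, r) for r in resid]
    return cb1, idx1, cb2, idx2


def dequantize_4p4(cb1, idx1, cb2, idx2) -> list:
    return [cb1[i] + cb2[j] for i, j in zip(idx1, idx2)]


rng = random.Random(0)
n = block_size(768)                      # 768 = 3 * 256, so the block is 256 wide
w = [rng.gauss(0.0, 1.0) for _ in range(n)]
signs = hash_signs(n)
y = rotate(w, signs)
cb1, idx1, cb2, idx2 = quantize_4p4(y)
w_hat = unrotate(dequantize_4p4(cb1, idx1, cb2, idx2), signs)
mse = sum((a - b) ** 2 for a, b in zip(w, w_hat)) / n
print(f"block={n} mse={mse:.2e}")
```

Each value is stored as two 4-bit indices, which is where "8 effective bits" comes from. Because the rotation is orthonormal, the reconstruction error in the rotated domain carries over unchanged to the original weights.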

Usage

# Serve via SwiftLM (dequantizes to BF16 on first load; the result is cached for subsequent runs)
SwiftLM --model ekovshilovsky/Qwen3.5-27B-TQ8 --port 5413

# Dequantize to fp16 for use with any MLX/HuggingFace loader
tq-dequant ./Qwen3.5-27B-TQ8 ./Qwen3.5-27B-fp16

Hardware Requirements

  • Apple Silicon Mac (M1 Pro+ recommended)
  • 64 GB unified memory minimum (29 GB model + KV cache + overhead)
  • macOS 14+

Original Model

This is a quantized version of Qwen/Qwen3.5-27B by Alibaba Cloud. The original model is released under the Apache 2.0 License. All original model terms and conditions apply.

Quantization

Quantization performed by Eugene Kovshilovsky using turboquant-mlx-core (MIT License).
