j-a-a-a-y's picture
Upload W4A16 compressed-tensors quantization of Huihui-Qwen3.5-27B-abliterated
1bec8e2 verified
|
Raw
History Blame Contribute Delete
1.84 kB
metadata
license: other
base_model: huihui-ai/Huihui-Qwen3.5-27B-abliterated
tags:
  - qwen3.5
  - quantized
  - w4a16
  - compressed-tensors
  - abliterated
  - vllm
model_type: qwen3_5
quantized_by: j-a-a-a-y

Huihui-Qwen3.5-27B-abliterated — W4A16 (compressed-tensors)

4-bit weight quantization of huihui-ai/Huihui-Qwen3.5-27B-abliterated using llm-compressor W4A16 scheme.

Key specs

Property Value
Base model Qwen3.5-27B (abliterated)
Quantization W4A16 (4-bit weights, 16-bit activations)
Format compressed-tensors (vLLM native)
Size on disk 17.6 GB
GPU VRAM ~16 GB (fits RTX 5090 32GB with MTP + KV cache)
Calibration 128 samples from Pile validation set

Usage with vLLM

python -m vllm.entrypoints.openai.api_server \
    --model j-a-a-a-y/Huihui-Qwen3.5-27B-abliterated-W4A16-compressed-tensors \
    --served-model-name qwen3.5-27b \
    --dtype float16 \
    --max-model-len 4096 \
    --speculative-config '{"method": "mtp", "num_speculative_tokens": 5}' \
    --performance-mode interactivity

Benchmarks (RTX 5090, 32GB)

Benchmarked with MTP=5 speculative decoding + CUDA graphs + interactivity mode:

Metric Value
Single request (256 tok) ~149 tok/s
Single request (512 tok) ~131 tok/s
Batch=4 aggregate ~410 tok/s
MTP acceptance rate 50%

Quantization details

Quantized with vLLM's llm-compressor using the W4A16 scheme:

  • Per-group symmetric quantization (group_size=128)
  • Activation-aware calibration (128 samples, max_length=512)
  • lm_head kept at full precision

Compared to GPTQ W4A16: ~2 GB smaller on disk, ~1.5 GB less VRAM, same inference speed (both use Marlin kernel).