Upload W4A16 compressed-tensors quantization of Huihui-Qwen3.5-27B-abliterated

1bec8e2 verified 3 months ago

1.84 kB

license: other
base_model: huihui-ai/Huihui-Qwen3.5-27B-abliterated
tags:
  - qwen3.5
  - quantized
  - w4a16
  - compressed-tensors
  - abliterated
  - vllm
model_type: qwen3_5
quantized_by: j-a-a-a-y

Huihui-Qwen3.5-27B-abliterated — W4A16 (compressed-tensors)

4-bit weight quantization of huihui-ai/Huihui-Qwen3.5-27B-abliterated using llm-compressor W4A16 scheme.

Key specs

Property	Value
Base model	Qwen3.5-27B (abliterated)
Quantization	W4A16 (4-bit weights, 16-bit activations)
Format	compressed-tensors (vLLM native)
Size on disk	17.6 GB
GPU VRAM	~16 GB (fits RTX 5090 32GB with MTP + KV cache)
Calibration	128 samples from Pile validation set

Usage with vLLM

python -m vllm.entrypoints.openai.api_server \
    --model j-a-a-a-y/Huihui-Qwen3.5-27B-abliterated-W4A16-compressed-tensors \
    --served-model-name qwen3.5-27b \
    --dtype float16 \
    --max-model-len 4096 \
    --speculative-config '{"method": "mtp", "num_speculative_tokens": 5}' \
    --performance-mode interactivity

Benchmarks (RTX 5090, 32GB)

Benchmarked with MTP=5 speculative decoding + CUDA graphs + interactivity mode:

Metric	Value
Single request (256 tok)	~149 tok/s
Single request (512 tok)	~131 tok/s
Batch=4 aggregate	~410 tok/s
MTP acceptance rate	50%

Quantization details

Quantized with vLLM's llm-compressor using the W4A16 scheme:

Per-group symmetric quantization (group_size=128)
Activation-aware calibration (128 samples, max_length=512)
lm_head kept at full precision

Compared to GPTQ W4A16: ~2 GB smaller on disk, ~1.5 GB less VRAM, same inference speed (both use Marlin kernel).