| --- |
| license: other |
| base_model: huihui-ai/Huihui-Qwen3.5-27B-abliterated |
| tags: |
| - qwen3.5 |
| - quantized |
| - w4a16 |
| - compressed-tensors |
| - abliterated |
| - vllm |
| model_type: qwen3_5 |
| quantized_by: j-a-a-a-y |
| --- |
| |
| # Huihui-Qwen3.5-27B-abliterated — W4A16 (compressed-tensors) |
|
|
| 4-bit weight quantization of [huihui-ai/Huihui-Qwen3.5-27B-abliterated](https://huggingface.co/huihui-ai/Huihui-Qwen3.5-27B-abliterated) using [llm-compressor](https://github.com/vllm-project/llm-compressor) W4A16 scheme. |
|
|
| ## Key specs |
|
|
| | Property | Value | |
| |---|---| |
| | Base model | Qwen3.5-27B (abliterated) | |
| | Quantization | W4A16 (4-bit weights, 16-bit activations) | |
| | Format | compressed-tensors (vLLM native) | |
| | Size on disk | 17.6 GB | |
| | GPU VRAM | ~16 GB (fits RTX 5090 32GB with MTP + KV cache) | |
| | Calibration | 128 samples from Pile validation set | |
|
|
| ## Usage with vLLM |
|
|
| ```bash |
| python -m vllm.entrypoints.openai.api_server \ |
| --model j-a-a-a-y/Huihui-Qwen3.5-27B-abliterated-W4A16-compressed-tensors \ |
| --served-model-name qwen3.5-27b \ |
| --dtype float16 \ |
| --max-model-len 4096 \ |
| --speculative-config '{"method": "mtp", "num_speculative_tokens": 5}' \ |
| --performance-mode interactivity |
| ``` |
|
|
| ## Benchmarks (RTX 5090, 32GB) |
|
|
| Benchmarked with MTP=5 speculative decoding + CUDA graphs + interactivity mode: |
|
|
| | Metric | Value | |
| |---|---| |
| | Single request (256 tok) | ~149 tok/s | |
| | Single request (512 tok) | ~131 tok/s | |
| | Batch=4 aggregate | ~410 tok/s | |
| | MTP acceptance rate | 50% | |
|
|
| ## Quantization details |
|
|
| Quantized with vLLM's llm-compressor using the W4A16 scheme: |
| - Per-group symmetric quantization (group_size=128) |
| - Activation-aware calibration (128 samples, max_length=512) |
| - lm_head kept at full precision |
| |
| Compared to GPTQ W4A16: ~2 GB smaller on disk, ~1.5 GB less VRAM, same inference speed (both use Marlin kernel). |
| |