File size: 1,838 Bytes
1bec8e2 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 | ---
license: other
base_model: huihui-ai/Huihui-Qwen3.5-27B-abliterated
tags:
- qwen3.5
- quantized
- w4a16
- compressed-tensors
- abliterated
- vllm
model_type: qwen3_5
quantized_by: j-a-a-a-y
---
# Huihui-Qwen3.5-27B-abliterated — W4A16 (compressed-tensors)
4-bit weight quantization of [huihui-ai/Huihui-Qwen3.5-27B-abliterated](https://huggingface.co/huihui-ai/Huihui-Qwen3.5-27B-abliterated) using [llm-compressor](https://github.com/vllm-project/llm-compressor) W4A16 scheme.
## Key specs
| Property | Value |
|---|---|
| Base model | Qwen3.5-27B (abliterated) |
| Quantization | W4A16 (4-bit weights, 16-bit activations) |
| Format | compressed-tensors (vLLM native) |
| Size on disk | 17.6 GB |
| GPU VRAM | ~16 GB (fits RTX 5090 32GB with MTP + KV cache) |
| Calibration | 128 samples from Pile validation set |
## Usage with vLLM
```bash
python -m vllm.entrypoints.openai.api_server \
--model j-a-a-a-y/Huihui-Qwen3.5-27B-abliterated-W4A16-compressed-tensors \
--served-model-name qwen3.5-27b \
--dtype float16 \
--max-model-len 4096 \
--speculative-config '{"method": "mtp", "num_speculative_tokens": 5}' \
--performance-mode interactivity
```
## Benchmarks (RTX 5090, 32GB)
Benchmarked with MTP=5 speculative decoding + CUDA graphs + interactivity mode:
| Metric | Value |
|---|---|
| Single request (256 tok) | ~149 tok/s |
| Single request (512 tok) | ~131 tok/s |
| Batch=4 aggregate | ~410 tok/s |
| MTP acceptance rate | 50% |
## Quantization details
Quantized with vLLM's llm-compressor using the W4A16 scheme:
- Per-group symmetric quantization (group_size=128)
- Activation-aware calibration (128 samples, max_length=512)
- lm_head kept at full precision
Compared to GPTQ W4A16: ~2 GB smaller on disk, ~1.5 GB less VRAM, same inference speed (both use Marlin kernel).
|