File size: 1,838 Bytes
1bec8e2
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
---
license: other
base_model: huihui-ai/Huihui-Qwen3.5-27B-abliterated
tags:
  - qwen3.5
  - quantized
  - w4a16
  - compressed-tensors
  - abliterated
  - vllm
model_type: qwen3_5
quantized_by: j-a-a-a-y
---

# Huihui-Qwen3.5-27B-abliterated — W4A16 (compressed-tensors)

4-bit weight quantization of [huihui-ai/Huihui-Qwen3.5-27B-abliterated](https://huggingface.co/huihui-ai/Huihui-Qwen3.5-27B-abliterated) using [llm-compressor](https://github.com/vllm-project/llm-compressor) W4A16 scheme.

## Key specs

| Property | Value |
|---|---|
| Base model | Qwen3.5-27B (abliterated) |
| Quantization | W4A16 (4-bit weights, 16-bit activations) |
| Format | compressed-tensors (vLLM native) |
| Size on disk | 17.6 GB |
| GPU VRAM | ~16 GB (fits RTX 5090 32GB with MTP + KV cache) |
| Calibration | 128 samples from Pile validation set |

## Usage with vLLM

```bash
python -m vllm.entrypoints.openai.api_server \
    --model j-a-a-a-y/Huihui-Qwen3.5-27B-abliterated-W4A16-compressed-tensors \
    --served-model-name qwen3.5-27b \
    --dtype float16 \
    --max-model-len 4096 \
    --speculative-config '{"method": "mtp", "num_speculative_tokens": 5}' \
    --performance-mode interactivity
```

## Benchmarks (RTX 5090, 32GB)

Benchmarked with MTP=5 speculative decoding + CUDA graphs + interactivity mode:

| Metric | Value |
|---|---|
| Single request (256 tok) | ~149 tok/s |
| Single request (512 tok) | ~131 tok/s |
| Batch=4 aggregate | ~410 tok/s |
| MTP acceptance rate | 50% |

## Quantization details

Quantized with vLLM's llm-compressor using the W4A16 scheme:
- Per-group symmetric quantization (group_size=128)
- Activation-aware calibration (128 samples, max_length=512)
- lm_head kept at full precision

Compared to GPTQ W4A16: ~2 GB smaller on disk, ~1.5 GB less VRAM, same inference speed (both use Marlin kernel).