Upload W4A16 compressed-tensors quantization of Huihui-Qwen3.5-27B-abliterated

1bec8e2 verified 3 months ago

1.84 kB

	---
	license: other
	base_model: huihui-ai/Huihui-Qwen3.5-27B-abliterated
	tags:
	- qwen3.5
	- quantized
	- w4a16
	- compressed-tensors
	- abliterated
	- vllm
	model_type: qwen3_5
	quantized_by: j-a-a-a-y
	---

	# Huihui-Qwen3.5-27B-abliterated — W4A16 (compressed-tensors)

	4-bit weight quantization of [huihui-ai/Huihui-Qwen3.5-27B-abliterated](https://huggingface.co/huihui-ai/Huihui-Qwen3.5-27B-abliterated) using [llm-compressor](https://github.com/vllm-project/llm-compressor) W4A16 scheme.

	## Key specs

	\| Property \| Value \|
	\|---\|---\|
	\| Base model \| Qwen3.5-27B (abliterated) \|
	\| Quantization \| W4A16 (4-bit weights, 16-bit activations) \|
	\| Format \| compressed-tensors (vLLM native) \|
	\| Size on disk \| 17.6 GB \|
	\| GPU VRAM \| ~16 GB (fits RTX 5090 32GB with MTP + KV cache) \|
	\| Calibration \| 128 samples from Pile validation set \|

	## Usage with vLLM

	```bash
	python -m vllm.entrypoints.openai.api_server \
	--model j-a-a-a-y/Huihui-Qwen3.5-27B-abliterated-W4A16-compressed-tensors \
	--served-model-name qwen3.5-27b \
	--dtype float16 \
	--max-model-len 4096 \
	--speculative-config '{"method": "mtp", "num_speculative_tokens": 5}' \
	--performance-mode interactivity
	```

	## Benchmarks (RTX 5090, 32GB)

	Benchmarked with MTP=5 speculative decoding + CUDA graphs + interactivity mode:

	\| Metric \| Value \|
	\|---\|---\|
	\| Single request (256 tok) \| ~149 tok/s \|
	\| Single request (512 tok) \| ~131 tok/s \|
	\| Batch=4 aggregate \| ~410 tok/s \|
	\| MTP acceptance rate \| 50% \|

	## Quantization details

	Quantized with vLLM's llm-compressor using the W4A16 scheme:
	- Per-group symmetric quantization (group_size=128)
	- Activation-aware calibration (128 samples, max_length=512)
	- lm_head kept at full precision

	Compared to GPTQ W4A16: ~2 GB smaller on disk, ~1.5 GB less VRAM, same inference speed (both use Marlin kernel).