---
license: other
base_model: huihui-ai/Huihui-Qwen3.5-27B-abliterated
tags:
- qwen3.5
- quantized
- w4a16
- compressed-tensors
- abliterated
- vllm
model_type: qwen3_5
quantized_by: j-a-a-a-y
---

# Huihui-Qwen3.5-27B-abliterated: W4A16 (compressed-tensors)

4-bit weight quantization of [huihui-ai/Huihui-Qwen3.5-27B-abliterated](https://huggingface.co/huihui-ai/Huihui-Qwen3.5-27B-abliterated), produced with the W4A16 scheme of [llm-compressor](https://github.com/vllm-project/llm-compressor).

## Key specs

| Property | Value |
|---|---|
| Base model | Qwen3.5-27B (abliterated) |
| Quantization | W4A16 (4-bit weights, 16-bit activations) |
| Format | compressed-tensors (vLLM native) |
| Size on disk | 17.6 GB |
| GPU VRAM | ~16 GB (fits RTX 5090 32GB with MTP + KV cache) |
| Calibration | 128 samples from Pile validation set |

## Usage with vLLM

```bash
python -m vllm.entrypoints.openai.api_server \
  --model j-a-a-a-y/Huihui-Qwen3.5-27B-abliterated-W4A16-compressed-tensors \
  --served-model-name qwen3.5-27b \
  --dtype float16 \
  --max-model-len 4096 \
  --speculative-config '{"method": "mtp", "num_speculative_tokens": 5}' \
  --performance-mode interactivity
```

## Benchmarks (RTX 5090, 32GB)

Benchmarked with MTP=5 speculative decoding + CUDA graphs + interactivity mode:

| Metric | Value |
|---|---|
| Single request (256 tok) | ~149 tok/s |
| Single request (512 tok) | ~131 tok/s |
| Batch=4 aggregate | ~410 tok/s |
| MTP acceptance rate | 50% |

## Quantization details

Quantized with vLLM's llm-compressor using the W4A16 scheme:

- Per-group symmetric quantization (group_size=128)
- Activation-aware calibration (128 samples, max_length=512)
- lm_head kept at full precision

Compared to GPTQ W4A16: ~2 GB smaller on disk, ~1.5 GB less VRAM, same inference speed (both use the Marlin kernel).
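
To illustrate what "per-group symmetric quantization (group_size=128)" means, here is a minimal pure-Python sketch. It is not the llm-compressor implementation: it uses plain round-to-nearest with no activation-aware calibration, and the function names, the choice of 7 as the scaling maximum, and the sample data are all assumptions made for illustration.

```python
import math
from typing import List, Tuple

GROUP_SIZE = 128  # group size stated in this model card
QMIN, QMAX = -8, 7  # signed 4-bit integer range (assumed representation)


def quantize_group(weights: List[float]) -> Tuple[List[int], float]:
    """Quantize one group of weights to 4-bit integers sharing a single scale.

    Symmetric means there is no zero-point: the value 0.0 always maps to 0.
    """
    max_abs = max(abs(w) for w in weights)
    # Map the largest magnitude in the group onto QMAX (illustrative choice).
    scale = max_abs / QMAX if max_abs > 0 else 1.0
    q = [max(QMIN, min(QMAX, round(w / scale))) for w in weights]
    return q, scale


def dequantize_group(q: List[int], scale: float) -> List[float]:
    """Recover approximate weights from the integers and the group scale."""
    return [v * scale for v in q]


def quantize_row(row: List[float], group_size: int = GROUP_SIZE):
    """Split one weight row into consecutive groups and quantize each one."""
    if len(row) % group_size != 0:
        raise ValueError("row length must be a multiple of group_size")
    return [
        quantize_group(row[i:i + group_size])
        for i in range(0, len(row), group_size)
    ]


# Hypothetical weight row: two groups with different magnitudes.
row = [math.sin(0.1 * i) * (1 + i // GROUP_SIZE) for i in range(256)]
groups = quantize_row(row)

# Reconstruct the row and measure the worst-case rounding error.
restored = [w for q, s in groups for w in dequantize_group(q, s)]
max_err = max(abs(a - b) for a, b in zip(row, restored))
print(f"groups: {len(groups)}, max abs error: {max_err:.4f}")
```

Because each group of 128 weights carries its own scale, a group with small values is not forced to share the coarse step size of a group with large ones. Assuming the scales are stored as 16-bit values, the overhead is about 16/128 = 0.125 extra bits per weight on top of the 4-bit integers.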