---
title: Qwen36-27B-GPTQ-Pro-4Bit
tags:
- text-generation-inference
- transformers
- qwen
- gptq
- marlin
- foem
license: apache-2.0
language:
- en
base_model:
- Qwen/Qwen3.6-27B
---

![Qwen36-27B-GPTQ-Pro-4Bit Banner](bano.png)

# 🚀 Qwen36-27B-GPTQ-Pro-4Bit

Welcome to **Qwen36-27B-GPTQ-Pro-4Bit** – a titan of reasoning and generation, elegantly squeezed into a remarkably efficient 4-bit package.  It punches leagues above its weight class while keeping your VRAM happy and your inference speeds blazingly fast! Thank you  Qwen team for another amazing model.

## 🌟 Why the "Pro"?
This isn't your average quantization. We used the **GPTQ-Pro** framework combined with the **FOEM** (First-Order Error Metric) approach. This advanced technique carefully preserves the most critical weights during the 4-bit compression process by evaluating the exact impact of quantization on the model's loss landscape. 

The result?
- **Near-Lossless Performance**: Enjoy the profound reasoning, coding prowess, and vast knowledge of a 27 Billion parameter model, but with a drastically reduced memory footprint.
- **Marlin Optimized**: Ready out-of-the-box for Marlin kernels to deliver maximum token-per-second throughput in serving engines like vLLM.
- **Consumer Hardware Friendly**: Fit a massive 27B powerhouse model on consumer GPUs with room to spare for massive context lengths!

This repository contains a 4-bit GPTQ-Pro quantization of `unsloth/Qwen3.6-27B`, produced with GPTQModel and the FOEM/GPTAQ-style quality settings used in the `GPTQ-Pro` project.

Source project: https://github.com/groxaxo/GPTQ-Pro

## Deployment

### vLLM

```bash
CUDA_VISIBLE_DEVICES=0,1 vllm serve groxaxo/Qwen3.6-27B-GPTQ-Pro-4Bit \
  --dtype float16 \
  --quantization gptq_marlin \
  --disable-custom-all-reduce \
  --tensor-parallel-size 2 \
  --max-model-len 132144 \
  --reasoning-parser qwen3 \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder \
  --gpu-memory-utilization 0.92
```

### Local path

```bash
CUDA_VISIBLE_DEVICES=0,1 vllm serve /path/to/Qwen3.6-27B-GPTQ-Pro-4Bit \
  --dtype float16 \
  --quantization gptq_marlin \
  --disable-custom-all-reduce \
  --tensor-parallel-size 2 \
  --max-model-len 132144
```

### Transformers

```python
from gptqmodel import BACKEND, GPTQModel

model = GPTQModel.load(
    "groxaxo/Qwen3.6-27B-GPTQ-Pro-4Bit",
    backend=BACKEND.GPTQ_MARLIN,
    device="cuda:0",
)

print(model.generate("Write a short deployment checklist.", max_new_tokens=64)[0])
```

## Notes

- Tested with tensor parallel size 2 on RTX 3090 GPUs.
- Use `float16` and `gptq_marlin` for the most reliable vLLM startup path.
- The quantization and serving workflow lives in the `GPTQ-Pro` repository above.
- MTP/speculative decoding is detected by vLLM for this model, but on 2x RTX 3090 the exact `--max-model-len 262144` launch OOMs during KV-cache setup.
- The working local vLLM configuration I verified is `--max-model-len 65536` with `--enforce-eager`; that starts and serves, but the current metrics showed `spec_decode_num_accepted_tokens_total=0`, so it does not improve speed yet.
- If you test MTP, use `--speculative-config '{"method":"mtp","num_speculative_tokens":2}'` and disable thinking in the request payload when you want a plain answer.

## ⚡ Speed Benchmarks

Tested on **2× NVIDIA RTX 3090** with vLLM (gptq_marlin, tensor-parallel=2, float16).

| Metric | Value |
|---|---|
| **Avg Generation Speed** | 64.0 tok/s |
| **Median Generation Speed** | 64.0 tok/s |
| **Peak Generation Speed** | 65.0 tok/s |
| **Avg Time-to-First-Token** | 54 ms |
| **Median TTFT** | 56 ms |

<details>
<summary>📋 Detailed Run Results</summary>

### Test 1: Short Prompt → 256 Tokens (Streaming)
| Run | TTFT | Tokens | Speed | Total Time |
|---|---|---|---|---|
| 1 | 60 ms | 256 | 64.0 tok/s | 4.04s |
| 2 | 55 ms | 256 | 64.0 tok/s | 4.04s |
| 3 | 56 ms | 256 | 62.4 tok/s | 4.14s |

### Test 2: Medium Prompt → 512 Tokens (Non-Streaming)
| Run | Tokens | Speed | Total Time |
|---|---|---|---|
| 1 | 512 | 62.9 tok/s | 8.15s |
| 2 | 512 | 63.0 tok/s | 8.13s |
| 3 | 512 | 62.9 tok/s | 8.14s |

### Test 3: Short Burst → 64 Tokens (Streaming)
| Run | TTFT | Tokens | Speed |
|---|---|---|---|
| 1 | 50 ms | 64 | 65.0 tok/s |
| 2 | 56 ms | 64 | 64.9 tok/s |
| 3 | 56 ms | 64 | 64.7 tok/s |
| 4 | 54 ms | 64 | 64.9 tok/s |
| 5 | 48 ms | 64 | 64.9 tok/s |

</details>

## 📊 Quality Evaluation

-  Wikitext-2 test perplexity: **6.366** (n_ctx=1024)