---
title: Qwen36-27B-GPTQ-Pro-4Bit
tags:
- text-generation-inference
- transformers
- qwen
- gptq
- marlin
- foem
license: apache-2.0
language:
- en
base_model:
- Qwen/Qwen3.6-27B
---

![Qwen36-27B-GPTQ-Pro-4Bit Banner](bano.png)

# 🚀 Qwen36-27B-GPTQ-Pro-4Bit

Welcome to **Qwen36-27B-GPTQ-Pro-4Bit**: a titan of reasoning and generation, squeezed into an efficient 4-bit package. It punches well above its weight class while keeping your VRAM happy and your inference fast. Thank you, Qwen team, for another amazing model.

## 🌟 Why the "Pro"?

This isn't your average quantization. We used the **GPTQ-Pro** framework combined with the **FOEM** (First-Order Error Metric) approach. This technique aims to preserve the most critical weights during 4-bit compression by accounting for the effect of quantization error on the model's loss. The result?

- **Near-Lossless Performance**: Keep the reasoning, coding ability, and knowledge of a 27-billion-parameter model with a drastically reduced memory footprint.
- **Marlin Optimized**: Ready out of the box for Marlin kernels, for high tokens-per-second throughput in serving engines like vLLM.
- **Consumer Hardware Friendly**: Fit a 27B model on consumer GPUs with room to spare for long context lengths.

This repository contains a 4-bit GPTQ-Pro quantization of `unsloth/Qwen3.6-27B`, produced with GPTQModel and the FOEM/GPTAQ-style quality settings used in the `GPTQ-Pro` project.

Source project: https://github.com/groxaxo/GPTQ-Pro

## Deployment

### vLLM

```bash
CUDA_VISIBLE_DEVICES=0,1 vllm serve groxaxo/Qwen3.6-27B-GPTQ-Pro-4Bit \
  --dtype float16 \
  --quantization gptq_marlin \
  --disable-custom-all-reduce \
  --tensor-parallel-size 2 \
  --max-model-len 132144 \
  --reasoning-parser qwen3 \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder \
  --gpu-memory-utilization 0.92
```

### Local path

```bash
CUDA_VISIBLE_DEVICES=0,1 vllm serve /path/to/Qwen3.6-27B-GPTQ-Pro-4Bit \
  --dtype float16 \
  --quantization gptq_marlin \
  --disable-custom-all-reduce \
  --tensor-parallel-size 2 \
  --max-model-len 132144
```

### Transformers

```python
from gptqmodel import BACKEND, GPTQModel

# Load the quantized checkpoint with the Marlin backend on a single GPU.
model = GPTQModel.load(
    "groxaxo/Qwen3.6-27B-GPTQ-Pro-4Bit",
    backend=BACKEND.GPTQ_MARLIN,
    device="cuda:0",
)

# generate() returns token IDs, so decode them before printing.
tokens = model.generate("Write a short deployment checklist.", max_new_tokens=64)[0]
print(model.tokenizer.decode(tokens))
```

## Notes

- Tested with tensor parallel size 2 on RTX 3090 GPUs.
- Use `float16` and `gptq_marlin` for the most reliable vLLM startup path.
- The quantization and serving workflow lives in the `GPTQ-Pro` repository above.
- vLLM detects MTP/speculative decoding for this model, but on 2× RTX 3090 a launch with exactly `--max-model-len 262144` runs out of memory during KV-cache setup.
- The working local vLLM configuration I verified with MTP is `--max-model-len 65536` with `--enforce-eager`. It starts and serves, but the metrics showed `spec_decode_num_accepted_tokens_total=0`, so speculative decoding does not improve speed yet.
- If you test MTP, use `--speculative-config '{"method":"mtp","num_speculative_tokens":2}'` and disable thinking in the request payload when you want a plain answer.

## ⚡ Speed Benchmarks

Tested on **2× NVIDIA RTX 3090** with vLLM (gptq_marlin, tensor-parallel=2, float16).

| Metric | Value |
|---|---|
| **Avg Generation Speed** | 64.0 tok/s |
| **Median Generation Speed** | 64.0 tok/s |
| **Peak Generation Speed** | 65.0 tok/s |
| **Avg Time-to-First-Token** | 54 ms |
| **Median TTFT** | 56 ms |

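The benchmark client used for these numbers is not included here. As a minimal sketch of how time-to-first-token and generation speed are commonly measured on a streaming response, the helper below times any iterable of chunks. The function name `measure_stream`, the returned keys, and the choice to compute decode speed as `(tokens - 1) / (total - TTFT)` are assumptions for illustration, not the author's script. It also assumes each chunk corresponds to one generated token, which is not guaranteed for every server or client.

```python
import time
from typing import Callable, Iterable, Optional


def measure_stream(
    chunks: Iterable[str],
    clock: Callable[[], float] = time.perf_counter,
) -> dict:
    """Time a token stream and report TTFT, total time, and decode speed.

    `chunks` is any iterable that yields one generated token (or text delta)
    at a time. `clock` is injectable so the function can be tested without
    real waiting.
    """
    start = clock()
    first: Optional[float] = None  # arrival time of the first chunk
    last = start                   # arrival time of the most recent chunk
    count = 0

    for _ in chunks:
        now = clock()
        if first is None:
            first = now
        last = now
        count += 1

    if first is None:
        raise ValueError("stream produced no chunks")

    decode_time = last - first
    return {
        "ttft_ms": (first - start) * 1000.0,
        "tokens": count,
        "total_s": last - start,
        # Speed over the decode phase only: the first token is attributed
        # to prefill (TTFT), so it is excluded from the numerator.
        "decode_tok_s": (count - 1) / decode_time if decode_time > 0 else float("nan"),
    }
```

To use it against a running server, pass the iterator of text deltas from a streaming request to vLLM's OpenAI-compatible endpoint. Start the request lazily inside the iterable so the clock starts before the request is sent.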
**📋 Detailed Run Results**

### Test 1: Short Prompt → 256 Tokens (Streaming)

| Run | TTFT | Tokens | Speed | Total Time |
|---|---|---|---|---|
| 1 | 60 ms | 256 | 64.0 tok/s | 4.04s |
| 2 | 55 ms | 256 | 64.0 tok/s | 4.04s |
| 3 | 56 ms | 256 | 62.4 tok/s | 4.14s |

### Test 2: Medium Prompt → 512 Tokens (Non-Streaming)

| Run | Tokens | Speed | Total Time |
|---|---|---|---|
| 1 | 512 | 62.9 tok/s | 8.15s |
| 2 | 512 | 63.0 tok/s | 8.13s |
| 3 | 512 | 62.9 tok/s | 8.14s |

### Test 3: Short Burst → 64 Tokens (Streaming)

| Run | TTFT | Tokens | Speed |
|---|---|---|---|
| 1 | 50 ms | 64 | 65.0 tok/s |
| 2 | 56 ms | 64 | 64.9 tok/s |
| 3 | 56 ms | 64 | 64.7 tok/s |
| 4 | 54 ms | 64 | 64.9 tok/s |
| 5 | 48 ms | 64 | 64.9 tok/s |

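The summary table at the top of this section appears to pool all 11 runs for speed and the 8 streaming runs for TTFT. The sketch below re-derives the summary from the per-run numbers under that assumption; the pooling rule and the variable names are mine, not stated by the benchmark.

```python
from statistics import mean, median

# Per-run speeds (tok/s) copied from the three tests above.
speeds_tok_s = [
    64.0, 64.0, 62.4,              # Test 1
    62.9, 63.0, 62.9,              # Test 2
    65.0, 64.9, 64.7, 64.9, 64.9,  # Test 3
]

# TTFT (ms) is only reported for the streaming tests (1 and 3).
ttft_ms = [60, 55, 56, 50, 56, 56, 54, 48]

summary = {
    "avg_speed": round(mean(speeds_tok_s), 1),
    "median_speed": round(median(speeds_tok_s), 1),
    "peak_speed": max(speeds_tok_s),
    "avg_ttft_ms": mean(ttft_ms),
    "median_ttft_ms": median(ttft_ms),
}
print(summary)
```

Under this pooling the speeds come out at 64.0 average, 64.0 median, and 65.0 peak. The TTFT mean is 54.375 ms and the median is 55.5 ms, which match the reported 54 ms and 56 ms after rounding to whole milliseconds.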
## 📊 Quality Evaluation

- Wikitext-2 test perplexity: **6.366** (n_ctx=1024)
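
The evaluation script behind this number is not part of this card. For readers unfamiliar with the metric, here is a minimal sketch of windowed perplexity, assuming non-overlapping windows of `n_ctx` tokens and natural-log likelihoods. `windowed_perplexity` and the `nll_fn` callback are hypothetical names; a real run would implement `nll_fn` with a forward pass of the model, and tools differ in how they window and score tokens.

```python
import math
from typing import Callable, List, Sequence


def windowed_perplexity(
    token_ids: Sequence[int],
    nll_fn: Callable[[Sequence[int]], List[float]],
    n_ctx: int = 1024,
) -> float:
    """Perplexity over non-overlapping windows of `n_ctx` tokens.

    `nll_fn` is a hypothetical callback: given one window of token IDs, it
    returns the negative log-likelihood (natural log) of each token the
    model predicted inside that window.
    """
    total_nll = 0.0
    total_scored = 0

    for start in range(0, len(token_ids), n_ctx):
        window = token_ids[start:start + n_ctx]
        if len(window) < 2:
            continue  # a single token has no prediction target
        nlls = nll_fn(window)
        total_nll += sum(nlls)
        total_scored += len(nlls)

    if total_scored == 0:
        raise ValueError("no tokens were scored")

    # Perplexity is the exponential of the mean per-token NLL.
    return math.exp(total_nll / total_scored)
```

Read in the other direction, a perplexity of 6.366 corresponds to a mean negative log-likelihood of ln(6.366) ≈ 1.85 nats per token.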