🚀 Qwen36-27B-GPTQ-Pro-4Bit
Welcome to Qwen36-27B-GPTQ-Pro-4Bit, a titan of reasoning and generation squeezed into an efficient 4-bit package. It punches well above its weight class while keeping your VRAM happy and your inference fast. Thank you, Qwen team, for another amazing model.
🌟 Why the "Pro"?
This isn't your average quantization. We used the GPTQ-Pro framework combined with the FOEM (First-Order Error Matters) approach. During 4-bit compression, FOEM compensates for quantization error by also accounting for first-order terms in the estimated effect on the model's loss, which helps preserve the weights that matter most.
The result?
- Near-Lossless Performance: Enjoy the reasoning, coding ability, and broad knowledge of a 27-billion-parameter model with a drastically reduced memory footprint.
- Marlin Optimized: Ready out of the box for Marlin kernels to deliver maximum tokens-per-second throughput in serving engines like vLLM.
- Consumer Hardware Friendly: Fit a 27B model on consumer GPUs with room to spare for long context lengths.
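As a rough back-of-envelope check on the memory claim, the sketch below estimates weight storage only. It assumes a nominal 27e9 parameters all stored at the stated bit width, and it ignores GPTQ group scales and zero points, any layers left unquantized, the KV cache, and activations, so real usage will be higher. `weight_gb` is a hypothetical helper name.

```python
def weight_gb(num_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


NUM_PARAMS = 27e9  # assumption: nominal parameter count implied by "27B"

fp16_gb = weight_gb(NUM_PARAMS, 16)
int4_gb = weight_gb(NUM_PARAMS, 4)

print(f"FP16 weights:  ~{fp16_gb:.1f} GB")
print(f"4-bit weights: ~{int4_gb:.1f} GB")
print(f"Reduction:     ~{fp16_gb / int4_gb:.0f}x")
```

Under these assumptions the weights shrink from roughly 54 GB to roughly 13.5 GB, which is what leaves headroom for the KV cache on 24 GB consumer cards.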
This repository contains a 4-bit GPTQ-Pro quantization of unsloth/Qwen3.6-27B, produced with GPTQModel and the FOEM/GPTAQ-style quality settings used in the GPTQ-Pro project.
Source project: https://github.com/groxaxo/GPTQ-Pro
Deployment
vLLM
CUDA_VISIBLE_DEVICES=0,1 vllm serve groxaxo/Qwen3.6-27B-GPTQ-Pro-4Bit \
--dtype float16 \
--quantization gptq_marlin \
--disable-custom-all-reduce \
--tensor-parallel-size 2 \
--max-model-len 132144 \
--reasoning-parser qwen3 \
--enable-auto-tool-choice \
--tool-call-parser qwen3_coder \
--gpu-memory-utilization 0.92
Local path
CUDA_VISIBLE_DEVICES=0,1 vllm serve /path/to/Qwen3.6-27B-GPTQ-Pro-4Bit \
--dtype float16 \
--quantization gptq_marlin \
--disable-custom-all-reduce \
--tensor-parallel-size 2 \
--max-model-len 132144
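Once either command above is serving, the model can be queried through vLLM's OpenAI-compatible chat completions endpoint. The following is a minimal client sketch using only the Python standard library. It assumes the default `http://localhost:8000` address (no `--port` is set above) and that the served model name equals the repository ID; `build_chat_request` and `ask` are hypothetical helper names. The `chat_template_kwargs` field with `enable_thinking` is an assumption about how thinking is switched off for Qwen3-style chat templates, so check it against your vLLM version.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # assumption: vLLM default host and port
MODEL = "groxaxo/Qwen3.6-27B-GPTQ-Pro-4Bit"  # for a local path, use that path instead


def build_chat_request(prompt: str, max_tokens: int = 256,
                       enable_thinking: bool = True) -> urllib.request.Request:
    """Build a POST request for the /v1/chat/completions endpoint."""
    payload = {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    if not enable_thinking:
        # Assumption: the chat template honours an `enable_thinking` flag.
        payload["chat_template_kwargs"] = {"enable_thinking": False}
    return urllib.request.Request(
        f"{BASE_URL}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt: str, **kwargs) -> str:
    """Send the prompt to the running server and return the reply text."""
    with urllib.request.urlopen(build_chat_request(prompt, **kwargs)) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

With the server running, `print(ask("Write a short deployment checklist.", enable_thinking=False))` sends one request and prints the reply.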
Transformers
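from gptqmodel import BACKEND, GPTQModel

# Load the quantized checkpoint with the Marlin kernel on the first GPU.
model = GPTQModel.load(
    "groxaxo/Qwen3.6-27B-GPTQ-Pro-4Bit",
    backend=BACKEND.GPTQ_MARLIN,
    device="cuda:0",
)

# generate() returns token IDs; decode them to get readable text.
tokens = model.generate("Write a short deployment checklist.", max_new_tokens=64)[0]
print(model.tokenizer.decode(tokens))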
Notes
- Tested with tensor parallel size 2 on RTX 3090 GPUs.
- Use `float16` and `gptq_marlin` for the most reliable vLLM startup path.
- The quantization and serving workflow lives in the `GPTQ-Pro` repository above.
- MTP/speculative decoding is detected by vLLM for this model, but on 2x RTX 3090 the exact `--max-model-len 262144` launch OOMs during KV-cache setup.
- The working local vLLM configuration I verified is `--max-model-len 65536` with `--enforce-eager`; that starts and serves, but the current metrics showed `spec_decode_num_accepted_tokens_total=0`, so it does not improve speed yet.
- If you test MTP, use `--speculative-config '{"method":"mtp","num_speculative_tokens":2}'` and disable thinking in the request payload when you want a plain answer.
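To check whether speculative decoding is accepting any draft tokens, the counter mentioned above can be read from the server's Prometheus-format metrics text. The sketch below sums every sample whose metric name ends with `spec_decode_num_accepted_tokens_total`. It assumes the metrics are exposed at `/metrics` on the default port 8000 and that samples carry no trailing timestamp; the sample text is made up for illustration and is not real server output. `accepted_tokens` and `fetch_metrics` are hypothetical helper names.

```python
import urllib.request

METRIC_SUFFIX = "spec_decode_num_accepted_tokens_total"


def accepted_tokens(metrics_text: str) -> float:
    """Sum all samples of the accepted-token counter in Prometheus text format."""
    total = 0.0
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks, HELP and TYPE lines
            continue
        # Split off the value from the right so spaces inside labels are safe.
        name_and_labels, _, value = line.rpartition(" ")
        metric_name = name_and_labels.split("{", 1)[0]
        if metric_name.endswith(METRIC_SUFFIX):
            total += float(value)
    return total


def fetch_metrics(url: str = "http://localhost:8000/metrics") -> str:
    """Download the metrics page (assumed URL; requires a running server)."""
    with urllib.request.urlopen(url) as resp:
        return resp.read().decode("utf-8")


# Illustrative sample only; a real server reports its own labels and values.
sample = """# TYPE vllm:spec_decode_num_accepted_tokens_total counter
vllm:spec_decode_num_accepted_tokens_total{model_name="demo"} 0.0
vllm:generation_tokens_total{model_name="demo"} 512.0
"""
print(accepted_tokens(sample))
```

Against a live server, `accepted_tokens(fetch_metrics())` gives the current count. A value that stays at 0 while requests are being served matches the behaviour described above: no draft tokens are accepted, so there is no speed-up.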
⚡ Speed Benchmarks
Tested on 2× NVIDIA RTX 3090 with vLLM (gptq_marlin, tensor-parallel=2, float16).
| Metric | Value |
|---|---|
| Avg Generation Speed | 64.0 tok/s |
| Median Generation Speed | 64.0 tok/s |
| Peak Generation Speed | 65.0 tok/s |
| Avg Time-to-First-Token | 54 ms |
| Median TTFT | 56 ms |
📋 Detailed Run Results
Test 1: Short Prompt → 256 Tokens (Streaming)
| Run | TTFT | Tokens | Speed | Total Time |
|---|---|---|---|---|
| 1 | 60 ms | 256 | 64.0 tok/s | 4.04s |
| 2 | 55 ms | 256 | 64.0 tok/s | 4.04s |
| 3 | 56 ms | 256 | 62.4 tok/s | 4.14s |
Test 2: Medium Prompt → 512 Tokens (Non-Streaming)
| Run | Tokens | Speed | Total Time |
|---|---|---|---|
| 1 | 512 | 62.9 tok/s | 8.15s |
| 2 | 512 | 63.0 tok/s | 8.13s |
| 3 | 512 | 62.9 tok/s | 8.14s |
Test 3: Short Burst → 64 Tokens (Streaming)
| Run | TTFT | Tokens | Speed |
|---|---|---|---|
| 1 | 50 ms | 64 | 65.0 tok/s |
| 2 | 56 ms | 64 | 64.9 tok/s |
| 3 | 56 ms | 64 | 64.7 tok/s |
| 4 | 54 ms | 64 | 64.9 tok/s |
| 5 | 48 ms | 64 | 64.9 tok/s |
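The summary table can be cross-checked against the per-run results. The sketch below pools the generation speeds from all three tests and the TTFT values from the two streaming tests; that pooling is an assumption about how the summary was computed, not something stated above.

```python
from statistics import mean, median

# Per-run generation speeds (tok/s) copied from Tests 1, 2, and 3.
speeds = [64.0, 64.0, 62.4,
          62.9, 63.0, 62.9,
          65.0, 64.9, 64.7, 64.9, 64.9]

# Per-run time-to-first-token (ms) from the streaming tests (1 and 3).
ttft_ms = [60, 55, 56,
           50, 56, 56, 54, 48]

print(f"avg speed:    {mean(speeds):.1f} tok/s")
print(f"median speed: {median(speeds):.1f} tok/s")
print(f"peak speed:   {max(speeds):.1f} tok/s")
print(f"avg TTFT:     {mean(ttft_ms):.1f} ms")
print(f"median TTFT:  {median(ttft_ms):.1f} ms")
```

Under that assumption the pooled values agree with the summary table after rounding. The median TTFT comes out at 55.5 ms, which the table reports as 56 ms.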
📊 Quality Evaluation
- Wikitext-2 test perplexity: 6.366 (n_ctx=1024)
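For readers unfamiliar with the metric, perplexity is the exponential of the mean per-token negative log-likelihood, so lower is better. The sketch below shows that definition on made-up log-probabilities. It is not the evaluation script used for the number above, and `perplexity` is a hypothetical helper name.

```python
import math


def perplexity(token_logprobs: list[float]) -> float:
    """exp of the mean negative natural-log probability per token."""
    nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(nll)


# Toy input: a model that assigns probability 1/4 to every token.
toy = [math.log(0.25)] * 8
print(perplexity(toy))

# Mean negative log-likelihood (nats per token) implied by the reported score.
print(math.log(6.366))
```

Read this way, a perplexity of 6.366 corresponds to a mean loss of about 1.85 nats per token at the stated context length.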
Model tree for groxaxo/Qwen3.6-27B-GPTQ-Pro-4bit
Base model
Qwen/Qwen3.6-27B