Instructions to use Qwen/Qwen3.5-27B-GPTQ-Int4 with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use Qwen/Qwen3.5-27B-GPTQ-Int4 with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="Qwen/Qwen3.5-27B-GPTQ-Int4")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
pipe(text=messages)

# Load model directly
from transformers import AutoProcessor, AutoModelForMultimodalLM

processor = AutoProcessor.from_pretrained("Qwen/Qwen3.5-27B-GPTQ-Int4")
model = AutoModelForMultimodalLM.from_pretrained("Qwen/Qwen3.5-27B-GPTQ-Int4")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(processor.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Local Apps

vLLM

How to use Qwen/Qwen3.5-27B-GPTQ-Int4 with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "Qwen/Qwen3.5-27B-GPTQ-Int4"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "Qwen/Qwen3.5-27B-GPTQ-Int4",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Use Docker

docker model run hf.co/Qwen/Qwen3.5-27B-GPTQ-Int4

SGLang

How to use Qwen/Qwen3.5-27B-GPTQ-Int4 with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "Qwen/Qwen3.5-27B-GPTQ-Int4" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "Qwen/Qwen3.5-27B-GPTQ-Int4",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "Qwen/Qwen3.5-27B-GPTQ-Int4" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "Qwen/Qwen3.5-27B-GPTQ-Int4",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Docker Model Runner
How to use Qwen/Qwen3.5-27B-GPTQ-Int4 with Docker Model Runner:
```
docker model run hf.co/Qwen/Qwen3.5-27B-GPTQ-Int4
```

fix chat template to avoid empty historical `<think>` blocks

by latent-variable - opened Apr 8

base: refs/heads/main ← from: refs/pr/8


fix chat template to avoid empty historical `<think>` blocks (19d73e5a)

latent-variable • Apr 8 • edited Apr 11

This fixes a chat template issue where historical assistant turns are rendered with an empty `<think>...</think>` block when `reasoning_content` is empty.

That matters because these empty historical `<think>` blocks change the serialized prompt without adding any useful information.

The fix is a one-line change in the template:

from:

{%- if loop.index0 > ns.last_query_index %}

to:

{%- if loop.index0 > ns.last_query_index and reasoning_content %}
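To make the effect of the extra condition concrete, here is a minimal Python sketch of that branch. It is not the real template: `render_assistant_turn`, its parameters, and the bare `assistant` header are simplified stand-ins assumed for illustration, and the only thing modelled is the old condition versus the new one.

```python
def render_assistant_turn(content, reasoning_content, index, last_query_index, guard):
    """Simplified, hypothetical stand-in for the assistant branch of the template.

    guard=False mimics the original condition:
        loop.index0 > ns.last_query_index
    guard=True mimics the patched condition:
        loop.index0 > ns.last_query_index and reasoning_content
    """
    after_last_query = index > last_query_index
    if guard:
        # An empty string is falsy, matching Jinja's truthiness rules.
        wrap = after_last_query and bool(reasoning_content)
    else:
        wrap = after_last_query
    if wrap:
        return f"assistant\n<think>\n{reasoning_content}\n</think>\n\n{content}"
    return f"assistant\n{content}"


# A historical tool-call turn that carries no reasoning content.
turn = dict(
    content="<tool_call>...</tool_call>",
    reasoning_content="",
    index=2,
    last_query_index=1,
)
before = render_assistant_turn(**turn, guard=False)
after = render_assistant_turn(**turn, guard=True)

# A turn that does carry reasoning renders identically under both conditions.
kept_old = render_assistant_turn("answer", "plan", 2, 1, guard=False)
kept_new = render_assistant_turn("answer", "plan", 2, 1, guard=True)

print(repr(before))
print(repr(after))
```

Under these assumptions the guard only changes the output when `reasoning_content` is empty; turns with real reasoning keep their `<think>` wrapper.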

Why this is important:

- it reduces unnecessary prompt drift
- it improves prefix-cache reuse
- it avoids unnecessary cache misses
- it reduces extra token processing caused by equivalent histories rendering differently

In practice, this means less wasted compute and better cache stability, especially in longer multi-turn or tool-using conversations.
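The caching argument can be illustrated with a rough sketch. A prefix cache can only reuse work up to the first position where a new prompt differs from a previously processed one, so two renderings of the same logical history share only their common prefix. The example below is an assumption-laden toy: `common_prefix_len` is a hypothetical helper, the prompts are made up, and characters stand in for tokens (real servers cache at the token or block level).

```python
def common_prefix_len(a: str, b: str) -> int:
    """Return the number of leading characters that a and b share."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


# Made-up conversation pieces; only the assistant turn differs between renderings.
shared_head = (
    "system\nYou are a helpful assistant.\n"
    "user\nWhat is the weather in Paris?\n"
)
tail = "tool\n" + "sunny, 21 C\n" * 50 + "user\nAnd tomorrow?\n"

with_empty_think = (
    shared_head
    + "assistant\n<think>\n\n</think>\n\n<tool_call>...</tool_call>\n"
    + tail
)
without_think = shared_head + "assistant\n<tool_call>...</tool_call>\n" + tail

reusable = common_prefix_len(with_empty_think, without_think)
recomputed = len(without_think) - reusable

print(f"reusable prefix: {reusable} chars")
print(f"must be reprocessed: {recomputed} chars")
```

Everything after the divergence point, including the long tool output, has to be processed again even though the two histories mean the same thing.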

The change is intentionally minimal:

- keep the historical `<think>` wrapper when `reasoning_content` is actually present
- do not emit an empty `<think>` block when there is no reasoning content

Without this guard, the template can produce prior turns like:

assistant
<think>

</think>

<tool_call>...

instead of rendering just the assistant content or tool call directly.
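One way to check a rendered prompt for this pattern is a small regular-expression scan. This is a sketch, not part of the PR: `count_empty_think_blocks` is a hypothetical helper, and it assumes the prompt is available as a plain string (for example, the untokenized output of a chat template).

```python
import re

# Matches a <think> block that contains only whitespace.
EMPTY_THINK = re.compile(r"<think>\s*</think>")


def count_empty_think_blocks(prompt: str) -> int:
    """Count <think>...</think> blocks whose body is empty or whitespace-only."""
    return len(EMPTY_THINK.findall(prompt))


empty_case = "assistant\n<think>\n\n</think>\n\n<tool_call>..."
real_case = "assistant\n<think>\nCheck the forecast first.\n</think>\n\n<tool_call>..."

print(count_empty_think_blocks(empty_case))
print(count_empty_think_blocks(real_case))
```

The scan does not distinguish historical turns from anything the template may add at the end of the prompt, so a non-zero count is a reason to inspect the rendering rather than proof of the bug.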

So this change preserves real reasoning content while avoiding empty reasoning scaffolding that can hurt caching behavior.

Edit: made a video explaining the bug
https://www.youtube.com/watch?v=3g70-ToSgr0

siberiamark • Apr 9

I think the template should be the same as the schema used in training.

latent-variable • Apr 9 • edited Apr 9

> I think the template should be the same as the schema used in training.

@siberiamark this change isn’t altering the live generation format the model relies on. It only avoids re-injecting empty historical <think></think> wrappers on later turns when there is no reasoning content there.

In practice, this was causing prompt drift and unnecessary cache invalidation across follow-up requests, while the model was already completing the original turns correctly.

more context here as well:
https://www.reddit.com/r/LocalLLaMA/

Edit: Made a video explaining the bug

latent-variable changed pull request title from fix chat template to avoid empty historical `<think>` blocks to fix historical assistant turn rendering in chat_template.jinja Apr 11

align historical assistant rendering with docs (26b92cbc)

latent-variable • Apr 11 • edited Apr 11

small update after more testing: i tried the stricter version that removes historical <think> blocks entirely, but i think that one is too aggressive.

it seems better for cache reuse, but it may affect reasoning behavior / separation in some cases.

so i’m reverting these prs back to the safer minimal fix:

{%- if loop.index0 > ns.last_query_index and reasoning_content %}

that still fixes the empty historical wrapper issue without changing historical turns as aggressively.

latent-variable changed pull request title from fix historical assistant turn rendering in chat_template.jinja to fix chat template to avoid empty historical `<think>` blocks Apr 11

revert to safer historical think guard (6b5df0bb)


Ready to merge

This branch is ready to get merged automatically.
