Instructions to use ubergarm/Qwen3.5-27B-GGUF with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use ubergarm/Qwen3.5-27B-GGUF with llama-cpp-python:

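# !pip install llama-cpp-python

from llama_cpp import Llama

# Download the GGUF file from the Hugging Face Hub (cached locally) and load it.
llm = Llama.from_pretrained(
	repo_id="ubergarm/Qwen3.5-27B-GGUF",
	filename="Qwen3.5-27B-IQ5_KS.gguf",
)

# Run a single-turn chat completion and print the assistant's reply.
response = llm.create_chat_completion(
	messages = [
		{
			"role": "user",
			"content": "What is the capital of France?"
		}
	]
)
print(response["choices"][0]["message"]["content"])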

Local Apps

llama.cpp

How to use ubergarm/Qwen3.5-27B-GGUF with llama.cpp:

Install (macOS, Linux)

curl -LsSf https://llama.app/install.sh | sh
# Start a local OpenAI-compatible server with a web UI:
llama serve -hf ubergarm/Qwen3.5-27B-GGUF:IQ4_NL
# Run inference directly in the terminal:
llama cli -hf ubergarm/Qwen3.5-27B-GGUF:IQ4_NL

Install from WinGet (Windows)

winget install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama serve -hf ubergarm/Qwen3.5-27B-GGUF:IQ4_NL
# Run inference directly in the terminal:
llama cli -hf ubergarm/Qwen3.5-27B-GGUF:IQ4_NL

Use pre-built binary

# Download pre-built binary from:
# https://github.com/ggml-org/llama.cpp/releases
# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf ubergarm/Qwen3.5-27B-GGUF:IQ4_NL
# Run inference directly in the terminal:
./llama-cli -hf ubergarm/Qwen3.5-27B-GGUF:IQ4_NL

Build from source code

git clone https://github.com/ggml-org/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli
# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf ubergarm/Qwen3.5-27B-GGUF:IQ4_NL
# Run inference directly in the terminal:
./build/bin/llama-cli -hf ubergarm/Qwen3.5-27B-GGUF:IQ4_NL
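
Once `llama serve` or `llama-server` is running, the OpenAI-compatible endpoint can be queried from Python as well. The following is a minimal sketch using only the standard library. It assumes the server listens on the default `http://localhost:8080/v1` and accepts the model id shown in the commands above; `build_chat_request` and `chat` are illustrative helper names, not part of llama.cpp.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8080/v1"          # assumption: default llama-server address
MODEL_ID = "ubergarm/Qwen3.5-27B-GGUF:IQ4_NL"  # model id as used in the commands above


def build_chat_request(prompt, base_url=BASE_URL, model=MODEL_ID):
    """Build a POST request for the /chat/completions endpoint (nothing is sent yet)."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt):
    """Send the prompt to the running server and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(prompt)) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]


# Usage, with the server already running:
#   print(chat("What is the capital of France?"))
```

Building the request separately from sending it makes it easy to inspect the payload before the server is up.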

Use Docker

docker model run hf.co/ubergarm/Qwen3.5-27B-GGUF:IQ4_NL


vLLM

How to use ubergarm/Qwen3.5-27B-GGUF with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "ubergarm/Qwen3.5-27B-GGUF"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "ubergarm/Qwen3.5-27B-GGUF",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker

docker model run hf.co/ubergarm/Qwen3.5-27B-GGUF:IQ4_NL

Ollama
How to use ubergarm/Qwen3.5-27B-GGUF with Ollama:
```
ollama run hf.co/ubergarm/Qwen3.5-27B-GGUF:IQ4_NL
```

Unsloth Studio

How to use ubergarm/Qwen3.5-27B-GGUF with Unsloth Studio:

Install Unsloth Studio (macOS, Linux, WSL)

curl -fsSL https://unsloth.ai/install.sh | sh
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for ubergarm/Qwen3.5-27B-GGUF to start chatting

Install Unsloth Studio (Windows)

irm https://unsloth.ai/install.ps1 | iex
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for ubergarm/Qwen3.5-27B-GGUF to start chatting

Using HuggingFace Spaces for Unsloth

# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for ubergarm/Qwen3.5-27B-GGUF to start chatting

Pi

How to use ubergarm/Qwen3.5-27B-GGUF with Pi:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama serve -hf ubergarm/Qwen3.5-27B-GGUF:IQ4_NL

Configure the model in Pi

# Install Pi:
npm install -g @mariozechner/pi-coding-agent
# Add to ~/.pi/agent/models.json:
{
  "providers": {
    "llama-cpp": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        {
          "id": "ubergarm/Qwen3.5-27B-GGUF:IQ4_NL"
        }
      ]
    }
  }
}

Run Pi

# Start Pi in your project directory:
pi

Hermes Agent

How to use ubergarm/Qwen3.5-27B-GGUF with Hermes Agent:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama serve -hf ubergarm/Qwen3.5-27B-GGUF:IQ4_NL

Configure Hermes

# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup
# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default ubergarm/Qwen3.5-27B-GGUF:IQ4_NL

Run Hermes

hermes

Docker Model Runner
How to use ubergarm/Qwen3.5-27B-GGUF with Docker Model Runner:
```
docker model run hf.co/ubergarm/Qwen3.5-27B-GGUF:IQ4_NL
```

Lemonade

How to use ubergarm/Qwen3.5-27B-GGUF with Lemonade:

Pull the model

# Download Lemonade from https://lemonade-server.ai/
lemonade pull ubergarm/Qwen3.5-27B-GGUF:IQ4_NL

Run and chat with the model

lemonade run user.Qwen3.5-27B-GGUF-IQ4_NL

List all available models

lemonade list

Unofficial models

by Austriani - opened Mar 8

Austriani, Mar 8 (edited Mar 8)

Hello, I'd like to ask whether you could make IQ4_KSS quantizations of unofficial AI models. As far as I can see, you are the only one making IQK quantizations on Hugging Face. I'd like someone to make an IQ4_KSS quant (or IQ4_KT, though it looks like you only make IQK quants) of https://huggingface.co/llmfan46/Qwen3.5-27B-heretic-v2

ubergarm (Owner), Mar 9

@Austriani

If you have 100GB of free disk space, you can quantize it yourself pretty quickly to any recipe you like, including copy-pasting my "secret recipe" for the 27B here and adjusting it to IQ4_KSS or IQ4_KT etc. Both are very nice for dense models with full GPU offload. I do like KT "trellis" quants for low BPW, but I find IQ4_KSS more generally applicable for CPU inference, and it is the same 4.0 BPW with similar PPL/KLD stats.
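
As a rough sanity check on those numbers, file size scales with parameter count times bits per weight. The sketch below assumes a nominal 27 billion parameters (taken from the "27B" in the model name, not measured) and ignores metadata and tensors kept at other precisions, so real files will differ somewhat.

```python
def estimate_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate size in GB (10^9 bytes) of a model stored at a given bits per weight."""
    return n_params * bits_per_weight / 8 / 1e9


N_PARAMS = 27e9  # assumption: nominal count implied by the "27B" name

bf16_gb = estimate_size_gb(N_PARAMS, 16.0)  # bf16 uses 16 bits per weight
q4_gb = estimate_size_gb(N_PARAMS, 4.0)     # IQ4_KSS / IQ4_KT at about 4.0 BPW

print(f"bf16: ~{bf16_gb:.1f} GB, 4.0 BPW quant: ~{q4_gb:.1f} GB")
# prints "bf16: ~54.0 GB, 4.0 BPW quant: ~13.5 GB"
```

The original safetensors and the converted bf16 GGUF are each roughly the bf16 size, so by this estimate those two copies account for most of the disk space mentioned above.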

Basically:

1. Download the full bf16 safetensors from the original repo.
2. Convert them to GGUF with mainline llama.cpp's convert_hf_to_gguf.py.
3. You can use the imatrix from here: https://huggingface.co/AesSedai/Qwen3.5-122B-A10B-GGUF/resolve/main/mmproj-Qwen3.5-122B-A10B-BF16.gguf
4. Convert the GGUF imatrix file to .dat format with mainline's llama-imatrix so that ik_llama.cpp can use it.
5. Run llama-quantize using the .dat imatrix file and my 'secret recipe' adjusted to your liking.
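
The steps above can be sketched as a small Python helper that only assembles the command lines, without running anything. This is a hedged outline rather than the exact commands used for this repo: the file and folder names are hypothetical, `imatrix.gguf` is a placeholder for whichever imatrix file you download, and the `--in-file` / `--output-format dat` flags for mainline `llama-imatrix` are assumptions that should be checked against `llama-imatrix --help`. The per-tensor recipe is left out because it is not reproduced in this thread.

```python
import shlex
from pathlib import Path


def build_quantize_pipeline(repo_id, work_dir, imatrix_gguf, quant_type="IQ4_KSS"):
    """Return one argv list per step; nothing is executed here."""
    work = Path(work_dir)
    src_dir = work / "hf-bf16"                 # hypothetical folder for the safetensors
    bf16_gguf = work / "model-bf16.gguf"       # hypothetical output of the conversion
    imatrix_dat = work / "imatrix.dat"         # legacy-format imatrix for ik_llama.cpp
    out_gguf = work / f"model-{quant_type}.gguf"
    return [
        # Download the full bf16 safetensors from the original repo.
        ["huggingface-cli", "download", repo_id, "--local-dir", str(src_dir)],
        # Convert to a bf16 GGUF with mainline llama.cpp's converter script.
        ["python", "convert_hf_to_gguf.py", str(src_dir),
         "--outtype", "bf16", "--outfile", str(bf16_gguf)],
        # Convert the GGUF imatrix to .dat (flags are an assumption, see above).
        ["llama-imatrix", "--in-file", str(imatrix_gguf),
         "--output-format", "dat", "-o", str(imatrix_dat)],
        # Quantize with ik_llama.cpp's llama-quantize using the .dat imatrix.
        ["llama-quantize", "--imatrix", str(imatrix_dat),
         str(bf16_gguf), str(out_gguf), quant_type],
    ]


steps = build_quantize_pipeline(
    repo_id="llmfan46/Qwen3.5-27B-heretic-v2",
    work_dir="quant-work",
    imatrix_gguf="imatrix.gguf",  # placeholder for the downloaded imatrix file
)
for argv in steps:
    print(shlex.join(argv))
```

Printing the commands first makes it easy to review paths and swap `quant_type` (for example to `IQ4_KT`) before running each step by hand.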

This doesn't take any VRAM. Even on a modest gaming rig it won't take too long, and it will probably need less than 32GB of RAM in total.

Let me know if you need any help!

Austriani, Mar 9 (edited Mar 9)

Thank you! Do you think 32GB of RAM is enough? I only have an Intel Core Ultra 7 265K with 32GB of 6800 MHz RAM and no dGPU. Do you think I can make quants of this model on that?

Either way, thank you for your help. I think I will try it anyway.

Austriani changed discussion status to closed Mar 9

ubergarm (Owner), Mar 9 (edited Mar 9)

@Austriani

I make all my quants with exactly 0 VRAM haha...

Yes, the most demanding step (in terms of hardware and RAM) is generating the imatrix from the full-size bf16 model, since you must be able to run inference on it. But if someone else has already made the imatrix for you, llama-quantize itself takes very few resources by comparison. A fast NVMe drive is nice for the disk I/O, but if you're patient it should be fine.

If you want a very high-level overview of the process, you can check my recent talk: https://blog.aifoundry.org/p/adventures-in-model-quantization and I also have a very old (out-of-date) quant cookers guide here: https://github.com/ikawrakow/ik_llama.cpp/discussions/434

I also have more up-to-date commands in my recent logs/ folders.

Let me know if you get stuck on any part, good luck!
