Appraisal

by wonderfuldestruction - opened Mar 17

Discussion

wonderfuldestruction

Mar 17

Hey Ubergarm,

Quick thank you for releasing this quant.

On my local benchmarks on an RTX 5090, it scores on par with Unsloth's Q6_K, which has been critical for my own work.

It definitely goes much further in context window for the same memory consumption, with quicker PP+TG.

Thanks again! Keep up the amazing work.

ubergarm

Owner Mar 17

@wonderfuldestruction

Thanks! That is amazing to hear!

Some folks have been asking me to do an actual ik_llama.cpp quant as well, since the one I released was a mainline-compatible experiment.

I just released a good Qwen3.5-35B-A3B if you're interested in that, but it does require ik_llama.cpp: https://huggingface.co/ubergarm/Qwen3.5-35B-A3B-GGUF/blob/main/Qwen3.5-35B-A3B-IQ4_KS.gguf

(The model card has a quick start to run it; your 32GB of VRAM is perfect for the full 256k context and mmproj with it.)

krzysztofma

Mar 22 • edited Mar 22

Hi Ubergarm,

First of all, thanks for your efforts and contributions!
I have a question regarding the IQ4_NL quant: you wrote that it requires ik_llama.cpp, but it seems to work on recent llama.cpp versions too, e.g. the one that comes with recent LM Studio. I read that mainstream llama.cpp has supported the native IQ*_NL quants for a while, but has issues (or crashes) with CPU/RAM offloading and is not that fast.
Is that true?
Also, do the IQ*_K variants yield better quality (lower PPL) than IQ*_NL? I think I saw your graph for some other model where the *_NL quant had the lowest PPL at similar BPW, but the official page says that *_K should offer the best quality.

ubergarm

Owner Mar 22

@krzysztofma

Thanks!

Yes, mainline llama.cpp supports all of the quantization types used in this smol-IQ4_NL 15.405 GiB (4.920 BPW). Sorry for the confusion: I typically release ik_llama.cpp-only quantizations, but recently I did some mainline-compatible quantizations as experiments (some types may be faster on Mac or the Vulkan backend).
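As a rough sanity check on the size and BPW figures quoted above, the two numbers together imply an approximate weight count. The sketch below is only back-of-the-envelope arithmetic: it assumes the whole file is quantized weights and ignores GGUF metadata, and the helper name `implied_params` is made up for illustration.

```python
GIB = 2**30  # bytes per GiB


def implied_params(size_gib: float, bpw: float) -> float:
    """Approximate weight count implied by a file size and an average bits-per-weight."""
    total_bits = size_gib * GIB * 8
    return total_bits / bpw


# Figures quoted for the smol-IQ4_NL file: 15.405 GiB at 4.920 BPW.
params = implied_params(15.405, 4.920)
print(f"{params / 1e9:.1f}B weights")  # 26.9B weights
```

The result lands close to the 27B in the model name, so the quoted size and BPW are consistent with each other.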
In general, yes: the ik_llama.cpp-exclusive quantization types tend to have lower perplexity than mainline llama.cpp quantizations at similar BPW. ik, the person who implemented many of the mainline types, continued the work in his own fork to provide improved types, which mainline unfortunately will not accept back.
It is complex. If you are interested, I have a discussion video here: https://blog.aifoundry.org/p/adventures-in-model-quantization
For example, both iq4_nl and iq4_k are 4.5 bpw and were implemented by the same person, ik. iq4_nl uses one scale per block of 32 weights, while iq4_k uses super-blocks of 256 weights with a super-block scale.
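To make the 4.5 bpw figure concrete, here is a minimal sketch of the arithmetic. The iq4_nl layout (32 four-bit indices plus one fp16 scale per block) follows the mainline ggml block definition as I understand it. The 128 bits of per-super-block overhead used for iq4_k is an assumption chosen to be consistent with the 4.5 bpw stated above, not something confirmed in this thread.

```python
def bits_per_weight(weights_per_block: int, quant_bits: int, overhead_bits: int) -> float:
    """Average bits per weight for a block: quantized values plus scale/metadata overhead."""
    return (weights_per_block * quant_bits + overhead_bits) / weights_per_block


# iq4_nl: block of 32 weights, 4-bit indices, one 16-bit (fp16) scale per block.
iq4_nl_bpw = bits_per_weight(weights_per_block=32, quant_bits=4, overhead_bits=16)

# iq4_k: super-block of 256 weights, 4-bit indices.
# ASSUMPTION: 128 bits of scales/metadata per super-block (not stated in the thread).
iq4_k_bpw = bits_per_weight(weights_per_block=256, quant_bits=4, overhead_bits=128)

print(iq4_nl_bpw, iq4_k_bpw)  # 4.5 4.5
```

Both types spend the same number of bits per weight; the difference is in how that overhead is organized, as one scale per 32 weights versus finer-grained scale data packed inside a 256-weight super-block.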

Cheers!

krzysztofma

Mar 22

@ubergarm
Great! Thanks very much for the clarification.

Re 3: Thanks, I will watch it, and hopefully understand something :)
Yeah, the more I learn about different quants, and the resulting perplexity/KLD vs actual accuracy in established benchmarks, the more complicated and nuanced the topic becomes. For instance, I thought NVFP4 (for Blackwell) or MXFP4 would be the future for limited-VRAM GPUs, but it looks like it's usually way worse than FP8 for the time being. For now, it seems like some good Q4_K* or Q5_K* (and IQ4 of course) quants can still yield significantly better accuracy overall. I'm sure you've seen this, but it's interesting:
https://unsloth.ai/docs/models/qwen3.5/gguf-benchmarks#id-2-imatrix-works-very-well

BTW: what do you think about AWQ? It seems like a great idea, but unfortunately it's still not that popular. I guess mainly because it's not supported by llama.cpp etc.?

ubergarm

Owner Mar 23

@krzysztofma

> the more complicated and nuanced the topic becomes

yes it is a very fun rabbit hole! haha... There is a lot of noise, confusion, and misinformation on r/LocalLLaMA and HN too 😅

Right, I have no idea how MXFP4 became popular, as the original PR that adds it is very clear that it is only good for gpt-oss QAT models:

> But don't get excited about using mxfp4 to quantize other models to fp4. The zero-bit mantissa in the block scales, along with the E2M1 choice for the 4-bit floats, results in a horrible quantization accuracy for the 4.25 bpw spent (about the same as IQ3_K), unless the model was directly trained with this specific fp4 variant (as the gpt-oss models).

https://github.com/ikawrakow/ik_llama.cpp/pull/682
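To illustrate what the quoted PR comment is pointing at, here is a small sketch that enumerates the values an E2M1 4-bit float can represent and reproduces the 4.25 bpw figure. It assumes the OCP Microscaling layout as I understand it: 1 sign bit, 2 exponent bits with bias 1, 1 mantissa bit, no inf/NaN, and one 8-bit power-of-two scale shared by a block of 32 values. The function name `decode_e2m1` is made up for illustration.

```python
def decode_e2m1(code: int) -> float:
    """Decode a 4-bit E2M1 code (1 sign, 2 exponent, 1 mantissa bit; exponent bias 1)."""
    sign = -1.0 if code & 0b1000 else 1.0
    exp = (code >> 1) & 0b11
    man = code & 0b1
    if exp == 0:
        # Subnormal range: only 0 and 0.5 are representable.
        return sign * man * 0.5
    return sign * (1 + man / 2) * 2.0 ** (exp - 1)


# All distinct magnitudes available to a single quantized weight.
magnitudes = sorted({abs(decode_e2m1(c)) for c in range(16)})
print(magnitudes)  # [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

# Assumed block layout: 32 four-bit values plus one 8-bit exponent-only scale.
mxfp4_bpw = (32 * 4 + 8) / 32
print(mxfp4_bpw)  # 4.25
```

With only eight magnitudes per weight and a block scale restricted to powers of two (the "zero-bit mantissa"), the grid cannot be stretched to fit a block's actual range very precisely, which is consistent with the accuracy concern raised in the quote.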

Oh yes, unsloth learns a lot from me, AesSedai, and bartowski to inform and keep them improving their recipes hehe... 😋

> BTW: what do you think about AWQ

I don't use vLLM so much, as it is more of a full GPU offload, multi-user optimized environment. But a guy named Phaelon on the https://huggingface.co/BeaverAI Discord knows a lot about it. I believe there are some AWQ quant combinations, kernels, and calibration methods that can give decent results.
