Appraisal
Hey Ubergarm,
Quick thank you for releasing this quant.
On my local benches it scores on par with Unsloth's Q6_K on an RTX 5090, which has been critical for my own work.
It definitely goes much further in context window for the same memory consumption, with quicker PP+TG (prompt processing and token generation).
Thanks again! Keep up the amazing work.
Thanks! That is amazing to hear!
Some folks have been asking me to do an actual ik_llama.cpp quant as well, since the one I released was a mainline-compatible experiment.
I just released a good Qwen3.5-35B-A3B if you're interested in that, but it does require ik_llama.cpp: https://huggingface.co/ubergarm/Qwen3.5-35B-A3B-GGUF/blob/main/Qwen3.5-35B-A3B-IQ4_KS.gguf
(The model card has a quick start to run it; your 32GB of VRAM is perfect for full 256k context and mmproj with it.)
Hi Ubergarm,
First of all, thanks for your efforts and contributions!
I have a question regarding the IQ4_NL quant: you wrote that it requires ik_llama.cpp, but it seems to work on recent llama.cpp versions too, e.g. the one that comes with recent LM Studio. I read that mainline llama.cpp has supported the native IQ*_NL quants for a while, but has issues (or crashes) with CPU/RAM offloading and is not that fast.
Is it true?
Also, do the IQ*_K variants yield better quality (lower PPL) than IQ*_NL? I think I saw your graph for some other model where the *_NL quant had the lowest PPL at similar BPW, but the official page says that *_K should offer the best quality.
Thanks!
Yes, mainline llama.cpp supports all of the quantization types used in this smol-IQ4_NL 15.405 GiB (4.920 BPW). Sorry for the confusion: I typically release ik_llama.cpp-only quantizations, but recently I did some mainline-compatible quantizations as experiments (some types may be faster on Mac or the Vulkan backend).
In general, yes, the ik_llama.cpp-exclusive quantization types tend to have lower perplexity than similar-BPW quantizations from mainline llama.cpp. The guy who implemented many of the mainline types, ik, continued work in his own fork to provide improved types, which mainline unfortunately will not accept back.
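The size and BPW figures quoted for smol-IQ4_NL can be cross-checked against each other, since together they imply a weight count. A minimal sketch, assuming the size is in GiB (2^30 bytes) and the BPW is an average over every weight in the file; the function name is illustrative and file metadata is ignored:

```python
def implied_weight_count(size_gib: float, bpw: float) -> float:
    """Number of weights implied by a file size in GiB and an average bits-per-weight."""
    total_bits = size_gib * (1 << 30) * 8  # GiB -> bytes -> bits
    return total_bits / bpw


# Figures quoted in the post for smol-IQ4_NL.
weights = implied_weight_count(15.405, 4.920)
print(f"{weights / 1e9:.1f}B weights")  # 26.9B weights
```

The result lands close to the 27B in the model name, which suggests the two figures are consistent with each other.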
It is complex. If you are interested, I have a discussion video here: https://blog.aifoundry.org/p/adventures-in-model-quantization
For example, both iq4_nl and iq4_k are 4.5 bpw and were implemented by ik. iq4_nl uses blocks of 32 weights, each with its own scale, while iq4_k uses super-blocks of 256 weights with a super-block scale.
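The 4.5 bpw figure can be reproduced with simple block arithmetic. A rough sketch, assuming iq4_nl stores 32 four-bit indices plus one 16-bit scale per block; the helper names are illustrative, and for iq4_k only the overhead budget implied by 4.5 bpw is computed, not its actual layout:

```python
def bits_per_weight(weights_per_block: int, index_bits: int, overhead_bits: int) -> float:
    """Average bits per weight for a block of quantized indices plus per-block overhead."""
    return (weights_per_block * index_bits + overhead_bits) / weights_per_block


def overhead_budget_bits(weights_per_block: int, index_bits: int, target_bpw: float) -> float:
    """Bits left for scales and other metadata in one block at a given target bpw."""
    return weights_per_block * (target_bpw - index_bits)


# Assumption: one 16-bit scale per 32-weight block for iq4_nl.
iq4_nl_bpw = bits_per_weight(32, 4, 16)
# What 4.5 bpw leaves for scales in a 256-weight super-block.
iq4_k_overhead = overhead_budget_bits(256, 4, 4.5)

print(iq4_nl_bpw)      # 4.5
print(iq4_k_overhead)  # 128.0
```

Both types spend the same half bit per weight on overhead; the difference is in how that budget is organized, per 32-weight block versus across a 256-weight super-block.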
Cheers!
@ubergarm
Great! Thanks very much for clarification.
Re 3: Thanks, I will watch it and hopefully understand something :)
Yeah, the more I learn about different quants, and the resulting perplexity/KLD vs actual accuracy in established benchmarks, the more complicated and nuanced the topic becomes. For instance, I thought NVFP4 (for Blackwell) or MXFP4 would be the future for limited-VRAM GPUs, but it looks like it's usually way worse than FP8 for the time being. For now, it seems like some good Q4_K* or Q5_K* (and IQ4_* of course) quants can still yield significantly better accuracy overall. I'm sure you've seen this, but it's interesting:
https://unsloth.ai/docs/models/qwen3.5/gguf-benchmarks#id-2-imatrix-works-very-well
BTW: what do you think about AWQ? It seems like a great idea, but unfortunately it's still not that popular. I guess mainly because it's not supported by llama.cpp etc.?
> the more complicated and nuanced the topic becomes
Yes, it is a very fun rabbit hole! haha... There is a lot of noise, confusion, and misinformation on r/LocalLLaMA and HN too.
Right, I have no idea how MXFP4 became popular, as the original PR that adds it is very clear that it is only good for gpt-oss QAT models:
> But don't get excited about using mxfp4 to quantize other models to fp4. The zero-bit mantissa in the block scales, along with the E2M1 choice for the 4-bit floats, results in a horrible quantization accuracy for the 4.25 bpw spent (about the same as IQ3_K), unless the model was directly trained with this specific fp4 variant (as the gpt-oss models).
https://github.com/ikawrakow/ik_llama.cpp/pull/682
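To make the quoted point concrete, here is a rough sketch of MXFP4-style block quantization. It assumes 32 weights per block, E2M1 elements (magnitudes 0, 0.5, 1, 1.5, 2, 3, 4, 6) and a power-of-two block scale stored in 8 bits; the scale selection rule and function name are illustrative simplifications, not the llama.cpp or ik_llama.cpp implementation:

```python
import math

# Magnitudes representable by a 4-bit E2M1 float (sign bit handled separately).
E2M1_MAGNITUDES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]


def quantize_block_mxfp4(block):
    """Round a block to E2M1 values under one power-of-two scale; returns (exponent, dequantized)."""
    amax = max(abs(x) for x in block)
    if amax == 0.0:
        return 0, [0.0] * len(block)
    # Zero-bit mantissa scale: only powers of two are available.
    # Illustrative choice: the smallest power of two that keeps amax within 6.0.
    exponent = math.ceil(math.log2(amax / 6.0))
    scale = 2.0 ** exponent
    dequantized = []
    for x in block:
        nearest = min(E2M1_MAGNITUDES, key=lambda m: abs(abs(x) / scale - m))
        dequantized.append(math.copysign(nearest * scale, x))
    return exponent, dequantized


# 32 four-bit elements plus one 8-bit scale per block.
mxfp4_bpw = (32 * 4 + 8) / 32
print(mxfp4_bpw)  # 4.25

exponent, values = quantize_block_mxfp4([6.0, 1.0, -3.0, 0.4])
print(exponent, values)  # 0 [6.0, 1.0, -3.0, 0.5]
```

With only eight magnitudes per element and scales restricted to powers of two, the grid is coarse, which is in line with the quote's point that the format suits models trained for it rather than post-hoc quantization.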
Oh yes, unsloth learns a lot from me, AesSedai, and bartowski to inform and keep them improving their recipes hehe...
> BTW: what do you think about AWQ
I don't use vLLM much, as it is more of a full-GPU-offload, multi-user optimized environment. But a guy named Phaelon on the https://huggingface.co/BeaverAI discord knows a lot about it. I believe there are some AWQ quant combinations, kernels, and calibration methods that can give decent results.