Instructions to use michaelw9999/Qwopus3.6-27B-v2-MTP-NVFP4-GGUF with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- llama-cpp-python
How to use michaelw9999/Qwopus3.6-27B-v2-MTP-NVFP4-GGUF with llama-cpp-python:
# !pip install llama-cpp-python from llama_cpp import Llama llm = Llama.from_pretrained( repo_id="michaelw9999/Qwopus3.6-27B-v2-MTP-NVFP4-GGUF", filename="Qwopus3.6-27B-v2-MTP-NVFP4-GGUF.gguf", )
llm.create_chat_completion( messages = "No input example has been defined for this model task." )
- Notebooks
- Google Colab
- Kaggle
- Local Apps Settings
- llama.cpp
How to use michaelw9999/Qwopus3.6-27B-v2-MTP-NVFP4-GGUF with llama.cpp:
Install from brew
brew install llama.cpp # Start a local OpenAI-compatible server with a web UI: llama-server -hf michaelw9999/Qwopus3.6-27B-v2-MTP-NVFP4-GGUF:NVFP4 # Run inference directly in the terminal: llama-cli -hf michaelw9999/Qwopus3.6-27B-v2-MTP-NVFP4-GGUF:NVFP4
Install from WinGet (Windows)
winget install llama.cpp # Start a local OpenAI-compatible server with a web UI: llama-server -hf michaelw9999/Qwopus3.6-27B-v2-MTP-NVFP4-GGUF:NVFP4 # Run inference directly in the terminal: llama-cli -hf michaelw9999/Qwopus3.6-27B-v2-MTP-NVFP4-GGUF:NVFP4
Use pre-built binary
# Download pre-built binary from: # https://github.com/ggerganov/llama.cpp/releases # Start a local OpenAI-compatible server with a web UI: ./llama-server -hf michaelw9999/Qwopus3.6-27B-v2-MTP-NVFP4-GGUF:NVFP4 # Run inference directly in the terminal: ./llama-cli -hf michaelw9999/Qwopus3.6-27B-v2-MTP-NVFP4-GGUF:NVFP4
Build from source code
git clone https://github.com/ggerganov/llama.cpp.git cd llama.cpp cmake -B build cmake --build build -j --target llama-server llama-cli # Start a local OpenAI-compatible server with a web UI: ./build/bin/llama-server -hf michaelw9999/Qwopus3.6-27B-v2-MTP-NVFP4-GGUF:NVFP4 # Run inference directly in the terminal: ./build/bin/llama-cli -hf michaelw9999/Qwopus3.6-27B-v2-MTP-NVFP4-GGUF:NVFP4
Use Docker
docker model run hf.co/michaelw9999/Qwopus3.6-27B-v2-MTP-NVFP4-GGUF:NVFP4
- LM Studio
- Jan
- Ollama
How to use michaelw9999/Qwopus3.6-27B-v2-MTP-NVFP4-GGUF with Ollama:
ollama run hf.co/michaelw9999/Qwopus3.6-27B-v2-MTP-NVFP4-GGUF:NVFP4
- Unsloth Studio
How to use michaelw9999/Qwopus3.6-27B-v2-MTP-NVFP4-GGUF with Unsloth Studio:
Install Unsloth Studio (macOS, Linux, WSL)
curl -fsSL https://unsloth.ai/install.sh | sh # Run unsloth studio unsloth studio -H 0.0.0.0 -p 8888 # Then open http://localhost:8888 in your browser # Search for michaelw9999/Qwopus3.6-27B-v2-MTP-NVFP4-GGUF to start chatting
Install Unsloth Studio (Windows)
irm https://unsloth.ai/install.ps1 | iex # Run unsloth studio unsloth studio -H 0.0.0.0 -p 8888 # Then open http://localhost:8888 in your browser # Search for michaelw9999/Qwopus3.6-27B-v2-MTP-NVFP4-GGUF to start chatting
Using HuggingFace Spaces for Unsloth
# No setup required # Open https://huggingface.co/spaces/unsloth/studio in your browser # Search for michaelw9999/Qwopus3.6-27B-v2-MTP-NVFP4-GGUF to start chatting
- Pi
How to use michaelw9999/Qwopus3.6-27B-v2-MTP-NVFP4-GGUF with Pi:
Start the llama.cpp server
# Install llama.cpp: brew install llama.cpp # Start a local OpenAI-compatible server: llama-server -hf michaelw9999/Qwopus3.6-27B-v2-MTP-NVFP4-GGUF:NVFP4
Configure the model in Pi
# Install Pi: npm install -g @mariozechner/pi-coding-agent # Add to ~/.pi/agent/models.json: { "providers": { "llama-cpp": { "baseUrl": "http://localhost:8080/v1", "api": "openai-completions", "apiKey": "none", "models": [ { "id": "michaelw9999/Qwopus3.6-27B-v2-MTP-NVFP4-GGUF:NVFP4" } ] } } }Run Pi
# Start Pi in your project directory: pi
- Hermes Agent new
How to use michaelw9999/Qwopus3.6-27B-v2-MTP-NVFP4-GGUF with Hermes Agent:
Start the llama.cpp server
# Install llama.cpp: brew install llama.cpp # Start a local OpenAI-compatible server: llama-server -hf michaelw9999/Qwopus3.6-27B-v2-MTP-NVFP4-GGUF:NVFP4
Configure Hermes
# Install Hermes: curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash hermes setup # Point Hermes at the local server: hermes config set model.provider custom hermes config set model.base_url http://127.0.0.1:8080/v1 hermes config set model.default michaelw9999/Qwopus3.6-27B-v2-MTP-NVFP4-GGUF:NVFP4
Run Hermes
hermes
- Atomic Chat new
- Docker Model Runner
How to use michaelw9999/Qwopus3.6-27B-v2-MTP-NVFP4-GGUF with Docker Model Runner:
docker model run hf.co/michaelw9999/Qwopus3.6-27B-v2-MTP-NVFP4-GGUF:NVFP4
- Lemonade
How to use michaelw9999/Qwopus3.6-27B-v2-MTP-NVFP4-GGUF with Lemonade:
Pull the model
# Download Lemonade from https://lemonade-server.ai/ lemonade pull michaelw9999/Qwopus3.6-27B-v2-MTP-NVFP4-GGUF:NVFP4
Run and chat with the model
lemonade run user.Qwopus3.6-27B-v2-MTP-NVFP4-GGUF-NVFP4
List all available models
lemonade list
This is the second release of the NVFP4 version of Jackrong's Qwopus3.6-27B-v2-MTP-GGUF.
9-June-2026: Fixed MTP heads to NVFP4; performance with MTP is much improved (108tk/s tg)
Please note, I am not affiliated; this is my own quantization effort made with my experimental work-in-progress advanced-gguf-quantizer.
More evaluatlions will be underway, this page will be updated when those are complete.
Feedback on how to improve the quantizer/this quantization is appreciated.
For improved performance and quality, try my llama.cpp NVFP4-Repack with MXFP6 from:
https://github.com/michaelw9999/llama.cpp/tree/nvfp4repack_mxfp6_cuda
These branches are updated regularly.
NVFP4 repack preloads all tensors into a CUDA tile to boost speed. It is a tiny bit slower on first load, then provides ~10% prefill boost with a small reduction in token gen seen on larger models, and an increase on smaller models.
However, it also enables NVFP4 input scale, which boosts model correctness.
Initial performance results:
llama-bench on 5090:
qwen35 27B NVFP4 | 15.14 GiB |27.32 B | CUDA | pp512 | 5958.06 ± 6.46 |
qwen35 27B NVFP4 | 15.14 GiB |27.32 B | CUDA | tg128 | 73.72 ± 0.08 |
Perplexity/kld results against wiki2 test:
====== Perplexity statistics ======
Mean PPL(Q) : 6.959259 ± 0.045420
Mean PPL(base) : 6.694312 ± 0.042968
Cor(ln(PPL(Q)), ln(PPL(base))): 98.93%
Mean ln(PPL(Q)/PPL(base)) : 0.038815 ± 0.000953
Mean PPL(Q)/PPL(base) : 1.039578 ± 0.000991
Mean PPL(Q)-PPL(base) : 0.264947 ± 0.006915
====== KL divergence statistics ======
Mean KLD: 0.045311 ± 0.000650
Maximum KLD: 19.236860
99.9% KLD: 2.298393
99.0% KLD: 0.423422
95.0% KLD: 0.135200
90.0% KLD: 0.081522
Median KLD: 0.017625
10.0% KLD: 0.000521
5.0% KLD: 0.000152
1.0% KLD: 0.000024
0.1% KLD: 0.000004
Minimum KLD: -0.000062
====== Token probability statistics ======
Mean Δp: -0.450 ± 0.016 %
Maximum Δp: 99.955%
99.9% Δp: 32.792%
99.0% Δp: 13.999%
95.0% Δp: 6.515%
90.0% Δp: 3.788%
75.0% Δp: 0.717%
Median Δp: -0.012%
25.0% Δp: -1.153%
10.0% Δp: -4.684%
5.0% Δp: -8.120%
1.0% Δp: -20.878%
0.1% Δp: -55.713%
Minimum Δp: -99.221%
RMS Δp : 6.022 ± 0.055 %
Same top p: 91.085 ± 0.074 %
- Downloads last month
- 2,466
4-bit