Instructions to use sakamakismile/GLM-5.1-Abliterated-Dynamic-IQ3 with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- llama-cpp-python
How to use sakamakismile/GLM-5.1-Abliterated-Dynamic-IQ3 with llama-cpp-python:
# !pip install llama-cpp-python from llama_cpp import Llama llm = Llama.from_pretrained( repo_id="sakamakismile/GLM-5.1-Abliterated-Dynamic-IQ3", filename="GLM-5.1-Abliterated-Dynamic-IQ3-340-00001-of-00008-00001-of-00018.gguf", )
llm.create_chat_completion( messages = "No input example has been defined for this model task." )
- Notebooks
- Google Colab
- Kaggle
- Local Apps Settings
- llama.cpp
How to use sakamakismile/GLM-5.1-Abliterated-Dynamic-IQ3 with llama.cpp:
Install from brew
brew install llama.cpp # Start a local OpenAI-compatible server with a web UI: llama-server -hf sakamakismile/GLM-5.1-Abliterated-Dynamic-IQ3 # Run inference directly in the terminal: llama-cli -hf sakamakismile/GLM-5.1-Abliterated-Dynamic-IQ3
Install from WinGet (Windows)
winget install llama.cpp # Start a local OpenAI-compatible server with a web UI: llama-server -hf sakamakismile/GLM-5.1-Abliterated-Dynamic-IQ3 # Run inference directly in the terminal: llama-cli -hf sakamakismile/GLM-5.1-Abliterated-Dynamic-IQ3
Use pre-built binary
# Download pre-built binary from: # https://github.com/ggerganov/llama.cpp/releases # Start a local OpenAI-compatible server with a web UI: ./llama-server -hf sakamakismile/GLM-5.1-Abliterated-Dynamic-IQ3 # Run inference directly in the terminal: ./llama-cli -hf sakamakismile/GLM-5.1-Abliterated-Dynamic-IQ3
Build from source code
git clone https://github.com/ggerganov/llama.cpp.git cd llama.cpp cmake -B build cmake --build build -j --target llama-server llama-cli # Start a local OpenAI-compatible server with a web UI: ./build/bin/llama-server -hf sakamakismile/GLM-5.1-Abliterated-Dynamic-IQ3 # Run inference directly in the terminal: ./build/bin/llama-cli -hf sakamakismile/GLM-5.1-Abliterated-Dynamic-IQ3
Use Docker
docker model run hf.co/sakamakismile/GLM-5.1-Abliterated-Dynamic-IQ3
- LM Studio
- Jan
- Ollama
How to use sakamakismile/GLM-5.1-Abliterated-Dynamic-IQ3 with Ollama:
ollama run hf.co/sakamakismile/GLM-5.1-Abliterated-Dynamic-IQ3
- Unsloth Studio
How to use sakamakismile/GLM-5.1-Abliterated-Dynamic-IQ3 with Unsloth Studio:
Install Unsloth Studio (macOS, Linux, WSL)
curl -fsSL https://unsloth.ai/install.sh | sh # Run unsloth studio unsloth studio -H 0.0.0.0 -p 8888 # Then open http://localhost:8888 in your browser # Search for sakamakismile/GLM-5.1-Abliterated-Dynamic-IQ3 to start chatting
Install Unsloth Studio (Windows)
irm https://unsloth.ai/install.ps1 | iex # Run unsloth studio unsloth studio -H 0.0.0.0 -p 8888 # Then open http://localhost:8888 in your browser # Search for sakamakismile/GLM-5.1-Abliterated-Dynamic-IQ3 to start chatting
Using HuggingFace Spaces for Unsloth
# No setup required # Open https://huggingface.co/spaces/unsloth/studio in your browser # Search for sakamakismile/GLM-5.1-Abliterated-Dynamic-IQ3 to start chatting
- Pi
How to use sakamakismile/GLM-5.1-Abliterated-Dynamic-IQ3 with Pi:
Start the llama.cpp server
# Install llama.cpp: brew install llama.cpp # Start a local OpenAI-compatible server: llama-server -hf sakamakismile/GLM-5.1-Abliterated-Dynamic-IQ3
Configure the model in Pi
# Install Pi: npm install -g @mariozechner/pi-coding-agent # Add to ~/.pi/agent/models.json: { "providers": { "llama-cpp": { "baseUrl": "http://localhost:8080/v1", "api": "openai-completions", "apiKey": "none", "models": [ { "id": "sakamakismile/GLM-5.1-Abliterated-Dynamic-IQ3" } ] } } }Run Pi
# Start Pi in your project directory: pi
- Hermes Agent new
How to use sakamakismile/GLM-5.1-Abliterated-Dynamic-IQ3 with Hermes Agent:
Start the llama.cpp server
# Install llama.cpp: brew install llama.cpp # Start a local OpenAI-compatible server: llama-server -hf sakamakismile/GLM-5.1-Abliterated-Dynamic-IQ3
Configure Hermes
# Install Hermes: curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash hermes setup # Point Hermes at the local server: hermes config set model.provider custom hermes config set model.base_url http://127.0.0.1:8080/v1 hermes config set model.default sakamakismile/GLM-5.1-Abliterated-Dynamic-IQ3
Run Hermes
hermes
- Atomic Chat new
- Docker Model Runner
How to use sakamakismile/GLM-5.1-Abliterated-Dynamic-IQ3 with Docker Model Runner:
docker model run hf.co/sakamakismile/GLM-5.1-Abliterated-Dynamic-IQ3
- Lemonade
How to use sakamakismile/GLM-5.1-Abliterated-Dynamic-IQ3 with Lemonade:
Pull the model
# Download Lemonade from https://lemonade-server.ai/ lemonade pull sakamakismile/GLM-5.1-Abliterated-Dynamic-IQ3
Run and chat with the model
lemonade run user.GLM-5.1-Abliterated-Dynamic-IQ3-{{QUANT_TAG}}List all available models
lemonade list
Thanks
Thank you for this - and for the careful attribution, it's genuinely appreciated. The dynamic-IQ3 strategy is well thought out; keeping attention and the DSA indexer at q8_0 while compressing the middle routed experts hardest is exactly the right MoE-aware approach for this architecture, and the benchmark numbers are useful data I didn't have. Glad people can actually run it now.
One thing worth sharing in the interest of honesty: v1 has a known limitation I'm working on. The healing LoRA was rank-4, which saturated too early for a 754B model, so there's measurable capability loss versus base that a properly-sized healing pass should recover. A v2 is planned. If you're interested, I'd be glad to give you a heads-up when it lands so you can re-quant from a better base - your pipeline clearly produces a clean artifact and it'd be good to keep the lineage going.
Thank you so much for the kind words and for sharing those insights β it genuinely means a lot coming from you. Your work on the abliteration and the FP8 base is what made this quantization possible in the first place, so the credit really belongs upstream.
Your dynamic-IQ3 feedback is encouraging; the per-tensor strategy was designed specifically around the MoE architecture you described, and I'm glad the benchmark data is useful for your own work too.
I do need to be fully transparent about the current state, though β and I'd value your eyes on this if you have any intuition. Despite the smoke-test passing on throughput (~40-51 tok/s across 6Γ RTX PRO 6000 Blackwell), the model currently produces garbage output on every CUDA configuration we've tested. Greedy decoding emits token 0 (!) repeatedly; sampling yields random character soup. We've traced this to what looks like an unresolved ik_llama.cpp bug affecting GLM-family models on CUDA (Issue #1045), and our Blackwell (SM120) setup may be making it worse. The Q8_0 base even segfaults on CPU-only inference, which suggests the problem isn't specific to IQ quantization.
So right now the artifact is "runnable but broken" β which is frustrating given how clean the pipeline otherwise is. We're treating this as a community debugging effort: I've posted the full reproduction matrix to the upstream issue and updated the model card with a call for help. If anyone in your network has seen GLM-DSA run correctly on ik_llama.cpp CUDA, we'd love to hear from them.
On the v2 front: yes, absolutely β please do give me a heads-up when the improved healing LoRA lands. A rank-4 pass on a 754B model is indeed tight, and I'd much rather re-quant from a properly recovered base than ship a handicapped artifact. I'm committed to keeping the lineage clean, and your v2 would be the ideal starting point for a v2 GGUF release.
In the meantime, if there's anything I can offer back: my Blackwell box is sitting here largely idle while we wait for the inference bug to shake out. If you ever need compute for evaluation, benchmarking, or stress-testing a new checkpoint across multi-GPU setups, it's yours. Consider it a small down-payment on the value your base model has already provided.
Looking forward to v2 β and to the day this model actually speaks in coherent sentences. π
On the inference bug - the evidence points more toward a runtime issue than your quant, but I don't want to overstate it. Someone else got the upstream model running fine on a different setup, which suggests the base weights and the abliteration aren't the problem. The fact that even the simpler Q8_0 version crashes on the CPU is the strongest hint that such failures usually mean something below the quantisation layer, in the runtime itself. That said, it doesn't completely rule out something in your specific build path, so keeping the upstream issue thread active is still the right move.
On the compute offer - genuinely, thank you; however, I'm going to pass for now, but not because the offer isn't appreciated. v2 is in a fairly methodical phase on my end: I'm carefully and by hand doing the eval-harness work and calibration-pair sorting before the next sweep, and manually reviewing the prime pairs that come back at each rung so the refusal direction is built from clean inputs rather than a contaminated set. If a future stage genuinely calls for a GPU test on your box, I'll consider coming back to you.