Thanks

by helixdouble - opened about 1 month ago

Discussion

helixdouble

about 1 month ago

Thank you for this - and for the careful attribution; it's genuinely appreciated. The dynamic-IQ3 strategy is well thought out; keeping attention and the DSA indexer at q8_0 while compressing the middle routed experts hardest is exactly the right MoE-aware approach for this architecture, and the benchmark numbers are useful data I didn't have. Glad people can actually run it now.

One thing worth sharing in the interest of honesty: v1 has a known limitation I'm working on. The healing LoRA was rank-4, which saturated too early for a 754B model, so there's measurable capability loss versus base that a properly sized healing pass should recover. A v2 is planned. If you're interested, I'd be glad to give you a heads-up when it lands so you can re-quant from a better base - your pipeline clearly produces a clean artifact and it'd be good to keep the lineage going.
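
For readers wondering why rank matters here, a rough sketch of LoRA adapter capacity may help. It only counts trainable parameters per adapted weight matrix using the standard LoRA factorisation (two low-rank factors of shapes `rank x d_in` and `d_out x rank`). The 4096 x 4096 matrix size and the function names are hypothetical illustrations, not the model's real shapes or the actual healing setup, and the numbers say nothing about where saturation sets in.

```python
def lora_trainable_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters LoRA adds to one d_out x d_in weight matrix.

    Standard LoRA learns A (rank x d_in) and B (d_out x rank).
    """
    return rank * (d_in + d_out)


def lora_fraction(d_in: int, d_out: int, rank: int) -> float:
    """Adapter parameters as a fraction of the full matrix it adapts."""
    return lora_trainable_params(d_in, d_out, rank) / (d_in * d_out)


if __name__ == "__main__":
    # Hypothetical matrix size, chosen only for illustration.
    d_in = d_out = 4096
    for rank in (4, 16, 64):
        params = lora_trainable_params(d_in, d_out, rank)
        share = lora_fraction(d_in, d_out, rank)
        print(f"rank={rank:>2}: {params:>7} params ({share:.3%} of the matrix)")
```

Under these assumed shapes, a rank-4 adapter trains about 0.2% as many parameters as the matrix it sits on, and capacity grows linearly with rank.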

sakamakismile

Owner about 1 month ago

Thank you so much for the kind words and for sharing those insights — it genuinely means a lot coming from you. Your work on the abliteration and the FP8 base is what made this quantization possible in the first place, so the credit really belongs upstream.

Your dynamic-IQ3 feedback is encouraging; the per-tensor strategy was designed specifically around the MoE architecture you described, and I'm glad the benchmark data is useful for your own work too.

I do need to be fully transparent about the current state, though — and I'd value your eyes on this if you have any intuition. Despite the smoke-test passing on throughput (~40-51 tok/s across 6× RTX PRO 6000 Blackwell), the model currently produces garbage output on every CUDA configuration we've tested. Greedy decoding emits token 0 (!) repeatedly; sampling yields random character soup. We've traced this to what looks like an unresolved ik_llama.cpp bug affecting GLM-family models on CUDA (Issue #1045), and our Blackwell (SM120) setup may be making it worse. The Q8_0 base even segfaults on CPU-only inference, which suggests the problem isn't specific to IQ quantization.
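
For anyone helping to reproduce this, a minimal sketch of a check for the "same token over and over" failure mode follows. It assumes you can get the generated token IDs as a plain list of integers from whatever runtime you are testing. The function name, the 0.9 threshold, and the choice to treat an empty generation as a failure are all hypothetical; none of this is part of ik_llama.cpp or llama.cpp.

```python
from collections import Counter
from typing import Sequence


def looks_degenerate(token_ids: Sequence[int], max_share: float = 0.9) -> bool:
    """Return True if a single token id dominates the generated sequence.

    max_share is an assumed threshold: the fraction of the output that one
    token id may occupy before the run is flagged as broken.
    """
    if not token_ids:
        # Assumption: an empty generation is also treated as a failure.
        return True
    _, count = Counter(token_ids).most_common(1)[0]
    return count / len(token_ids) >= max_share


if __name__ == "__main__":
    print(looks_degenerate([0] * 64))          # greedy run stuck on token 0
    print(looks_degenerate(list(range(64))))   # varied output
```

This only catches the repeated-token case from greedy decoding. The "random character soup" seen with sampling would need a different check, such as comparing against a known-good reference output.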

So right now the artifact is "runnable but broken" — which is frustrating given how clean the pipeline otherwise is. We're treating this as a community debugging effort: I've posted the full reproduction matrix to the upstream issue and updated the model card with a call for help. If anyone in your network has seen GLM-DSA run correctly on ik_llama.cpp CUDA, we'd love to hear from them.

On the v2 front: yes, absolutely — please do give me a heads-up when the improved healing LoRA lands. A rank-4 pass on a 754B model is indeed tight, and I'd much rather re-quant from a properly recovered base than ship a handicapped artifact. I'm committed to keeping the lineage clean, and your v2 would be the ideal starting point for a v2 GGUF release.

In the meantime, if there's anything I can offer back: my Blackwell box is sitting here largely idle while we wait for the inference bug to shake out. If you ever need compute for evaluation, benchmarking, or stress-testing a new checkpoint across multi-GPU setups, it's yours. Consider it a small down-payment on the value your base model has already provided.

Looking forward to v2 — and to the day this model actually speaks in coherent sentences. 🙏

helixdouble

28 days ago

On the inference bug - the evidence points more toward a runtime issue than your quant, but I don't want to overstate it. Someone else got the upstream model running fine on a different setup, which suggests the base weights and the abliteration aren't the problem. The strongest hint is that even the simpler Q8_0 version crashes on CPU: failures like that usually point to something below the quantisation layer, in the runtime itself. That said, it doesn't completely rule out something in your specific build path, so keeping the upstream issue thread active is still the right move.

On the compute offer - genuinely, thank you. I'm going to pass for now, though not because the offer isn't appreciated. v2 is in a fairly methodical phase on my end: I'm doing the eval-harness work and calibration-pair sorting carefully and by hand before the next sweep, and manually reviewing the prime pairs that come back at each rung so the refusal direction is built from clean inputs rather than a contaminated set. If a future stage genuinely calls for a GPU test on your box, I'll consider coming back to you.
