Instructions to use AesSedai/GLM-4.6-Derestricted-GGUF with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use AesSedai/GLM-4.6-Derestricted-GGUF with llama-cpp-python:

# !pip install llama-cpp-python

from llama_cpp import Llama

llm = Llama.from_pretrained(
	repo_id="AesSedai/GLM-4.6-Derestricted-GGUF",
	filename="GLM-4.6-Derestricted-IQ4_NL/GLM-4.6-Derestricted-IQ4_NL-00001-of-00005.gguf",
)

llm.create_chat_completion(
	messages = "No input example has been defined for this model task."
)

Notebooks
Google Colab
Kaggle
Local Apps

llama.cpp

How to use AesSedai/GLM-4.6-Derestricted-GGUF with llama.cpp:

Install from brew

brew install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf AesSedai/GLM-4.6-Derestricted-GGUF:IQ4_NL
# Run inference directly in the terminal:
llama-cli -hf AesSedai/GLM-4.6-Derestricted-GGUF:IQ4_NL

Install from WinGet (Windows)

winget install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf AesSedai/GLM-4.6-Derestricted-GGUF:IQ4_NL
# Run inference directly in the terminal:
llama-cli -hf AesSedai/GLM-4.6-Derestricted-GGUF:IQ4_NL

Use pre-built binary

# Download pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases
# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf AesSedai/GLM-4.6-Derestricted-GGUF:IQ4_NL
# Run inference directly in the terminal:
./llama-cli -hf AesSedai/GLM-4.6-Derestricted-GGUF:IQ4_NL

Build from source code

git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli
# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf AesSedai/GLM-4.6-Derestricted-GGUF:IQ4_NL
# Run inference directly in the terminal:
./build/bin/llama-cli -hf AesSedai/GLM-4.6-Derestricted-GGUF:IQ4_NL

Use Docker

docker model run hf.co/AesSedai/GLM-4.6-Derestricted-GGUF:IQ4_NL

LM Studio
Jan
Ollama
How to use AesSedai/GLM-4.6-Derestricted-GGUF with Ollama:
```
ollama run hf.co/AesSedai/GLM-4.6-Derestricted-GGUF:IQ4_NL
```

Unsloth Studio new

How to use AesSedai/GLM-4.6-Derestricted-GGUF with Unsloth Studio:

Install Unsloth Studio (macOS, Linux, WSL)

curl -fsSL https://unsloth.ai/install.sh | sh
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for AesSedai/GLM-4.6-Derestricted-GGUF to start chatting

Install Unsloth Studio (Windows)

irm https://unsloth.ai/install.ps1 | iex
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for AesSedai/GLM-4.6-Derestricted-GGUF to start chatting

Using HuggingFace Spaces for Unsloth

# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for AesSedai/GLM-4.6-Derestricted-GGUF to start chatting

Pi new

How to use AesSedai/GLM-4.6-Derestricted-GGUF with Pi:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama-server -hf AesSedai/GLM-4.6-Derestricted-GGUF:IQ4_NL

Configure the model in Pi

# Install Pi:
npm install -g @mariozechner/pi-coding-agent
# Add to ~/.pi/agent/models.json:
{
  "providers": {
    "llama-cpp": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        {
          "id": "AesSedai/GLM-4.6-Derestricted-GGUF:IQ4_NL"
        }
      ]
    }
  }
}

Run Pi

# Start Pi in your project directory:
pi

Hermes Agent new

How to use AesSedai/GLM-4.6-Derestricted-GGUF with Hermes Agent:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama-server -hf AesSedai/GLM-4.6-Derestricted-GGUF:IQ4_NL

Configure Hermes

# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup
# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default AesSedai/GLM-4.6-Derestricted-GGUF:IQ4_NL

Run Hermes

hermes

Docker Model Runner
How to use AesSedai/GLM-4.6-Derestricted-GGUF with Docker Model Runner:
```
docker model run hf.co/AesSedai/GLM-4.6-Derestricted-GGUF:IQ4_NL
```

Lemonade

How to use AesSedai/GLM-4.6-Derestricted-GGUF with Lemonade:

Pull the model

# Download Lemonade from https://lemonade-server.ai/
lemonade pull AesSedai/GLM-4.6-Derestricted-GGUF:IQ4_NL

Run and chat with the model

lemonade run user.GLM-4.6-Derestricted-GGUF-IQ4_NL

List all available models

lemonade list

Quick update: I've fixed an issue where the chat template wasn't inluded in the quants, the first shard of each quant has been updated to include the chat template. Please re-download the first shard to pick up the fix, sorry for the inconvenience.

This is a "derestricted" abliteration of GLM-4.6, using Jim Lai's norm-preserving biprojected abliteration technique. For more information, you can read his blog post here

Essentially, I was going for a lighter abliteration. This doesn't mean the model is 100% zero-shot unrestricted. It should be more "permissive" than normal GLM-4.6, but probably still requires a system prompt to nudge it in the right direction. From my own testing, I've mainly used this model for creative writing. I've noticed a positive change compared to how base GLM-4.6 does sentence structure and this feels more varied and organic. It does not particularly reduce or alter "slop", since this isn't a finetune, but there's much less of an "assistant" voice performing soft-censorship during particular scenarios and it feels less like "LLM writing". I've only done some light technical assistant work and it still feels competent there, but I haven't exhaustively benched it.

Visualized here is the analysis of the refusal direction:

Provided in this repository are several quants I've produced from the abliteration I performed, as well as the measurements to produce your own abliteration if you want and the config that I used. I chose to ablate layers 30-45, using the measurement from layer 37 due to the SNR peak. Other measurements I tried showed an interesting dual-peak phenomenon with a second peak forming around layer 46, but the overall SNR magitude was only ~0.16 or so compared to the much better 0.25 peak present here.

If you want to abliterate GLM-4.6 yourself, you will need to download the safetensors for the model and use this PR.

For quants, I've provided a Q8_0 as well as others that follow the MoE quantization schema that I've been using. The idea being that given the huge size of the FFN tensors compared to the rest of the tensors in the model, it should be possible to achieve a better quality while keeping the overall size of the entire model smaller compared to a similar naive quantization.

The naming convention is as follows: [Default Type]-[FFN_UP]-[FFN_GATE]-[FFN_DOWN], eg: Q8_0-Q4_K-Q4_K-Q5_K. This means:

Q8_0 is the default type (attention, shared expert, etc.)
Q4_K was used for the FFN_UP and FFN_GATE conditional expert tensors
Q5_K was used for the FFN_DOWN conditional expert tensors

Quant	Size	PPL	KLD
Q8_0	353.26 GiB (8.51 BPW)	8.4801 ± 0.15099	0
Q8_0-Q5_K-Q5_K-Q6_K	248.61 GiB (5.99 BPW)	8.4881 ± 0.15112	0.009449 ± 0.000677
Q8_0-Q4_K-Q4_K-Q5_K	208.24 GiB (5.01 BPW)	8.5182 ± 0.15172	0.016299 ± 0.000839
IQ4_NL	187.40 GiB (4.51 BPW)	8.6026 ± 0.15331	0.029524 ± 0.000858
Q8_0-IQ3_S-IQ3_S-IQ4_XS	163.74 GiB (3.94 BPW)	8.7101 ± 0.15534	0.041096 ± 0.001202
Q6_K-IQ2_XS-IQ2_XS-IQ3_S	119.79 GiB (2.88 BPW)	9.3447 ± 0.16732	0.131974 ± 0.002384
Q5_K-IQ2_XXS-IQ2_XXS-IQ3_XXS	106.50 GiB (2.56 BPW)	9.5127 ± 0.17040	0.174152 ± 0.002976