Quick update: I've fixed an issue where the chat template wasn't included in the quants. The first shard of each quant has been updated to include the chat template, so please re-download the first shard to pick up the fix. Sorry for the inconvenience.
This is a "derestricted" abliteration of GLM-4.6, using Jim Lai's norm-preserving biprojected abliteration technique. For more information, you can read his blog post here
Essentially, I was going for a lighter abliteration. This doesn't mean the model is 100% zero-shot unrestricted. It should be more "permissive" than normal GLM-4.6, but it probably still requires a system prompt to nudge it in the right direction. I've mainly tested this model on creative writing. Compared to base GLM-4.6, its sentence structure feels more varied and organic. It does not particularly reduce or alter "slop", since this isn't a finetune, but there's much less of an "assistant" voice performing soft-censorship in particular scenarios, and the output feels less like "LLM writing". I've only done some light technical assistant work with it. It still feels competent there, but I haven't exhaustively benchmarked it.
Visualized here is the analysis of the refusal direction:

Provided in this repository are several quants I've produced from the abliteration I performed, along with the config I used and the measurements, in case you want to produce your own abliteration. I chose to ablate layers 30-45, using the measurement from layer 37 because that is where the SNR peaks. Other measurements I tried showed an interesting dual-peak phenomenon with a second peak forming around layer 46, but their overall SNR magnitude was only ~0.16, compared to the much better 0.25 peak present here.
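To make "ablating a direction" a little more concrete, here is a minimal sketch of the simplest form of the idea: project a vector off a measured direction, then rescale it back to its original norm. This is a generic illustration only and is not the norm-preserving biprojected technique itself; the function name `ablate_direction` and the toy vectors are hypothetical, and only the layer numbers come from the text above.

```python
import math

# Layer choices quoted from the description above.
ABLATED_LAYERS = range(30, 46)  # layers 30-45 inclusive
MEASUREMENT_LAYER = 37          # layer whose measurement was used


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def norm(a):
    """Euclidean norm of a vector."""
    return math.sqrt(dot(a, a))


def ablate_direction(vec, direction):
    """Remove the component of `vec` along `direction`, then rescale
    the result so it keeps the original norm of `vec`.

    Hypothetical helper: a plain projection, NOT the biprojected method.
    """
    d_norm = norm(direction)
    unit = [x / d_norm for x in direction]
    proj = dot(vec, unit)
    residual = [v - proj * u for v, u in zip(vec, unit)]
    r_norm = norm(residual)
    if r_norm == 0.0:
        # vec was entirely along the direction; nothing left to rescale.
        return residual
    scale = norm(vec) / r_norm
    return [x * scale for x in residual]


# Toy example: strip the x-axis component from a 3D vector.
ablated = ablate_direction([3.0, 4.0, 0.0], [1.0, 0.0, 0.0])
```

In a real abliteration the direction would come from the per-layer measurements and be applied to weight matrices in the chosen layers, which this toy example does not attempt.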
If you want to abliterate GLM-4.6 yourself, you will need to download the safetensors for the model and use this PR.
For quants, I've provided a Q8_0 as well as others that follow the MoE quantization schema that I've been using. The idea is that, given the huge size of the FFN tensors compared to the rest of the tensors in the model, it should be possible to achieve better quality while keeping the overall size of the model smaller than a similar naive quantization.
The naming convention is as follows: [Default Type]-[FFN_UP]-[FFN_GATE]-[FFN_DOWN], e.g. Q8_0-Q4_K-Q4_K-Q5_K. This means:
- Q8_0 is the default type (attention, shared expert, etc.)
- Q4_K was used for the FFN_UP and FFN_GATE conditional expert tensors
- Q5_K was used for the FFN_DOWN conditional expert tensors
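The naming convention above can be sketched as a small parser. This is only an illustration: `QuantRecipe` and `parse_quant_name` are hypothetical names, and treating a single-type name such as `IQ4_NL` as one type applied to every field is a simplifying assumption, not a statement about how that quant was actually produced.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QuantRecipe:
    default: str   # attention, shared expert, etc.
    ffn_up: str    # conditional expert FFN_UP tensors
    ffn_gate: str  # conditional expert FFN_GATE tensors
    ffn_down: str  # conditional expert FFN_DOWN tensors


def parse_quant_name(name: str) -> QuantRecipe:
    """Split a [Default]-[FFN_UP]-[FFN_GATE]-[FFN_DOWN] name into fields.

    Type names use underscores (Q4_K, IQ3_S), so "-" is a safe separator.
    """
    parts = name.split("-")
    if len(parts) == 1:
        # Assumption: a single-type name is reported for all four fields.
        return QuantRecipe(parts[0], parts[0], parts[0], parts[0])
    if len(parts) == 4:
        return QuantRecipe(*parts)
    raise ValueError(f"unexpected quant name: {name!r}")


recipe = parse_quant_name("Q8_0-Q4_K-Q4_K-Q5_K")
print(recipe.default, recipe.ffn_up, recipe.ffn_gate, recipe.ffn_down)
```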
| Quant | Size | PPL | KLD |
|---|---|---|---|
| Q8_0 | 353.26 GiB (8.51 BPW) | 8.4801 ± 0.15099 | 0 |
| Q8_0-Q5_K-Q5_K-Q6_K | 248.61 GiB (5.99 BPW) | 8.4881 ± 0.15112 | 0.009449 ± 0.000677 |
| Q8_0-Q4_K-Q4_K-Q5_K | 208.24 GiB (5.01 BPW) | 8.5182 ± 0.15172 | 0.016299 ± 0.000839 |
| IQ4_NL | 187.40 GiB (4.51 BPW) | 8.6026 ± 0.15331 | 0.029524 ± 0.000858 |
| Q8_0-IQ3_S-IQ3_S-IQ4_XS | 163.74 GiB (3.94 BPW) | 8.7101 ± 0.15534 | 0.041096 ± 0.001202 |
| Q6_K-IQ2_XS-IQ2_XS-IQ3_S | 119.79 GiB (2.88 BPW) | 9.3447 ± 0.16732 | 0.131974 ± 0.002384 |
| Q5_K-IQ2_XXS-IQ2_XXS-IQ3_XXS | 106.50 GiB (2.56 BPW) | 9.5127 ± 0.17040 | 0.174152 ± 0.002976 |
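To compare rows, it can help to turn the table into relative numbers. The sketch below copies the size, BPW, and PPL mean values from the table and computes each quant's size as a fraction of Q8_0 and its PPL increase over Q8_0. It also back-computes a parameter count from size and BPW as a sanity check, assuming BPW means total bits divided by total weights. The helper names are hypothetical.

```python
# (name, size in GiB, bits per weight, mean PPL), copied from the table.
ROWS = [
    ("Q8_0", 353.26, 8.51, 8.4801),
    ("Q8_0-Q5_K-Q5_K-Q6_K", 248.61, 5.99, 8.4881),
    ("Q8_0-Q4_K-Q4_K-Q5_K", 208.24, 5.01, 8.5182),
    ("IQ4_NL", 187.40, 4.51, 8.6026),
    ("Q8_0-IQ3_S-IQ3_S-IQ4_XS", 163.74, 3.94, 8.7101),
    ("Q6_K-IQ2_XS-IQ2_XS-IQ3_S", 119.79, 2.88, 9.3447),
    ("Q5_K-IQ2_XXS-IQ2_XXS-IQ3_XXS", 106.50, 2.56, 9.5127),
]

BASE_SIZE, BASE_PPL = ROWS[0][1], ROWS[0][3]


def implied_params_billion(size_gib: float, bpw: float) -> float:
    """Weights implied by file size and BPW, in billions (assumed definition)."""
    return size_gib * 2**30 * 8 / bpw / 1e9


def ppl_increase_pct(ppl: float, baseline: float = BASE_PPL) -> float:
    """Perplexity increase relative to the Q8_0 baseline, in percent."""
    return (ppl - baseline) / baseline * 100


for name, size, bpw, ppl in ROWS:
    print(
        f"{name:30s} size={size / BASE_SIZE:6.1%} of Q8_0  "
        f"PPL +{ppl_increase_pct(ppl):5.2f}%  "
        f"~{implied_params_billion(size, bpw):.0f}B weights"
    )
```

Note that the PPL error bars in the table are larger than the differences between the top few rows, so the KLD column is the better signal for separating them.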
