BF16 has looping issues

by jmander11 - opened Feb 19

Discussion

jmander11

Feb 19

I have the standard settings applied (temp 1, top-p 0.95, min-p 0.01, repeat penalty 1) in Open WebUI running with llama.cpp.
Does this need different settings? The first few prompts can produce gibberish until the model has warmed up on something like a simple "hi", after which it stabilizes permanently. More complicated prompts get stuck on something like:

I need to output a high-level explanation.
I need to output a high-level explanation.
I need to output a high-level summary.
I need to output a high-level explanation.
I need to output a high-level explanation.
I need to output a high-level summary.

[BOS] [BOS] [BOS] [BOS] [BOS] [BOS] [BOS] [BOS] [BOS] [BOS]
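
For anyone who wants to check outputs for this kind of repetition automatically, here is a minimal sketch in Python. The helper name `looks_like_loop` and the threshold of three repeats are hypothetical choices for illustration, not part of any tool mentioned in this thread, and the check is crude: generated code can legitimately repeat short lines, so treat a hit as a hint rather than proof.

```python
from collections import Counter


def looks_like_loop(text: str, min_repeats: int = 3) -> bool:
    """Return True if any non-empty line occurs at least min_repeats times.

    Hypothetical helper: the threshold is an arbitrary illustrative choice.
    """
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    counts = Counter(lines)
    return any(n >= min_repeats for n in counts.values())


# The looping output quoted in the post above.
sample = """I need to output a high-level explanation.
I need to output a high-level explanation.
I need to output a high-level summary.
I need to output a high-level explanation.
I need to output a high-level explanation.
I need to output a high-level summary."""

print(looks_like_loop(sample))
```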

jmander11

Feb 19

Is this a problem with all 3 of these models or just the bf16?

tHe-eGoist

Feb 19

F16 and Bf16 think normally and both loop the code.
//////////////////////

wonderfuldestruction

Feb 19 (edited Feb 19)

I haven't found the model looping. Must be your sampling.

However, I ran a few tests on my personal repos and it failed a large refactor, where it kept adding more bugs, though it's OK for light work. UD-Q6_K_XL can't fully fix the problem either. LM Studio's Q6_K has performed best in my use so far.

noctrex

Owner Feb 19

Try the parameters for tool-calling so that it does not loop. I've seen that the suggested parameters for general use tend to loop:
--temp 0.7
--top-p 1.0
--min-p 0.01
--repeat-penalty 1.0
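
If you would rather set these values per request than on the server command line, the following is a minimal sketch in Python. It assumes llama-server is listening on `http://localhost:8080` and that its OpenAI-compatible chat endpoint accepts `min_p` and `repeat_penalty` as extra JSON fields; both are assumptions to verify against your build. The function names are illustrative only.

```python
import json
import urllib.request


def build_payload(prompt: str) -> dict:
    """Chat request body carrying the tool-calling sampling values listed above."""
    return {
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.7,
        "top_p": 1.0,
        "min_p": 0.01,          # assumed llama.cpp extension field
        "repeat_penalty": 1.0,  # assumed llama.cpp extension field
    }


def chat(prompt: str, base_url: str = "http://localhost:8080/v1") -> str:
    """POST the payload to the server and return the assistant's reply text."""
    request = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(build_payload(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

Calling `chat("hi")` against a running server should then return the model's reply with these sampling values applied, provided the assumptions above hold.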

noctrex

Owner Feb 19

I'm also thinking of retiring the BF16 version. It's virtually the same as F16, and unlike F16, the BF16 one does not run as fast on older cards.

jmander11

Feb 19

Bf16 should be a straight upgrade over F16 due to less clipping of values though, right? Optimizing for older hardware seems odd if F16 should suffer from these issues more theoretically.

I have used a Q5_K_XL with those standard settings I mentioned and got no looping issues. I can try the tool call tuning or more aggressive code style tuning.

Are all three of these quants based off of the Jan 21 re-release or whatever where bugs causing looping were fixed?
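
To make the range-versus-precision trade-off raised in this post concrete, here is a minimal sketch in plain Python. It emulates the two formats with the standard `struct` module (the helper names `to_f16` and `to_bf16` are illustrative only) and says nothing about why a particular GGUF loops; it only shows that F16 overflows beyond 65504 while BF16 keeps the range but has a coarser step.

```python
import struct


def to_f16(x: float) -> float:
    """Round x to IEEE half precision (F16); out-of-range values become infinity."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        return float("inf") if x > 0 else float("-inf")


def to_bf16(x: float) -> float:
    """Round x to bfloat16 by keeping the top 16 bits of its float32 encoding."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)  # round to nearest, ties to even
    bits &= 0xFFFF0000                   # drop the low 16 mantissa bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]


# Range: F16 clips a large value, BF16 keeps it (coarsely).
print(to_f16(70000.0), to_bf16(70000.0))            # inf 70144.0

# Precision: F16 resolves 1 + 2**-10, BF16 rounds it back to 1.0.
print(to_f16(1.0009765625), to_bf16(1.0009765625))  # 1.0009765625 1.0
```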

noctrex

Owner Feb 19

Yes I used the updated model for these. Seems weird that it has looping issues.
You mentioned open webui, so the model is used straight, so maybe this is the problem?
I use it mainly with llama-swap through opencode or kilo code, and haven't encountered looping issues, maybe these harnesses do something open webui does not?

tHe-eGoist

Feb 19

I am using LM Studio with the latest CUDA version.
It does not matter which settings are selected.
Temp 1.0, 0.7, 0.4.
After a few lines, the same loop always appears.

noctrex

Owner Feb 19

Is it possible for you to also test the MXFP4_MOE from unsloth? It should be the same as mine, except that my F16 version uses F16 tensors instead of Q8.
In general, this GLM release does not seem very strong to me. It has many flaws where the older Devstral succeeds.
Maybe it's very sensitive to this FP4 quantization.

noctrex

Owner Feb 19

> Bf16 should be a straight upgrade over F16 due to less clipping of values though, right? Optimizing for older hardware seems odd if F16 should suffer from these issues more theoretically.
>
> I have used a Q5_K_XL with those standard settings I mentioned and got no looping issues. I can try the tool call tuning or more aggressive code style tuning.
>
> Are all three of these quants based off of the Jan 21 re-release or whatever where bugs causing looping were fixed?

Yes it should, but many people with older cards will have speed hampered by over 20% with BF16 instead of FP16.
I am mimicking what unsloth does with their UD-Q8_K_XL quant, that uses F16 and Q8.

tHe-eGoist

Feb 19

I just tested the unsloth Mxfp4-Moe version. No errors.

noctrex

Owner Feb 19

Thank you for taking time from your precious schedule to confirm. Seems very weird.
Does LM Studio use a custom version of llama.cpp or the mainline version?
I do not use this program; I use llama-server together with llama-swap, so I am sorry that I cannot replicate your environment right now.
I will conduct some more tests to see why this would happen.

tHe-eGoist

Feb 19 (edited Feb 19)

LM studio is always a little behind.
Currently, it's llama.cpp release b8077
(commit d612901)
I just tested your Qwen3-Coder-Next-MXFP4_MOE_BF16.
No errors.
So it can't actually be the llama build?

noctrex

Owner Feb 19

I just updated the GLM-4.7-Flash-MXFP4_MOE GGUF. It is the same as the normal mainline MXFP4_MOE quant, as used by unsloth, with one critical difference: the most important tensor, output.weight, is BF16 instead of Q8, which should theoretically provide a small advantage. I will try it out later to see if it loops.

tHe-eGoist

Feb 19

Nice, I'll test it right away. Thank you for your work!

noctrex

Owner Feb 19

> Nice, I'll test it right away. Thank you for your work!

Thank you for reporting the issue and testing!

Shuasimodo

Feb 19

Sorry to jump in here. I have also been having issues with this model - all quants and all of these versions. I don't know if it's a quantization issue, or a sampling issue, or related to the model.
I couldn't fully get this glm flash model to work on the latest llama.cpp build properly.

But I wanted to add that I haven't been able to get anybody's quant of this model to work either.
Ollama's is good, it works actually, but I have been having similar issues with all other variations.

Using llama-server+openwebui
Also llama-server+opencode
Also llama-server+continue.dev

(I've tried all the recommended settings)
Setting really high context (eg 131072 +) seems to make an improvement.
Setting super low temps like 0.1 or 0.25 seems to make a bit of an improvement.
But overall has not been very usable so I gave up the other day (yesterday? maybe yesterday)

I've tried unsloth, ggml, ubergarm, noctrex, and ultimately for my small model I switched back to gpt oss 20b.
Too many errors, too much confidence in incorrect responses, outputs were all messed up in opencode, I fixed the looping issues with repeat penalty but it still wasn't as good. Can't figure out what exactly the issue is either.

On the other hand, the qwen3-coder-next model has been amazing and has really taken to the increased precision well. Improvements in the output, accuracy, and overall quality have been exceptional.
Right now working on testing the exact same environment, seed and prompt to compare the f16 version to the bf16 version because I'm getting a sense that the bf16 version is generating a higher quality output. In the test I did in the other thread, almost all of them were with the bf16 version.

Anyway, my whole point is that either:
Different models are responding differently to the increased precision (seems unlikely)
or maybe there's something missing in the template, or the model configuration causing these errors.

Also there's a discussion about something similar https://huggingface.co/zai-org/GLM-4.7-Flash/discussions/66

noctrex

Owner Feb 19 (edited Feb 19)

hmmm let me post my llama-server config so that you can compare:

C:/Programs/AI/llamacpp-rocm/llama-server.exe
--n-gpu-layers 999
--metrics
--jinja
--batch-size 16384
--ubatch-size 1024
--port ${PORT}
--slot-prompt-similarity 0.2
--ctx-checkpoints 128
--spec-type ngram-map-k
--draft-max 48
--ctx-size 65536
--cache-type-k q8_0
--cache-type-v q8_0
--model G:/Models/GLM-4.7-Flash-MXFP4_MOE.gguf
--temp 0.7
--top-p 1.0
--min-p 0.01
--repeat-penalty 1.0

Shuasimodo

Feb 19 (edited Feb 19)

oh... lol, there's nothing to compare.

I was just using:
ctx-size 131072
jinja
temp 0.7
top-p 1.0
gpu-layers 99

I will try your config settings in a bit and report back. thanks for sharing.
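
For readers who want the comparison spelled out, here is a small Python sketch that diffs the two flag sets posted above. The values are copied from the two posts; `--port` and `--model` are left out because they are machine-specific, and `gpu-layers` is written as `n-gpu-layers` on the assumption that the two names refer to the same llama-server option. The helper name `diff_flags` is illustrative only.

```python
# Flags from noctrex's post (bare switches are stored as True).
noctrex = {
    "n-gpu-layers": "999", "metrics": True, "jinja": True,
    "batch-size": "16384", "ubatch-size": "1024",
    "slot-prompt-similarity": "0.2", "ctx-checkpoints": "128",
    "spec-type": "ngram-map-k", "draft-max": "48", "ctx-size": "65536",
    "cache-type-k": "q8_0", "cache-type-v": "q8_0",
    "temp": "0.7", "top-p": "1.0", "min-p": "0.01", "repeat-penalty": "1.0",
}

# Flags from Shuasimodo's post.
shuasimodo = {
    "ctx-size": "131072", "jinja": True,
    "temp": "0.7", "top-p": "1.0", "n-gpu-layers": "99",
}


def diff_flags(a: dict, b: dict):
    """Return (flags only in a, flags present in both but with different values)."""
    missing = sorted(k for k in a if k not in b)
    changed = {k: (a[k], b[k]) for k in a if k in b and a[k] != b[k]}
    return missing, changed


missing, changed = diff_flags(noctrex, shuasimodo)
print("only in noctrex's config:", ", ".join(missing))
for name, (left, right) in sorted(changed.items()):
    print(f"{name}: {left} vs {right}")
```

The output lists eleven flags that appear only in noctrex's config (including `ctx-checkpoints`, `min-p`, and the KV-cache types) and two that differ in value (`ctx-size` and `n-gpu-layers`).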

tHe-eGoist

Feb 19

The new uploads
GLM 4.7 Flash MXFP4 MoE -Bf16: the same loop mistake.
GLM-4.7-Flash-MXFP4_MOE-GGUF: perfect, no loops

noctrex

Owner Feb 19

> The new uploads
> GLM 4.7 Flash MXFP4 MoE -Bf16: the same loop mistake.
> GLM-4.7-Flash-MXFP4_MOE-GGUF: perfect, no loops

I find this very peculiar, as the BF16 version essentially has the original unquantized tensors in the GGUF; they are not quantized at all.
Very interesting finding, thank you for your time testing it!

Shuasimodo

Feb 19

Just a bit of feedback, testing the f16 version.

People say it goes off the rails with just "hey", so i started there.
Also tested with asking it to make pacman in javascript, game wasn't playable, but the model didn't go off the rails or anything during generation.

Params:

[glm-4.7-flash-mxfp4.gguf]
model = /gguf_files/GLM-4.7-Flash-MXFP4_MOE_F16.gguf
n-gpu-layers = 99
n-cpu-moe = 48
ctx-size = 32768
batch-size = 16384
ubatch-size = 1024
slot-prompt-similarity = 0.2
ctx-checkpoints = 128
spec-type = ngram-map-k
draft-max = 48
cache-type-k = q8_0
cache-type-v = q8_0
repeat-penalty = 1.0
jinja = true
temp = 0.7
top-p = 1.0
flash-attn = true
sleep-idle-seconds = 3600

Don't know what half of these do, but they make a huge difference in the model's output.

No issues with greeting.

Kackliqur

Feb 21 (edited Feb 21)

I experienced this issue also. It looped, and usually pretty quickly. The non-BF16 model didn't do this. I used the settings supplied in this thread, on Strix Halo.

noctrex

Owner Feb 21 (edited Feb 21)

> I experienced this issue also. It looped and usually pretty quickly. The non-BF16 model didn't do this. I used the setting supplied in this thread. On strix halo.

What inference software and/or harness did you use it with?

noctrex

Owner Feb 21

If you are using llama-server, try to add the option --ctx-checkpoints 128

Kackliqur

Feb 21 (edited Feb 23)

I run locally on a Strix Halo machine, with Open WebUI, the Lemonade desktop app, or llama.cpp directly. The unsloth MXFP4 runs OK in this setup.

Open WebUI has a nice python interpreter and live HTML rendering. I tested it using this and had looping. Looping does not happen with normal (unsloth) model or non BF16 from you.

noctrex

Owner Feb 21

Do you experience the same looping issue with the normal MXFP4 version from my repo?
I personally do not have any issues with this model, but I'm running it only for coding related tasks through opencode and kilo code.
It should be said that GLM-4.7-Flash is primarily a coding model, so it should be used in harnesses with coding in mind.
OpenWebUI is a general chat interface with no coding harness.

noctrex

Owner Feb 22

Given that users are getting looping issues, I have decided to retire this F16/BF16 variant. Please download the normal one.
