BF16 has looping issues
I have the standard settings (temp 1, top-p 0.95, min-p 0.01, repeat penalty 1) applied in Open WebUI running with llama.cpp.
Does this need different settings? The first few prompts, even something as simple as "hi", can produce gibberish until the model is warmed up, after which it stabilizes permanently. More complicated prompts get stuck on something like:
I need to output a high-level explanation.
I need to output a high-level explanation.
I need to output a high-level summary.
I need to output a high-level explanation.
I need to output a high-level explanation.
I need to output a high-level summary.
or
[BOS] [BOS] [BOS] [BOS] [BOS] [BOS] [BOS] [BOS] [BOS] [BOS]
Is this a problem with all 3 of these models or just the bf16?
F16 and BF16 both think normally, and both loop on the code.
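For anyone trying to reproduce this, here is a minimal sketch of how loops like the two above could be flagged automatically in captured output. The helper name `find_loop` and its thresholds are hypothetical and not part of llama.cpp or Open WebUI; it only checks whether the tail of the output is a short block repeated back to back.

```python
def find_loop(items, max_period=4, min_repeats=2):
    """Return (period, repeats) if the sequence ends in a repeated block.

    `items` can be lines (text.splitlines()) or whitespace-separated
    tokens (text.split()). Returns None when no loop is found.
    """
    for period in range(1, max_period + 1):
        if len(items) < period * min_repeats:
            continue
        block = items[-period:]
        repeats = 1
        pos = len(items) - 2 * period
        # Walk backwards while the preceding block matches the last one.
        while pos >= 0 and items[pos:pos + period] == block:
            repeats += 1
            pos -= period
        if repeats >= min_repeats:
            return period, repeats
    return None


# The two failure modes quoted in this thread.
explanation_loop = "\n".join([
    "I need to output a high-level explanation.",
    "I need to output a high-level explanation.",
    "I need to output a high-level summary.",
] * 2)
bos_loop = " ".join(["[BOS]"] * 10)

print(find_loop(explanation_loop.splitlines()))
print(find_loop(bos_loop.split()))
```

Running the same prompt against each quant and logging the result of a check like this would make the BF16 versus non-BF16 comparison less dependent on eyeballing the output.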
I haven't found the model looping. Must be your sampling.
However, I ran a few tests on my personal repos, and it failed a large refactor where it kept adding more bugs, though it's OK for light work. UD-Q6_K_XL can't fully fix the problem either. LM Studio's Q6_K has performed best in my use so far.
Try using the parameters for tool calling so that it does not loop. I've seen that the suggested parameters for general use tend to loop:
--temp 0.7
--top-p 1.0
--min-p 0.01
--repeat-penalty 1.0
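If you would rather set these per request than on the command line, something like the following should work. This is only a sketch: it assumes llama-server is listening on http://localhost:8080 and that its OpenAI-compatible chat endpoint accepts the llama.cpp-specific `min_p` and `repeat_penalty` fields in the request body. The function names are made up for the example.

```python
import json
import urllib.request

# Assumed address of a local llama-server instance; adjust to your setup.
BASE_URL = "http://localhost:8080/v1"


def build_chat_request(prompt, temp=0.7, top_p=1.0, min_p=0.01, repeat_penalty=1.0):
    """Build a chat completion body using the tool-calling sampling values."""
    return {
        "messages": [{"role": "user", "content": prompt}],
        "temperature": temp,
        "top_p": top_p,
        # Assumption: these two are llama.cpp extensions, not OpenAI fields.
        "min_p": min_p,
        "repeat_penalty": repeat_penalty,
    }


def send_chat_request(payload, base_url=BASE_URL):
    """POST the payload to the server and return the decoded JSON reply."""
    request = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


payload = build_chat_request("hi")
print(json.dumps(payload, indent=2))
# reply = send_chat_request(payload)  # needs a running server
```

Note that a frontend such as Open WebUI may send its own sampling values with each request, in which case the server-side defaults would not be what the model actually runs with.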
Also, I'm thinking of retiring the BF16 version. It's virtually the same as F16, and unlike F16, the BF16 one does not run as fast on older cards.
BF16 should be a straight upgrade over F16 due to less clipping of values though, right? Optimizing for older hardware seems odd if F16 should theoretically suffer from these issues more.
I have used a Q5_K_XL with those standard settings I mentioned and got no looping issues. I can try the tool call tuning or more aggressive code style tuning.
Are all three of these quants based off of the Jan 21 re-release or whatever where bugs causing looping were fixed?
Yes I used the updated model for these. Seems weird that it has looping issues.
You mentioned Open WebUI, where the model is used directly, so maybe that is the problem?
I use it mainly with llama-swap through opencode or kilo code and haven't encountered looping issues. Maybe these harnesses do something Open WebUI does not?
I am using LM Studio with the latest CUDA version.
It does not matter which settings are selected.
Temp 1.0, 0.7, 0.4.
After a few lines, the same loop always appears.
Is it possible for you to also test the MXFP4_MOE from unsloth? It should be the same, except that my F16 version uses F16 tensors instead of Q8.
This GLM release does not seem very strong to me in general. It has many flaws where the older Devstral succeeds.
Maybe it's very sensitive to this FP4 quantization.
Yes, BF16 should be a straight upgrade over F16, but many people with older cards will have their speed hampered by over 20% with BF16 instead of FP16.
I am mimicking what unsloth does with their UD-Q8_K_XL quant, which uses F16 and Q8.
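To make the BF16 versus F16 trade-off concrete, here is a small sketch using only the Python standard library. F16 is rounded with the `struct` module's half-precision format, and BF16 is emulated by rounding a float32 bit pattern to its upper 16 bits. The helper names are hypothetical, and this only illustrates the number formats, not what llama.cpp does internally.

```python
import struct


def round_to_f16(x):
    """Round a Python float to IEEE half precision (F16) and back."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        # Magnitude exceeds the F16 range (largest finite value is 65504).
        return float("inf") if x > 0 else float("-inf")


def round_to_bf16(x):
    """Round a Python float to bfloat16 (BF16) and back.

    BF16 is the upper 16 bits of a float32, so this rounds the float32
    bit pattern to nearest-even at bit 16 and zeroes the lower half.
    NaN and values near the float32 limit are not handled specially.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)
    bits &= 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]


for value in (70000.0, 1.001):
    print(value, "F16:", round_to_f16(value), "BF16:", round_to_bf16(value))
```

The large value overflows in F16 but survives in BF16 (as 70144.0), which is the "less clipping" point. The value near 1 keeps more digits in F16 (1.0009765625) than in BF16 (1.0), because BF16 trades mantissa bits for exponent range.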
I just tested the unsloth Mxfp4-Moe version. No errors.
Thank you for taking time from your precious schedule to confirm. Seems very weird.
Does LM Studio use a custom version of llama.cpp or the mainline version?
I do not use this program. I use llama-server together with llama-swap, so I am sorry that I cannot replicate your environment right now.
I will conduct some more tests to see why this would happen.
LM studio is always a little behind.
Currently, it's llama.cpp release b8077
(commit d612901)
I just tested your Qwen3-Coder-Next-MXFP4_MOE_BF16.
No errors.
So it can't actually be the llama build?
So I just updated the GLM-4.7-Flash-MXFP4_MOE GGUF. It is the same as the normal mainline MXFP4_MOE quant, as used by unsloth, with one critical difference: the most important tensor, output.weight, is BF16 instead of Q8, so this should theoretically provide a small advantage. I will try it out later to see if it loops.
Nice, I'll test it right away. Thank you for your work!
Thank you for reporting the issue and testing!
Sorry to jump in here. I have also been having issues with this model - all quants and all of these versions. I don't know if it's a quantization issue, or a sampling issue, or related to the model.
I couldn't get this GLM Flash model to work properly on the latest llama.cpp build.
But I wanted to add that I haven't been able to get anybody else's quant of this model to work either.
Ollama's is good and actually works, but I have been having similar issues with all other variations.
Using llama-server+openwebui
Also llama-server+opencode
Also llama-server+continue.dev
(I've tried all the recommended settings)
Setting a really high context (e.g. 131072+) seems to make an improvement.
Setting super low temps like 0.1 or 0.25 seems to make a bit of an improvement.
But overall it has not been very usable, so I gave up the other day (yesterday? maybe yesterday).
I've tried unsloth, ggml, ubergarm, noctrex, and ultimately for my small model I switched back to gpt oss 20b.
Too many errors, too much confidence in incorrect responses, and the outputs were all messed up in opencode. I fixed the looping issues with repeat penalty, but it still wasn't as good. I can't figure out what exactly the issue is either.
On the other hand, the qwen3-coder-next model has been amazing and has really taken to the increased precision well. Improvements in the output, accuracy, and overall quality have been exceptional.
Right now I'm working on testing with the exact same environment, seed, and prompt to compare the F16 version to the BF16 version, because I get the sense that the BF16 version generates higher-quality output. Almost all of the tests I did in the other thread were with the BF16 version.
Anyway, my whole point is that either:
different models are responding differently to the increased precision (seems unlikely),
or maybe there's something missing in the template or the model configuration that is causing these errors.
Also there's a discussion about something similar https://huggingface.co/zai-org/GLM-4.7-Flash/discussions/66
hmmm let me post my llama-server config so that you can compare:
C:/Programs/AI/llamacpp-rocm/llama-server.exe
--n-gpu-layers 999
--metrics
--jinja
--batch-size 16384
--ubatch-size 1024
--port ${PORT}
--slot-prompt-similarity 0.2
--ctx-checkpoints 128
--spec-type ngram-map-k
--draft-max 48
--ctx-size 65536
--cache-type-k q8_0
--cache-type-v q8_0
--model G:/Models/GLM-4.7-Flash-MXFP4_MOE.gguf
--temp 0.7
--top-p 1.0
--min-p 0.01
--repeat-penalty 1.0
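When comparing quants, it helps to launch each one with exactly the same flags. Below is a rough sketch of a helper that turns a settings dictionary into a llama-server argument list. The function name is hypothetical, the flag names and values are copied from the config above, and the port is an assumed literal standing in for the `${PORT}` placeholder.

```python
def build_server_args(binary, options):
    """Turn {"flag-name": value} into an argv list for llama-server.

    True becomes a bare switch (e.g. --jinja); False or None is skipped.
    """
    args = [binary]
    for name, value in options.items():
        flag = f"--{name}"
        if value is True:
            args.append(flag)
        elif value is False or value is None:
            continue
        else:
            args.extend([flag, str(value)])
    return args


# A subset of the settings posted above; swap "model" to test another quant.
shared_options = {
    "n-gpu-layers": 999,
    "jinja": True,
    "ctx-size": 65536,
    "ctx-checkpoints": 128,
    "cache-type-k": "q8_0",
    "cache-type-v": "q8_0",
    "temp": 0.7,
    "top-p": 1.0,
    "min-p": 0.01,
    "repeat-penalty": 1.0,
    "port": 8080,  # assumed value, not from the original config
    "model": "G:/Models/GLM-4.7-Flash-MXFP4_MOE.gguf",
}

argv = build_server_args("llama-server", shared_options)
print(" ".join(argv))
# To launch: import subprocess; subprocess.run(argv, check=True)
```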
oh... lol, there's nothing to compare.
I was just using:
ctx-size 131072
jinja
temp 0.7
top-p 1.0
gpu-layers 99
I will try your config settings in a bit and report back. thanks for sharing.
The new uploads
GLM 4.7 Flash MXFP4 MoE -Bf16: the same loop mistake.
GLM-4.7-Flash-MXFP4_MOE-GGUF: perfect, no loops
I find this very peculiar, as the BF16 version essentially has the original unquantized tensors in the GGUF and is not quantized at all.
Very interesting finding, thank you for your time testing it!
Just a bit of feedback, testing the f16 version.
People say it goes off the rails with just "hey", so I started there.
I also tested it by asking it to make Pac-Man in JavaScript. The game wasn't playable, but the model didn't go off the rails during generation.
Params:
[glm-4.7-flash-mxfp4.gguf]
model = /gguf_files/GLM-4.7-Flash-MXFP4_MOE_F16.gguf
n-gpu-layers = 99
n-cpu-moe = 48
ctx-size = 32768
batch-size = 16384
ubatch-size = 1024
slot-prompt-similarity = 0.2
ctx-checkpoints = 128
spec-type = ngram-map-k
draft-max = 48
cache-type-k = q8_0
cache-type-v = q8_0
repeat-penalty = 1.0
jinja = true
temp = 0.7
top-p = 1.0
flash-attn = true
sleep-idle-seconds = 3600
I don't know what half of these do, but they make a huge difference in the model's output.
I experienced this issue also. It looped and usually pretty quickly. The non-BF16 model didn't do this. I used the setting supplied in this thread. On strix halo.
What inference software and/or harness did you use it with?
If you are using llama-server, try adding the option --ctx-checkpoints 128
I run it locally on a Strix Halo machine, with Open WebUI, the Lemonade desktop app, or directly with llama.cpp. The unsloth MXFP4 runs OK in this case.
Open WebUI has a nice Python interpreter and live HTML rendering. I tested it using this and had looping. Looping does not happen with the normal (unsloth) model or the non-BF16 one from you.
Do you experience the same looping issue with the normal MXFP4 version from my repo?
I personally do not have any issues with this model, but I'm running it only for coding related tasks through opencode and kilo code.
It should be said that GLM-4.7-Flash is primarily a coding model, so it should be used in harnesses with coding in mind.
OpenWebUI is a general chat interface with no coding harness.
After considering that users get looping issues, I decided to retire this F16/BF16 variant. Please download the normal one.
