BF16 has looping issues

#4
by jmander11 - opened

I have the standard temp 1, top p 0.95, min p 0.01, repeat penalty 1 applied in open webui running with llama.cpp.
Does this need different settings? I see the first few prompts can produce gibberish until it's warmed up from something like a simple "hi" until it permanently stabilizes. More complicated things get stuck on something like:

I need to output a high-level explanation.
I need to output a high-level explanation.
I need to output a high-level summary.
I need to output a high-level explanation.
I need to output a high-level explanation.
I need to output a high-level summary.

or

[BOS] [BOS] [BOS] [BOS] [BOS] [BOS] [BOS] [BOS] [BOS] [BOS]

Is this a problem with all 3 of these models or just the bf16?

F16 and Bf16 think normally and both loop the code.
//////////////////////

I haven't found the model looping. Must be your sampling.

However, got a few tests on my personal repos and failed a large refactor where it keeps adding more bugs, but it's ok for light work. UD-Q6_K_XL can't fully fix the problem either, so. LM Studio's Q6_K performed best in my use so far.

Owner

try use the parameters for tool-calling, so that it does not loop, I've seen that the suggested parameters for general use, tend to loop
--temp 0.7
--top-p 1.0
--min-p 0.01
--repeat-penalty 1.0

Owner

Also I think of retiring the BF16 version, its virtually the same as F16, and unlike F16, the BF one does not run as fast on older cards.

Bf16 should be a straight upgrade over F16 due to less clipping of values though, right? Optimizing for older hardware seems odd if F16 should suffer from these issues more theoretically.

I have used a Q5_K_XL with those standard settings I mentioned and got no looping issues. I can try the tool call tuning or more aggressive code style tuning.

Are all three of these quants based off of the Jan 21 re-release or whatever where bugs causing looping were fixed?

Owner

Yes I used the updated model for these. Seems weird that it has looping issues.
You mentioned open webui, so the model is used straight, so maybe this is the problem?
I use it mainly with llama-swap through opencode or kilo code, and haven't encountered looping issues, maybe these harnesses do something open webui does not?

I am using LM Studio with the latest CUDA version.
It does not matter which settings are selected.
Temp 1.0, 0.7, 04, .
After a few lines, the same loop always appears.

Owner

Is it possible for you to also test the MXFP4_MOE from unsloth? it should be the same, except for my F16 version that uses F16 tensors instead of Q8.
In general this glm release does not seem very strong to me in general, it has many flaws, where the older Devstral succeeds.
Maybe its very sensitive to this FP4 quantization.

Owner

Bf16 should be a straight upgrade over F16 due to less clipping of values though, right? Optimizing for older hardware seems odd if F16 should suffer from these issues more theoretically.

I have used a Q5_K_XL with those standard settings I mentioned and got no looping issues. I can try the tool call tuning or more aggressive code style tuning.

Are all three of these quants based off of the Jan 21 re-release or whatever where bugs causing looping were fixed?

Yes it should, but many people with older cards will have speed hampered by over 20% with BF16 instead of FP16.
I am mimicking what unsloth does with their UD-Q8_K_XL quant, that uses F16 and Q8.

I just tested the unsloth Mxfp4-Moe version. No errors.

Owner

Thank you for taking time from your precious schedule to confirm. Seems very weird.
Does LM Studio use a custom version of llama.cpp or the mainline version?
I do not use this program, I use llama-server together with llama-swap, so I am sorry that cannot replicate your environment right now.
I will conduct some more tests to see why this would happen.

LM studio is always a little behind.
Currently, it's llama.cpp release b8077
(commit d612901)
I just tested your Qwen3-Coder-Next-MXFP4_MOE_BF16.
No errors.
So it can't actually be the llama build?

Owner

So I just updated the GLM-4.7-Flash-MXFP4_MOE GGUF, this is the same as the normal mainline MXFP4_MOE quant, as used by unsloth, but with one critical difference: The most important tensor, output.weight, is BF16 instead of Q8, so this should theoretically provide a small advantage. I will try it out later to see if this loops

Nice, I'll test it right away. Thank you for your work!

Owner

Nice, I'll test it right away. Thank you for your work!

Thank you for reporting the issue and testing!

Sorry to jump in here. I have also been having issues with this model - all quants and all of these versions. I don't know if it's a quantization issue, or a sampling issue, or related to the model.
I couldn't fully get this glm flash model to work on the latest llama.cpp build properly.

But I wanted to add that I haven't been able to get anybody's quant of this model to work as well.
Ollama's is good, it works actually, but I have been having similar issues with all other variations.

Using llama-server+openwebui
Also llama-server+opencode
Also llama-server+continue.dev

(I've tried all the recommended settings)
Setting really high context (eg 131072 +) seems to make an improvement.
Setting super low temps like 0.1 or 0.25 seem to make a bit of an improvement.
But overall has not been very usable so I gave up the other day (yesterday? maybe yesterday)

I've tried unsloth, ggml, ubergarm, noctrex, and ultimately for my small model I switched back to gpt oss 20b.
Too many errors, too much confidence in incorrect responses, outputs were all messed up in opencode, I fixed the looping issues with repeat penalty but it still wasn't as good. Can't figure out what exactly the issue is either.

On the other hand, the qwen3-coder-next model has been amazing and has really taken to the increased precision well. Improvements in the output, accuracy, and overall quality have been exceptional.
Right now working on testing the exact same environment, seed and prompt to compare the f16 version to the bf16 version because I'm getting a sense that the bf16 version is generating a higher quality output. In the test I did in the other thread, almost all of them were with the bf16 version.

Anyway, my whole point is that either:
Different models are responding different to the increased precision (seems unlikely)
or maybe there's something missing in the template, or the model configuration causing these errors.

Also there's a discussion about something similar https://huggingface.co/zai-org/GLM-4.7-Flash/discussions/66

hmmm let me post my llama-server config so that you can compare:

C:/Programs/AI/llamacpp-rocm/llama-server.exe
--n-gpu-layers 999
--metrics
--jinja
--batch-size 16384
--ubatch-size 1024
--port ${PORT}
--slot-prompt-similarity 0.2
--ctx-checkpoints 128
--spec-type ngram-map-k
--draft-max 48
--ctx-size 65536
--cache-type-k q8_0
--cache-type-v q8_0
--model G:/Models/GLM-4.7-Flash-MXFP4_MOE.gguf
--temp 0.7
--top-p 1.0
--min-p 0.01
--repeat-penalty 1.0

oh... lol, there's nothing to compare.

I was using just using:
ctx-size 131072
jinja
temp 0.7
top-p 1.0
gpu-layers 99

I will try your config settings in a bit and report back. thanks for sharing.

The new uploads
GLM 4.7 Flash MXFP4 MoE -Bf16: the same loop mistake.
GLM-4.7-Flash-MXFP4_MOE-GGUF: perfect, no loops

Owner

The new uploads
GLM 4.7 Flash MXFP4 MoE -Bf16: the same loop mistake.
GLM-4.7-Flash-MXFP4_MOE-GGUF: perfect, no loops

I find this very peculiar. As essentially the BF16 version has the original unquantized tensors in the gguf and not quantized at all.
Very interesting finding, thank you for your time testing it!

Just a bit of feedback, testing the f16 version.

People say it goes off the rails with just "hey", so i started there.
Also tested with asking it to make pacman in javascript, game wasn't playable, but the model didn't go off the rails or anything during generation.

Params:

[glm-4.7-flash-mxfp4.gguf]
model = /gguf_files/GLM-4.7-Flash-MXFP4_MOE_F16.gguf
n-gpu-layers = 99
n-cpu-moe = 48
ctx-size = 32768
batch-size = 16384
ubatch-size = 1024
slot-prompt-similarity = 0.2
ctx-checkpoints = 128
spec-type = ngram-map-k
draft-max = 48
cache-type-k = q8_0
cache-type-v = q8_0
repeat-penalty = 1.0
jinja = true
temp = 0.7
top-p = 1.0
flash-attn = true
sleep-idle-seconds = 3600

Don't know what half these do, but they make a huge difference in the models output.

No issues with greeting.
image

I experienced this issue also. It looped and usually pretty quickly. The non-BF16 model didn't do this. I used the setting supplied in this thread. On strix halo.

I experienced this issue also. It looped and usually pretty quickly. The non-BF16 model didn't do this. I used the setting supplied in this thread. On strix halo.

With what inference software and/or harness did you use it?

Owner

If you are using llama-server, try to add the option --ctx-checkpoints 128

I run locally on strix halo machine. Open WebUI, Lemonade desktop app, or directly with llamacpp. Running unsloth mxfp4 ok with this case.

Open WebUI has a nice python interpreter and live HTML rendering. I tested it using this and had looping. Looping does not happen with normal (unsloth) model or non BF16 from you.

Owner

Do you experience the same looping issue with the normal MXFP4 version from my repo?
I personally do not have any issues with this model, but I'm running it only for coding related tasks through opencode and kilo code.
It should be said that GLM-4.7-Flash is primarily a coding model, so it should be used in harnesses with coding in mind.
OpenWebUI is a general chat interface with no coding harness.

Owner

After consideration that users get looping issues, I decided to retire this F/BF16 variant. Please download the normal one.

Sign up or log in to comment