Hallucinations, unstable results, tool call errors with UD-Q6_K_XL?

#15
by tooltd - opened

I tried both the UD-Q5_K_XL and UD-Q6_K_XL quants.
I used them for programming, and I found that UD-Q6_K_XL had a higher error rate than UD-Q5_K_XL. It made very silly mistakes and its output was unstable. Has anyone else experienced this?

I have the same issue with UD-Q6_K_XL and the kilo code tool. The tool call failed even on a simple task.

I use the recommended parameters for coding tasks. When I run the same prompt repeatedly, there's one instance where the result is completely different from the others, and it doesn't even comply with the prompt's requirements. 😂
It's hard to understand. I'll probably go back to Q5.

Unsloth AI org

It could be because there isn't enough compute. We've seen many people say the models don't load, or just break, because of max memory use. If you get better results with Q5, then definitely stick with the smaller one.

I have also been having tool calling issues on Q8_K_XL and Q8 from Bartowski too. I don't think it is an unsloth specific problem though.
https://huggingface.co/Qwen/Qwen3.6-35B-A3B/discussions/40

Unsloth AI org

Usually it has to do with the tooling you're using, not necessarily the model or quant. What tooling are you using?

They have similar issues, unfortunately. Doom loops, disobeying skill instructions. 😕

the issue seems to be with the chat template. switching to the qwen3.5 chat template fixed all errors for me.

@shuasimodo would you mind sharing your template reference? ☺️

sorry about that. just this file: https://huggingface.co/Qwen/Qwen3.5-35B-A3B/raw/main/chat_template.jinja
just download it and link to it as your chat template file in llama.cpp or whatever you're using for inference.
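For anyone who wants the steps spelled out, here is a minimal sketch, assuming llama.cpp's llama-server and its `--jinja` and `--chat-template-file` options. The model path and local file name are placeholders I made up, and the download helper is defined but not called, so nothing is fetched until you call it yourself.

```python
import shlex
import urllib.request

# Template URL as posted above in this thread.
TEMPLATE_URL = "https://huggingface.co/Qwen/Qwen3.5-35B-A3B/raw/main/chat_template.jinja"


def download_template(dest="chat_template.jinja"):
    """Fetch the chat template to a local file and return its path."""
    urllib.request.urlretrieve(TEMPLATE_URL, dest)
    return dest


def server_command(model_path, template_path):
    """Build a llama-server command line that uses an external chat template."""
    return [
        "llama-server",
        "-m", model_path,
        "--jinja",
        "--chat-template-file", template_path,
    ]


# Hypothetical paths, for illustration only.
cmd = server_command("/path/to/model.gguf", "chat_template.jinja")
print(shlex.join(cmd))
```

Call `download_template()` once, then run the printed command (or copy the two template options into whatever launcher you use).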

Yeah Q6_K_XL seems to take ages to load and then goes like 6.7t/s.
Works with v11 template from here: https://huggingface.co/froggeric/Qwen-Fixed-Chat-Templates

Q6_K seems at least 2x faster

@MortiDahlaine

This is mainly because of improper use of f32 upcasting during quantization, in an attempt to achieve higher quality.

Using f32 (blah blah, it's the same as bf16 with extra 0's and is universally compatible) uses significantly more VRAM and creates more calculations on the GPU, and on the CPU it's heavier and clunkier. It also alters the original precision, and it is what is causing this "use a smaller model" nonsense. When I say alter, people can argue whatever they want, but I mean it in the sense that bf16 is the original release, where the weights are not changed. f32 is upcasting: even though it's the same values with just extra 0's, it's still not the same as the original bf16 because it now has the extra 0's.
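To make the "extra 0's" point concrete, here is a small sketch using only Python's standard library. It is my own illustration, not anything from llama.cpp: it upcasts one bf16 bit pattern (0x3FC0, which encodes 1.5) to f32 by appending 16 zero bits, and shows that the value is unchanged while the storage per weight doubles.

```python
import struct


def f32_bits(x):
    """Return the 32-bit IEEE 754 pattern of x as an integer."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


def bf16_bits_to_f32(bits16):
    """Upcast a bf16 bit pattern to f32 by appending 16 zero bits."""
    return struct.unpack("<f", struct.pack("<I", (bits16 & 0xFFFF) << 16))[0]


def f32_to_bf16_bits(x):
    """Truncate an f32 value to bf16 by dropping the low 16 bits."""
    return f32_bits(x) >> 16


bf16_weight = 0x3FC0  # bf16 bit pattern for 1.5
as_f32 = bf16_bits_to_f32(bf16_weight)

print(as_f32)                          # same numeric value
print(f"{bf16_weight:016b}")           # 16 bits stored as bf16
print(f"{f32_bits(as_f32):032b}")      # same bits followed by 16 zeros
print("bytes per weight:", 2, "vs", 4)
```

The round trip back to bf16 returns the original bits, so nothing is gained numerically. The cost is the doubled memory footprint for every tensor stored this way.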

Before anybody cries and pretends they're an ego-genius: yes, norms, inps, and maybe a couple of others are in f32 and stay in f32. They're tiny and critical, and that's also how llama.cpp quantizes. That's not my point. My point is that putting all ssm_* weights in f32 is uncalculated recklessness.

BF16 is the original weight format. It runs on older hardware when properly quantized, and it creates a more stable numerical flow of calculations during inference, which results in better output.

There is this whole thing people believe without empirically testing, which is that bf16 doesn't work well on older hardware, and so now we're seeing ssm weights in f32 without anyone knowing what the pros and cons are.

Pros of f32 - better on paper (bf16 quality)
Cons - significantly higher VRAM use
Cons - smaller context window
Cons - altering the original precision

here's the highest quality model I've been able to quantize (with the trade-off of attn_gate) from experimenting with every weight for over 2 months.

  1. ffn weights should all be in the same precision for a more stable flow of data during inference. Putting shexp in a higher precision than exps does not make it better. The model might seem to have more dynamic range, but it's an inaccurate numerical flow of data that tries to take a 16-bit decision and turn it into 8- or 4-bit expert routing. Studies show one thing, but hundreds of hours of empirical testing show otherwise.

  2. Using bf16 in the right places is not actually as slow as you'd think, and it is critical for maintaining the coherency, accuracy, and dynamic capability of the model.

  3. Using the q8_0 kv cache type is not a free lunch. It stores k and v in a quantized form right at the beginning of the inference stage, which influences the rest of the process and compounds over the length of the chat. This is where you'll see little errors like spelling errors, wrong names, wrong variables in code, etc. "Negligible", sure, but if you are seeking quality and accuracy, that is likely why. If you're running in a true production environment, give bf16 a try. Then you'll be storing the k and v cache values in their original form.
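As a rough illustration of the rounding that point 3 is talking about, here is a simplified sketch of block-wise 8-bit quantization in the style of Q8_0: 32 values share one scale, and each value is rounded to an integer in [-127, 127]. This is my own approximation, assuming that layout. The real format also stores the scale in half precision, which this sketch ignores, and the random input is made up rather than real k/v data.

```python
import random

BLOCK = 32  # values per block sharing one scale (assumed Q8_0-style layout)


def quantize_block(values):
    """Quantize one block to int8-range integers plus a single float scale."""
    amax = max(abs(v) for v in values)
    scale = amax / 127 if amax else 0.0
    q = [round(v / scale) if scale else 0 for v in values]
    return scale, q


def dequantize_block(scale, q):
    """Reconstruct approximate values from the scale and the integers."""
    return [scale * x for x in q]


random.seed(0)
block = [random.gauss(0.0, 1.0) for _ in range(BLOCK)]  # stand-in for k/v values
scale, q = quantize_block(block)
restored = dequantize_block(scale, q)
max_err = max(abs(a - b) for a, b in zip(block, restored))

print(f"scale = {scale:.6f}")
print(f"max abs round-trip error = {max_err:.6f}")
```

Each value can be off by up to half a scale step after the round trip. The sketch only shows that per-value error. Whether it compounds over a long chat the way I describe is something you would have to measure on real runs.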

I have tested these weights on an RTX 2060 6 GB GPU using llama.cpp, with:

context 102400
kv cache type f16
batch 1024
ubatch 512

and I get a stable ~100 token/sec PP with ~16 token/sec TG.

./llama.cpp/build/bin/llama-quantize \
  --output-tensor-type q8_0 \
  --token-embedding-type bf16 \
  --tensor-type output=bf16 \
  --tensor-type attn_gate=q8_0 \
  --tensor-type attn_qkv=bf16 \
  --tensor-type attn_q=bf16 \
  --tensor-type attn_k=bf16 \
  --tensor-type attn_v=bf16 \
  --tensor-type attn_output=bf16 \
  --tensor-type ssm_beta=bf16 \
  --tensor-type ssm_alpha=bf16 \
  --tensor-type ssm_out=bf16 \
  --tensor-type ffn_up_shexp=q8_0 \
  --tensor-type ffn_gate_shexp=q8_0 \
  --tensor-type ffn_down_shexp=q8_0 \
  --tensor-type ffn_up_exps=q8_0 \
  --tensor-type ffn_gate_exps=q8_0 \
  --tensor-type ffn_down_exps=q8_0 \
  /gguf_files/qwen3.6/bf16/Qwen3.6-35B-A3B-BF16-00001-of-00002.gguf \
  /gguf_files/qwen3.6/Qwen3.6-35B-A3B-HQ.gguf \
  Q8_0

tldr of the script: attn_gate is important, but you need to decide whether or not you want context. With the gate at bf16, on 6 GB VRAM you get 40k context. It's smart and good, but it's not really production capable with such a limited context window. With it set to q8_0, you get 102k context. Then again, unsloth's Q8_K_XL uses a q8_0 gate too, so I'm going to safely assume that's been well vetted and is okay to put in q8_0.

Important: this is with no mmproj file. With one, you need to lower the context or shift your weights around.
Actually, you can keep your mmproj in RAM instead of offloading it to the GPU and set image-max-tokens, and this will all work with vision.

example:

no-mmproj-offload = true
image-max-tokens = 256  

What I'm trying to say, and validate, is that bf16 is the original precision, and that it works better than f32 on older hardware in terms of both computational overhead and quality, without altering most of the model's original weights or upcasting to f32.


Test & results

llama_model_quantize_impl: model size = 66152.24 MiB (16.01 BPW)
llama_model_quantize_impl: quant size = 36560.05 MiB (8.85 BPW)
This seems to work well. The weights are almost all in bf16, ffn is in 8 bit, and attn_gate is in 8 bit too.
attn_gate influences VRAM usage. With this model I can get a 102400 context window with the f16 kv cache type, and it fits.
Speeds are the same as my earlier HQ model (not included in this test, sorry): ~107 tokens/sec PP with tool calls, ~17 tokens/sec TG.
VRAM: 5654MiB / 6144MiB
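If you want to sanity-check the two size lines above, the arithmetic is simple. This sketch derives the approximate parameter count from the bf16 size and its reported BPW, then recomputes the BPW of the quantized file. The function names are my own, and the inputs are the numbers from the llama_model_quantize_impl lines.

```python
MIB = 1024 * 1024  # bytes per MiB


def params_from_size(size_mib, bpw):
    """Estimate the parameter count from a file size and its bits per weight."""
    return size_mib * MIB * 8 / bpw


def bpw_from_size(size_mib, n_params):
    """Compute bits per weight from a file size and a parameter count."""
    return size_mib * MIB * 8 / n_params


n_params = params_from_size(66152.24, 16.01)     # bf16 source model
quant_bpw = bpw_from_size(36560.05, n_params)    # quantized output

print(f"approx. parameters: {n_params / 1e9:.2f} B")
print(f"quantized BPW: {quant_bpw:.2f}")
```

The recomputed value rounds to the same 8.85 BPW that llama-quantize printed, and the parameter estimate comes out a little under 35 B, which is consistent with the model name.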

model config:

[Qwen3.6-35b]
model = /gguf_files/qwen3.6/Qwen3.6-35B-A3B-HQ.gguf
#mmproj = /gguf_files/qwen3.6/mmproj-q8_0.gguf
alias = qwen3.6-35b
n-gpu-layers = 41
n-cpu-moe = 40
ctx-size = 102400
threads = 6
cache-type-k = f16
cache-type-v = f16
batch-size = 1024
ubatch-size = 512
flash-attn = true
sleep-idle-seconds = -1
no-kv-offload = false 
context-shift = false
chat-template-file = /home/llama-user/.config/llama/templates/qwen3.6.jinja
jinja = true
temp = 0.6
top-p = 0.95
top-k = 20
min-p = 0.0
repeat-penalty = 1.0
presence-penalty = 0.0
load-on-startup = false
#chat-template-kwargs = {"enable_thinking":false}

GPU

$ nvidia-smi
Thu May 14 22:07:41 2026       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 595.58.03              Driver Version: 595.58.03      CUDA Version: 13.2     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 2060        Off |   00000000:23:00.0 Off |                  N/A |
| 44%   49C    P2             70W /  136W |    5656MiB /   6144MiB |     71%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A         1560439      C   ...ma.cpp/build/bin/llama-server       5652MiB |
+-----------------------------------------------------------------------------------------+

Memory usage

$ top

top - 23:34:56 up 4 days, 10:12,  1 user,  load average: 2.19, 1.65, 1.33
Tasks: 369 total,   2 running, 367 sleeping,   0 stopped,   0 zombie
%Cpu(s):  7.6 us,  1.3 sy,  0.0 ni, 90.5 id,  0.6 wa,  0.0 hi,  0.0 si,  0.0 st
MiB Mem :  31748.4 total,    237.8 free,   1293.5 used,  30859.9 buff/cache
MiB Swap:  31367.0 total,  28732.7 free,   2634.2 used.  30454.8 avail Mem 

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND                   
1560905 llama-u+  20   0   80.5g  29.9g  29.8g R  93.4  96.3  12:37.94 llama-server

Important: I use this GPU because it's the smallest I have access to, so it's the best measure of what can be run and how. Most people have 12 GB, 24 GB, or more, so do the math for your own setup and you'll have an idea of what you can run. Also, this GPU is f16 only and does not support bf16, so it is the evidence for judging the claim of slower speeds on non-bf16 GPUs.

You can also put the ffn weights into 4 bit (I like mxfp4 because it keeps everything floating point and isn't some fp -> int game); on this GPU that gets ~220 tokens/sec PP and ~25 tokens/sec TG. Or q6_k, or whatever. But too much use of f32 is the tldr of this rant.

If anyone is wondering why my script sets both attn_qkv and attn_q, attn_k, attn_v, it's because blocks 0, 1, and 2 use qkv:
[ 6/ 843] blk.0.attn_qkv.weight - [ 2048, 8192, 1, 1], type = bf16, size = 32.000 MiB

Starting in block 3, it moves to q, k, v. I found, at least for me, that if I didn't explicitly state what type those weights should be, llama.cpp put them in q8_0, probably because of the quant type at the bottom of the command.

[  58/ 843] blk.3.attn_k.weight                  - [  2048,    512,      1,      1], type =   bf16, size =    2.000 MiB
[  59/ 843] blk.3.attn_k_norm.weight             - [   256,      1,      1,      1], type =    f32, size =    0.001 MiB
[  60/ 843] blk.3.attn_norm.weight               - [  2048,      1,      1,      1], type =    f32, size =    0.008 MiB
[  61/ 843] blk.3.attn_output.weight             - [  4096,   2048,      1,      1], type =   bf16, size =   16.000 MiB
[  62/ 843] blk.3.attn_q.weight                  - [  2048,   8192,      1,      1], type =   bf16, size =   32.000 MiB
[  63/ 843] blk.3.attn_q_norm.weight             - [   256,      1,      1,      1], type =    f32, size =    0.001 MiB
[  64/ 843] blk.3.attn_v.weight                  - [  2048,    512,      1,      1], type =   bf16, size =    2.000 MiB
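A quick way to check that every override took effect is to scan the log for tensor types. This is a small sketch that assumes the log lines look exactly like the ones quoted above. The regex and function name are my own, and the sample text is just three of those lines.

```python
import re
from collections import defaultdict

# Matches lines like:
# [  58/ 843] blk.3.attn_k.weight - [ 2048, 512, 1, 1], type = bf16, size = 2.000 MiB
LINE_RE = re.compile(
    r"\[\s*\d+/\s*\d+\]\s+blk\.(\d+)\.(\S+?)\.weight\s+-\s+\[[^\]]*\],\s+type\s*=\s*(\w+)"
)


def types_by_tensor(log_text):
    """Map each per-block tensor name to the set of types seen in the log."""
    result = defaultdict(set)
    for match in LINE_RE.finditer(log_text):
        _block, name, dtype = match.groups()
        result[name].add(dtype)
    return dict(result)


sample = """\
[ 6/ 843] blk.0.attn_qkv.weight - [ 2048, 8192, 1, 1], type = bf16, size = 32.000 MiB
[  58/ 843] blk.3.attn_k.weight                  - [  2048,    512,      1,      1], type =   bf16, size =    2.000 MiB
[  59/ 843] blk.3.attn_k_norm.weight             - [   256,      1,      1,      1], type =    f32, size =    0.001 MiB
"""

summary = types_by_tensor(sample)
for name in sorted(summary):
    print(name, sorted(summary[name]))
```

Run it over the full log and look for any attn_* name that reports q8_0 when you expected bf16. That is how a missing explicit type for q, k, or v would show up.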
