out of memory at 160k

#1
by starskyzheng - opened

Using Huihui-Qwen3.6-27B-abliterated-i1-IQ4_XS-FFN-IQ3.gguf (13 GB), I run out of memory at a 160k context length.
The example used Qwen3.6-27B-abliterated-i1-IQ4_XS-FFN-IQ3_S.gguf. Is that the same model?

Yes. The KV cache type needs to be set to turbo4.

Same here, I cannot get that much context length. This is the command I am running:
/path/to/llama-cpp-turboquant/build/bin/llama-server -m /path/to/models/Huihui-Qwen3.6-27B-abliterated-i1-IQ4_XS-FFN-IQ3.gguf -c 102400 -ngl 99 --flash-attn on --cache-type-k turbo4 --cache-type-v turbo4 --host 0.0.0.0

I did not look for the exact threshold above 100k, but it did fail for me at 128k.

Running on a laptop with 5080 16GB VRAM, Kubuntu 26.04

Is another application using the GPU memory? My monitor is driven by the integrated graphics card, so the discrete graphics card can be used entirely for loading large models.

Hmm... could it be that llama-server behaves differently on Windows and Linux?
Here is what I found:

  1. The Linux server I ran actually automatically upgraded the k cache: "llama_kv_cache: auto-asymmetric: GQA ratio 6:1 (n_head=24, n_head_kv=4) — upgrading K from turbo4 to q8_0 to prevent quality degradation. Disable with TURBO_AUTO_ASYMMETRIC=0"
  2. It runs 4 slots in parallel by default. Is your Windows setup running only 1 slot?

Anyway, after addressing those two points, I can now get a 160k context length.
But this actually makes me think: maybe I should keep the q8_0 K cache for better quality, since a 100k context length is already enough for what I am trying to do 😀
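
To make the trade-off above concrete, here is a rough sketch of how the K cache size scales with the cache type. Only n_head_kv=4 comes from the log line quoted in this thread. HEAD_DIM and N_LAYER_KV are hypothetical placeholders, not confirmed values for this model, and "160k" is assumed to mean 160 * 1024 tokens. The thread does not state the storage size of turbo4, so the standard ggml q4_0 layout (18 bytes per 32 values) is used as a stand-in for a 4-bit-class cache, next to q8_0 (34 bytes per 32 values).

```shell
#!/bin/sh
# Rough K-cache size estimate. Illustrative only: see the assumptions below.
N_CTX=$((160 * 1024))   # assumption: "160k" = 163840 tokens
N_HEAD_KV=4             # from the llama_kv_cache log line in this thread
HEAD_DIM=256            # hypothetical placeholder, not confirmed for this model
N_LAYER_KV=16           # hypothetical count of layers that hold a KV cache

# Number of K values stored, grouped into ggml blocks of 32 values.
ELEMS=$((N_CTX * N_LAYER_KV * N_HEAD_KV * HEAD_DIM))
BLOCKS=$((ELEMS / 32))

# q8_0: 32 int8 values + one fp16 scale = 34 bytes per block.
# q4_0: 32 4-bit values + one fp16 scale = 18 bytes per block
#       (stand-in for turbo4, whose exact size is not given in the thread).
K_Q8_MIB=$((BLOCKS * 34 / 1048576))
K_Q4_MIB=$((BLOCKS * 18 / 1048576))

echo "K cache at q8_0:        ${K_Q8_MIB} MiB"
echo "K cache at 4-bit class: ${K_Q4_MIB} MiB"
echo "Extra cost of q8_0 K:   $((K_Q8_MIB - K_Q4_MIB)) MiB"
```

Whatever the real layer count and head size are, the ratio stays the same: under these block sizes a q8_0 K cache takes roughly 1.9 times the memory of a 4-bit-class one, which is why the automatic upgrade can push a 16 GB card over the limit at long context.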

So that explains it. Setting TURBO_AUTO_ASYMMETRIC=0 makes the K cache stay in the turbo4 format instead of being auto-upgraded to q8_0.
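
As a sketch, the two changes discussed in this thread can be combined into a single launch line. The paths are the placeholders from the command posted earlier, and the context size of 160 * 1024 tokens is an assumption about what "160k" means. --parallel is the upstream llama-server option for the number of slots; it is assumed here that this fork keeps it unchanged. The script only prints the command so it can be checked before running.

```shell
#!/bin/sh
# Placeholder paths copied from the command posted earlier in the thread.
SERVER=/path/to/llama-cpp-turboquant/build/bin/llama-server
MODEL=/path/to/models/Huihui-Qwen3.6-27B-abliterated-i1-IQ4_XS-FFN-IQ3.gguf
CTX=$((160 * 1024))   # assumption: "160k" = 163840 tokens

# TURBO_AUTO_ASYMMETRIC=0 keeps K at turbo4 (per the log message quoted above).
# --parallel 1 requests a single slot instead of the default of 4 reported above.
CMD="TURBO_AUTO_ASYMMETRIC=0 $SERVER -m $MODEL -c $CTX -ngl 99 --flash-attn on --cache-type-k turbo4 --cache-type-v turbo4 --parallel 1 --host 0.0.0.0"

# Print the command for inspection; paste it into a shell to launch the server.
echo "$CMD"
```

Leaving out the TURBO_AUTO_ASYMMETRIC=0 prefix keeps the automatic q8_0 upgrade for K, which trades a shorter maximum context for the quality benefit mentioned in the log message.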

Or use the released version from https://github.com/lemonyins/llama-cpp-turboquant-mtp, in which this issue is fixed.
