
Q8_K_L Request

#1
by sean128 - opened

Would you please consider adding Q8_K_L as a variant?

llama.cpp does not have any Q8_K_L quants, but we do provide Q8_0 quants at: https://huggingface.co/mradermacher/NVIDIA-Nemotron-Labs-3-Elastic-30B-A3B-BF16-GGUF/blob/main/NVIDIA-Nemotron-Labs-3-Elastic-30B-A3B-BF16.Q8_0.gguf

Q8_0, which stores everything in Q8, is already absolute overkill and mainly exists because it runs really fast on certain hardware. Humans won't be able to tell any difference between i1-Q5_K_M and anything above it. Even on benchmarks there is no measurable difference between Q8 and source precision. You only really see a tiny difference when measuring things such as KL divergence, perplexity, and same-token probability, but even there the difference is extremely small, maybe even below what you would get from rounding errors in most efficient inference engines.
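For anyone curious what those metrics actually compare, here is a rough sketch in plain Python. The logits are made-up illustrative numbers, not measurements from this model or any quant, and the helper names (`softmax`, `kl_divergence`, `perplexity`, `same_top_token`) are hypothetical, not part of llama.cpp. "Same token" is interpreted here as the reference and the quant agreeing on the most likely next token, which is an assumption about what is meant.

```python
import math

def softmax(logits):
    # Subtract the max for numerical stability before exponentiating.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    # KL(p || q): how much the quantized distribution q deviates from
    # the reference distribution p for a single token position.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def perplexity(probs_of_true_tokens):
    # exp of the mean negative log-probability assigned to the actual tokens.
    n = len(probs_of_true_tokens)
    return math.exp(-sum(math.log(p) for p in probs_of_true_tokens) / n)

def same_top_token(p, q):
    # True when both distributions pick the same most likely token.
    top_p = max(range(len(p)), key=p.__getitem__)
    top_q = max(range(len(q)), key=q.__getitem__)
    return top_p == top_q

# Made-up logits for one token position (illustration only).
ref_logits = [2.0, 1.0, 0.1, -1.0]       # stand-in for source precision
quant_logits = [2.01, 0.99, 0.11, -1.02]  # stand-in for a high-bit quant

p = softmax(ref_logits)
q = softmax(quant_logits)
kl = kl_divergence(p, q)
agree = same_top_token(p, q)

print(f"KL divergence: {kl:.6f}")
print(f"same top token: {agree}")
```

In practice these numbers get averaged over many token positions of a test text, so a per-token KL this close to zero is the kind of "tiny difference" mentioned above.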
