alternative mxfp4 with mixed shared experts and non-experts

#1
by hugypufy - opened

I think a mixed scheme for the shared experts, MXFP4 (80), Q6_K (28), Q8_0 (12) instead of just Q8_0, with the non-expert tensors at Q5_K and Q8_0, would be much faster than noctrex MXFP4_Q8.

Do you think quality will deteriorate a lot as a result?

Owner

I've made some tests with the lower quants, but they all are inferior in quality. Also, Q8_0 is quite performant. The K variants are actually slower than the simpler _0 variant.
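
For readers unfamiliar with the "_0" formats, here is a rough Python sketch of how a Q8_0 block works: 32 weights share one scale, and each weight is stored as a signed 8-bit integer. This is an illustration of the general scheme, not llama.cpp's actual code; the function names are made up for this example, and Python's `round` differs slightly from C's `roundf` on exact ties.

```python
import struct

QK8_0 = 32  # weights per Q8_0 block


def quantize_q8_0_block(values):
    """Quantize 32 floats into (scale, 32 int8 values), Q8_0-style (illustrative)."""
    if len(values) != QK8_0:
        raise ValueError("Q8_0 works on blocks of 32 weights")
    amax = max(abs(v) for v in values)
    d = amax / 127.0
    # The scale is stored as fp16, so round-trip it through half precision.
    d = struct.unpack("<e", struct.pack("<e", d))[0]
    inv = 1.0 / d if d else 0.0
    # Clamp in case the fp16 rounding of the scale pushes a value past 127.
    qs = [max(-127, min(127, round(v * inv))) for v in values]
    return d, qs


def dequantize_q8_0_block(d, qs):
    """Reconstruct approximate floats: one multiply per weight."""
    return [d * q for q in qs]
```

Dequantization is a single multiply per weight, whereas the K formats use 256-weight super-blocks with packed sub-block scales that need extra unpacking. That may be part of why the simpler format can run faster on some backends, though this sketch does not measure anything.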

Got it. I tried an optimisation on output.weight, not yet evaluated for quality, which makes a small change in TG (token generation) speed. Is this something you've tried already?

https://huggingface.co/hugypufy/Test_Qwen3.6-35B-A3B-MXFP4_MOE-GGUF

hmm no I haven't tried that yet, thanks for testing! seems to make a difference. On what card did you test it to get these numbers?

Underpowered (210W each) dual AMD R9700, with llama.cpp using Vulkan on Mesa 26.0.5.

Owner

I don't know how much quality will be affected by getting the output weight down to FP4. Maybe we could try something like Q6?
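
To put the FP4 vs Q6 vs Q8_0 choice for the output weight in numbers, here is a small Python sketch that estimates the tensor's size under each format. The bytes-per-block figures follow ggml's block layouts; `N_VOCAB` and `N_EMBD` are hypothetical placeholder dimensions, not values read from this model, so substitute the real ones from the GGUF metadata.

```python
# (bytes per block, weights per block) for each quantization format.
BLOCK_LAYOUT = {
    "Q8_0": (34, 32),     # fp16 scale + 32 int8 values
    "Q6_K": (210, 256),   # 256-weight super-block
    "Q5_K": (176, 256),   # 256-weight super-block
    "MXFP4": (17, 32),    # 1-byte shared exponent + 32 4-bit values
}


def bits_per_weight(qtype):
    """Average storage cost per weight, in bits."""
    nbytes, nweights = BLOCK_LAYOUT[qtype]
    return nbytes * 8 / nweights


def tensor_bytes(n_rows, n_cols, qtype):
    """Size of an n_rows x n_cols tensor quantized row by row."""
    nbytes, nweights = BLOCK_LAYOUT[qtype]
    if n_cols % nweights:
        raise ValueError(f"row length must be a multiple of {nweights}")
    return n_rows * (n_cols // nweights) * nbytes


# Hypothetical dimensions, for illustration only.
N_VOCAB = 150_000
N_EMBD = 2048

if __name__ == "__main__":
    for qtype in BLOCK_LAYOUT:
        size_mib = tensor_bytes(N_VOCAB, N_EMBD, qtype) / 2**20
        print(f"{qtype:6s} {bits_per_weight(qtype):.4f} bpw  {size_mib:8.1f} MiB")
```

Since the output projection is read in full for every generated token, a smaller output.weight likely means less memory traffic per token, which would be consistent with the small TG change reported above. Q6 sits roughly midway between MXFP4 and Q8_0 in size.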

Possibly yes, let me give that a shot.

Will upload soon, once I get the hang of how to test for quality.
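
One common proxy for quant quality is perplexity on a fixed text, compared against a higher-precision baseline; llama.cpp ships a `llama-perplexity` tool for this. As a minimal sketch of the underlying arithmetic, assuming you already have per-token log-probabilities (natural log) from each model on the same text, the helpers below compute perplexity and the relative change. The function names are made up for this example.

```python
import math


def perplexity(token_logprobs):
    """Perplexity = exp of the mean negative log-likelihood per token."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def relative_ppl_change(baseline_logprobs, candidate_logprobs):
    """Fractional perplexity increase of the candidate quant over the baseline."""
    base = perplexity(baseline_logprobs)
    cand = perplexity(candidate_logprobs)
    return (cand - base) / base
```

A lower perplexity is better, and a small relative change against the Q8_0 version would suggest the lower-precision output weight costs little. Both runs need the same text and context settings for the comparison to mean anything.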
