This is a proof-of-concept, work-in-progress quantization of Qwen3.6-35B-A3B into MXFP6.
It was quantized with my still experimental advanced-gguf-quantizer tool.
This GGUF will ONLY work with llama.cpp, and only with a build that includes MXFP6 support. The CPU-only PR is posted here:

https://github.com/ggml-org/llama.cpp/pull/22671

The PR build runs very slowly because it is the initial implementation, without GPU support.

You may preview the very fast POC CUDA version from my fork:

https://github.com/michaelw9999/llama.cpp/tree/mxfp6-cuda

To merge into your existing llama.cpp installation:

git remote add mxfp6 https://github.com/michaelw9999/llama.cpp
git fetch mxfp6
git merge mxfp6/mxfp6-cuda
cmake -B build -DGGML_CUDA=ON
cmake --build build -j

Or to install fresh:

git clone -b mxfp6-cuda https://github.com/michaelw9999/llama.cpp
cd llama.cpp
cmake -B build -DGGML_CUDA=ON
cmake --build build -j

NOTICE:

This is my own work and is experimental and unofficial.

The CUDA version is not part of any llama.cpp PR (yet). This is not associated with NVIDIA in any way.

Very likely, any future MXFP6 design will not be compatible with this implementation.

For Qwen3.6-35B, MXFP6 is almost as fast as NVFP4 on prefill.

With FP8 activations, it is faster than NVFP4 on token generation.
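
To put rough numbers on the two comparisons above, here is a minimal sketch that divides the MXFP6 throughput by the NVFP4 throughput, using the llama-bench figures reported in the tables in this card. The variable names are only for illustration, and the result reflects this one RTX 5090 run, not a general claim.

```shell
# Throughput figures (tokens/s) copied from the llama-bench tables in this card.
mxfp6_pp=8094.43; nvfp4_pp=8220.18   # pp512 (prefill)
mxfp6_tg=188.10;  nvfp4_tg=159.53    # tg128 (token generation)

# Ratio of MXFP6 to NVFP4 throughput; values above 1 mean MXFP6 is faster.
pp_ratio=$(awk -v a="$mxfp6_pp" -v b="$nvfp4_pp" 'BEGIN { printf "%.3f", a / b }')
tg_ratio=$(awk -v a="$mxfp6_tg" -v b="$nvfp4_tg" 'BEGIN { printf "%.3f", a / b }')

echo "prefill (pp512): MXFP6/NVFP4 = $pp_ratio"
echo "token generation (tg128): MXFP6/NVFP4 = $tg_ratio"
```

On these figures, MXFP6 prefill is about 1.5% slower than NVFP4, and token generation is about 18% faster.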

Feedback is both requested and encouraged so I can make further improvements in future llama.cpp PRs.

The NVFP4/MXFP6 quantizer is still being improved and will be posted in the future. Please let me know if you want to see a specific model turned into MXFP6.

I will create an MTP-enabled version soon.

MXFP6: Final estimate: PPL = 6.7890 +/- 0.04420

  Device 0: NVIDIA GeForce RTX 5090, compute capability 12.0, VMM: yes, VRAM: 32606 MiB
| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| qwen35moe 35B.A3B MXFP6 - E2M3 |  26.46 GiB |    34.66 B | CUDA       |  99 |           pp512 |      8094.43 ± 49.53 |
| qwen35moe 35B.A3B MXFP6 - E2M3 |  26.46 GiB |    34.66 B | CUDA       |  99 |           tg128 |        188.10 ± 3.20 |
  Device 0: NVIDIA GeForce RTX 5090, compute capability 12.0, VMM: yes, VRAM: 32606 MiB
| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| qwen35moe 35B.A3B NVFP4        |  21.48 GiB |    34.66 B | CUDA       |  99 |           pp512 |      8220.18 ± 57.89 |
| qwen35moe 35B.A3B NVFP4        |  21.48 GiB |    34.66 B | CUDA       |  99 |           tg128 |        159.53 ± 0.82 |
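
If you want to compare against these numbers on your own hardware, the following is a sketch of a llama-bench run that should produce the same pp512 and tg128 rows. It assumes you built the fork as described above, so the binary is at `./build/bin/llama-bench`. The model filename is a placeholder, not the actual name of the file in this repository, so point `MODEL` at wherever you saved the GGUF.

```shell
# Hypothetical path: set MODEL to the GGUF file you downloaded.
MODEL="${MODEL:-./Qwen3.6-35B-A3B-MXFP6.gguf}"
# Default binary location for a CMake build of llama.cpp (assumed, override if needed).
BENCH="${BENCH:-./build/bin/llama-bench}"

if [ -x "$BENCH" ] && [ -f "$MODEL" ]; then
    # -ngl 99 offloads all layers to the GPU;
    # -p 512 and -n 128 correspond to the pp512 and tg128 tests.
    "$BENCH" -m "$MODEL" -ngl 99 -p 512 -n 128
    status="ran"
else
    echo "Build llama.cpp first and set MODEL to the GGUF file." >&2
    status="skipped"
fi
```

Results will vary with GPU, driver, and build options, so treat the tables above as a reference point from one RTX 5090 rather than an expected result.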