MTP Support

#13
by anikifoss - opened

As broadcasted in LocalLLaMA, there is now MTP support in llama.cpp!

Any chance we can get MTP layers for Kimi-K2.6 added to the first GGUF?

srv    load_model: creating MTP draft context against the target model '/models/ubergarm/Kimi-K2.6-Q4_X-GGUF/Kimi-K2.6-Q4_X-00001-of-00014.gguf'
llama_init_from_model: context type MTP requested but model doesn't contain MTP layers
srv    load_model: failed to create MTP context

@anikifoss

MTP is going gangbusters lately on all the models!

I'm not 100% sure how all the inference engines are doing it, e.g.

  1. Do I have to re-convert the entire model, preserving the MTP tensors, with the latest mainline llama.cpp convert_hf_to_gguf.py and then quantize again?
  2. Can I just extract the MTP layer/tensors into a separate GGUF and pass it to llama-server at runtime? (ik and mainline might handle this differently now; it's been moving fast and is hard to keep up with.)
  3. Might be able to do a modified version of #2, where you only need to download the small first metadata GGUF and the new MTP GGUF, and the new metadata points to the new MTP GGUF.
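
Whichever of these routes applies, a useful first step is checking whether an existing GGUF already carries MTP tensors. A minimal sketch, assuming the `gguf` Python package from the llama.cpp repo (`GGUFReader` and its `tensors` list) and assuming MTP tensors can be spotted by a `nextn` or `mtp` substring in their names. That naming is a guess and may differ per architecture:

```python
from typing import Iterable, List, Sequence

# Assumption: substrings that mark MTP tensors; adjust per model architecture.
MTP_MARKERS: Sequence[str] = ("nextn", "mtp")


def find_mtp_tensor_names(names: Iterable[str],
                          markers: Sequence[str] = MTP_MARKERS) -> List[str]:
    """Return the tensor names that contain any of the marker substrings."""
    return [n for n in names if any(m in n.lower() for m in markers)]


def gguf_tensor_names(path: str) -> List[str]:
    """List tensor names in one GGUF file (requires `pip install gguf`)."""
    from gguf import GGUFReader  # imported lazily so the filter works without it
    return [t.name for t in GGUFReader(path).tensors]


if __name__ == "__main__":
    import sys
    if len(sys.argv) > 1:
        hits = find_mtp_tensor_names(gguf_tensor_names(sys.argv[1]))
        print(f"{len(hits)} candidate MTP tensors")
        for name in hits:
            print(" ", name)
```

For a split model, each shard is its own GGUF file, so the check would need to run over every shard rather than just the first one.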

Unfortunately, the big rigs of Wendell's are down for some maintenance right now, but I'll keep my eye on this.

I see @tarruda is going back trying to add MTP to big Qwen3.5 and has some notes here:

Apparently there's a new --mtp flag on convert_hf_to_gguf.py to create a new GGUF with only the MTP model, avoiding having to recreate the original GGUF...
https://huggingface.co/AesSedai/Qwen3.5-397B-A17B-GGUF/discussions/10#6a091ab9a80e0113bb2e0868

EDIT: Also keep an eye on this new ik PR 1821, which may add -sm graph for MLA models like Kimi 🤞

I have the same questions, and I've assessed that it will probably be faster for me to just redo everything from scratch than offer both options and add the missing tensors.

Kimi

Planned retest on a 16×24GB rig once it's back online

Lol. I take it there's no benefit if half the model is running on the CPU?

can I just extract the MTP layer/tensors into a separate GGUF and pass it in to llama-server at runtime

I like this idea, like mmproj.

I generated an MTP version of Qwen 3.5 397B, but IMO it is not worth it if you are memory constrained: with the MTP layers quantized to Q8_0, the weights grew by about 6G. Also, the extra RAM was being used even with MTP disabled, though this could be a temporary llama.cpp issue.

I generated an MTP version of Qwen 3.5 397B, but IMO it is not worth it if you are memory constrained: with the MTP layers quantized to Q8_0, the weights grew by about 6G. Also, the extra RAM was being used even with MTP disabled, though this could be a temporary llama.cpp issue.

Isn't there an option in llama to disable these tensors and not have them loaded in any way? I was under the impression people working on the MTP implementation already thought of this, as this is in my view the most sensible thing to do and also quite straightforward to implement.

@ubergarm, these options are available in convert_hf_to_gguf.py:

https://github.com/ggml-org/llama.cpp/pull/22673/changes#diff-ec77d8003b92ff283179456d36b8b56abf635e7b1232e70daf16676e8920ccf1R120-R126

    parser.add_argument(
        "--mtp", action="store_true",
        help="(Experimental) Export only the multi-token prediction (MTP) head as a separate GGUF, suitable for use as a speculative draft. Output file name will get a '-MTP' suffix.",
    )
    parser.add_argument(
        "--no-mtp", action="store_true",
        help="(Experimental) Exclude the multi-token prediction (MTP) head from the converted GGUF. Pair with --mtp on a second run to publish trunk and MTP as two files. Note: the split form duplicates embeddings, so the bundled default is more space-efficient overall.",
    )
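
A rough sketch of the two-run workflow those help strings describe, written as a small Python helper that only builds the command lines. `--mtp` and `--no-mtp` come from the PR above; the positional model directory, `--outtype`, and `--outfile` are existing convert_hf_to_gguf.py options. The paths and the helper name are placeholders, and the MTP run leaves the output name to the script because how `--outfile` interacts with the '-MTP' suffix is not verified here:

```python
import shlex
from typing import List, Optional


def convert_command(model_dir: str,
                    mtp_flag: Optional[str] = None,
                    outfile: Optional[str] = None,
                    outtype: str = "bf16") -> List[str]:
    """Build an argv list for convert_hf_to_gguf.py (hypothetical helper)."""
    if mtp_flag not in (None, "--mtp", "--no-mtp"):
        raise ValueError(f"unexpected flag: {mtp_flag}")
    argv = ["python", "convert_hf_to_gguf.py", model_dir, "--outtype", outtype]
    if outfile:
        argv += ["--outfile", outfile]
    if mtp_flag:
        argv.append(mtp_flag)
    return argv


# Placeholder paths. Run 1 writes the trunk without MTP, run 2 the MTP head only.
trunk_cmd = convert_command("/models/hf/model", "--no-mtp",
                            outfile="/models/gguf/model-nomtp-bf16.gguf")
mtp_cmd = convert_command("/models/hf/model", "--mtp")

for cmd in (trunk_cmd, mtp_cmd):
    print(shlex.join(cmd))  # hand cmd to subprocess.run(cmd, check=True) to execute
```

Both outputs would still need to go through quantization afterwards, the same as any other converted GGUF.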

@tarruda I don't think having mtp embedded in the model increases vram usage.
I tested this by quantizing Qwen3.6-9B to Q8 three times and loading in ik_llama (dropping caches/compacting memory before each run):

Summary Table

| Setup | System RAM (Before) | System RAM (After) | Pinned Memory Log | VRAM (GPU) Used |
|---|---|---|---|---|
| 1. No MTP in GGUF | 4.9 GiB | 23 GiB | 8.05 GiB | 1807 MiB |
| 2. MTP Embedded (Disabled) | 5.0 GiB | 23 GiB | 8.86 GiB → 8.05 GiB | 1825 MiB |
| 3. MTP Embedded (Enabled) | 5.0 GiB | 29 GiB | 9.10 GiB → 9.05 GiB | 3375 MiB |
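
To make the comparison explicit, here is a small sketch that computes each setup's difference from the no-MTP baseline. The numbers are copied from the table; the dictionary keys are just labels chosen for this example:

```python
# Figures copied from the table above (VRAM in MiB, system RAM after load in GiB).
runs = {
    "no_mtp": {"vram_mib": 1807, "ram_after_gib": 23},
    "mtp_disabled": {"vram_mib": 1825, "ram_after_gib": 23},
    "mtp_enabled": {"vram_mib": 3375, "ram_after_gib": 29},
}

baseline = runs["no_mtp"]
deltas = {
    name: {
        "vram_mib": r["vram_mib"] - baseline["vram_mib"],
        "ram_after_gib": r["ram_after_gib"] - baseline["ram_after_gib"],
    }
    for name, r in runs.items()
}

for name, d in deltas.items():
    print(f"{name}: VRAM {d['vram_mib']:+d} MiB, RAM {d['ram_after_gib']:+d} GiB")
```

With MTP embedded but disabled, the difference is 18 MiB of VRAM and no change in system RAM after load; enabling it adds 1568 MiB of VRAM and 6 GiB of RAM in this test.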

Which is a shame, got my hopes up thinking I could run a bigger Mimo-2.5-Pro or Kimi-K2.6 by stripping out the mtp.

Also looks like the storage cost is higher if you want MTP but include it separately:

 9.2G May 21 15:42 qwen3.5-9b-default.q8.gguf
 2.3G May 21 15:37 qwen3.5-9b-mtp.q8.gguf
 8.9G May 21 15:38 qwen3.5-9b-nomtp.q8.gguf
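
The storage trade-off in that listing works out as follows. This is plain arithmetic on the three sizes shown; the variable names are only for illustration:

```python
# Sizes in G as printed in the listing above.
bundled = 9.2     # qwen3.5-9b-default.q8.gguf (trunk + MTP in one file)
mtp_only = 2.3    # qwen3.5-9b-mtp.q8.gguf
trunk_only = 8.9  # qwen3.5-9b-nomtp.q8.gguf

split_total = round(trunk_only + mtp_only, 1)       # both split files together
overhead = round(split_total - bundled, 1)          # extra cost of splitting
saved_without_mtp = round(bundled - trunk_only, 1)  # saved by dropping MTP

print(f"split total: {split_total}G, overhead vs bundled: {overhead}G")
print(f"dropping MTP entirely saves: {saved_without_mtp}G")
```

So the split pair takes 11.2G against 9.2G bundled, about 2.0G more, while stripping MTP from the bundled file only saves about 0.3G. That gap is consistent with the `--no-mtp` help text's note that the split form duplicates embeddings.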

Personally prefer it be excluded like mmproj. For some models I have multiple quants sitting on the SSD.
I'll load a smaller one if I want speed over quality, a larger one to train control-vectors, etc. So I'd rather just point at the same draft model.
Though lately I've been doing 1 tensor per file splits and managing symlinks / pulling in specific tensors from @Thireus repos (and building ik_llama with -DGGML_MAX_CONTEXTS=2048)

https://nobodywho.ooo/posts/whats-in-a-gguf/

The projection model is often ~1GB in size - enough of an overhead that we definitely want to skip it when it's not used. But I think it's reasonable to provide two variants of the GGUF: one with projection weights, and one without. That could get us back to the situation of managing just one url to download, just one file to cache on disk, etc.

The general "vibe" is to offer both for mmproj, so likely the same with MTP?

I just hope llama.cpp and ik_llama.cpp are both compatible with whatever the community adopts so I don't have to do https://huggingface.co/gghfez/MiMo-V2.5-Pro-unfused-test again -_-!

I am so waiting for MTP on the big models. MiMo v2.5 Pro and non-Pro have just won me over so much. The trick is to keep min-p high, though, and cap thinking.
