Does not work with latest llama.cpp

#1
by pcomte - opened
0.00.260.184 E llama_model_load: error loading model: missing tensor 'blk.40.attn_norm.weight'
0.00.260.310 E llama_model_load_from_file_impl: failed to load model
0.00.260.357 E common_fit_params: encountered an error while trying to fit params to free device memory: failed to load model
0.00.448.882 E llama_model_load: error loading model: missing tensor 'blk.40.attn_norm.weight'
0.00.448.885 E llama_model_load_from_file_impl: failed to load model
0.00.448.886 E common_init_from_params: failed to load model 'qwen-agentworld-35b-a3b-q4_k_m.gguf'
0.00.448.888 E srv    load_model: failed to load model, 'qwen-agentworld-35b-a3b-q4_k_m.gguf'
0.00.448.889 I srv    operator(): operator(): cleaning up before exit...
0.00.449.032 E srv  llama_server: exiting due to model loading error
llama-server --version
version: 9770 (75ad0b23e)
built with AppleClang 21.0.0.21000099 for Darwin arm64

The GGUF was converted incorrectly. We'll have to wait until somebody posts a properly converted one rather than an autoconvert.

Title: llama.cpp crash: missing tensor 'blk.40.attn_norm.weight' (Incorrect GGUF Metadata)

Description:
Hi, thanks for uploading this quant!

I ran into an issue where loading this GGUF in llama.cpp (and standard backends like Ollama/LM Studio) causes a crash on load:

E llama_model_load: error loading model: missing tensor 'blk.40.attn_norm.weight'

The Cause

The quantization script seems to have exported the incorrect metadata for the layer count. The model metadata claims qwen35moe.block_count: 41 and qwen35moe.nextn_predict_layers: 1 (presumably due to the MTP layers). However, the actual tensors for layer 41 (blk.40) and the MTP projection (blk.39.nextn.eh_proj.weight) were not included in the GGUF file. When llama.cpp tries to load them, it hits a missing tensor error.
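The mismatch described above can be checked mechanically. Below is a minimal Python sketch, assuming you have already extracted the declared block count and the list of tensor names from the file with whatever GGUF inspection tool you prefer. The function name `find_missing_blocks` and the shortened tensor list are hypothetical and only illustrate the idea; they are not read from the actual file.

```python
import re


def find_missing_blocks(block_count, tensor_names):
    """Return block indices in [0, block_count) that have no 'blk.<i>.' tensor."""
    pattern = re.compile(r"^blk\.(\d+)\.")
    present = set()
    for name in tensor_names:
        match = pattern.match(name)
        if match:
            present.add(int(match.group(1)))
    return [i for i in range(block_count) if i not in present]


# Hypothetical, heavily shortened tensor list: blocks 0..39 exist, blk.40 does not.
names = [f"blk.{i}.attn_norm.weight" for i in range(40)] + ["output_norm.weight"]

# The metadata declares 41 blocks, so any index with no tensors is reported.
missing = find_missing_blocks(41, names)
print(missing)
```

With a declared block count of 41 and tensors only for blocks 0 through 39, the sketch reports block 40 as missing, which matches the tensor named in the loader error. With a declared count of 40 it reports nothing.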

Temporary Workaround for Users

For anyone else running into this, you can bypass the crash by forcing llama.cpp to ignore the broken metadata and expect the standard 40 blocks with no MTP layers using the --override-kv flag:

llama-server -m qwen-agentworld-35b-a3b-q4_k_m.gguf --override-kv "qwen35moe.block_count=int:40,qwen35moe.nextn_predict_layers=int:0"
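If you launch the server from a script, the override string can be assembled from a dictionary instead of being typed by hand. This is a minimal sketch: the helper name `build_override_kv` is hypothetical, and it assumes the `KEY=TYPE:VALUE` form with comma-separated entries exactly as used in the command above, so check that your build accepts that form.

```python
def build_override_kv(overrides):
    """Format a dict as a comma-separated KEY=TYPE:VALUE string (assumed syntax)."""
    parts = []
    for key, value in overrides.items():
        # bool must be tested before int because bool is a subclass of int.
        if isinstance(value, bool):
            parts.append(f"{key}=bool:{'true' if value else 'false'}")
        elif isinstance(value, int):
            parts.append(f"{key}=int:{value}")
        elif isinstance(value, float):
            parts.append(f"{key}=float:{value}")
        else:
            parts.append(f"{key}=str:{value}")
    return ",".join(parts)


# The two keys from the workaround above: 40 blocks, no MTP layers.
arg = build_override_kv({
    "qwen35moe.block_count": 40,
    "qwen35moe.nextn_predict_layers": 0,
})
print(arg)
```

The resulting string is meant to be passed as the single value of `--override-kv`, quoted as in the command above.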

To the uploader: A re-quantization with the latest convert_hf_to_gguf.py script that either correctly exports the MTP tensors, or correctly strips them out and sets the metadata to 40 blocks, should permanently fix this for all users.
