The released InternScience/Agents-A1 checkpoint is a 40-layer Qwen3.5-35B-A3B MoE without MTP (Multi-Token Prediction) layers. To enable MTP acceleration in llama.cpp (which speeds up long-context generation by 10-30%), we extract the single MTP layer from Qwen3.5-35B-A3B and inject it into Agents-A1's safetensors before GGUF conversion.
Step 1: Extract MTP tensors from Qwen3.5-35B-A3B
# Source: J:\Models\Qwen3.5-35B-A3B-MTP (Qwen3.5-35B-A3B + native MTP)
from safetensors import safe_open
import json, os
src = r"J:\Models\Qwen3.5-35B-A3B-MTP"
with open(os.path.join(src, "model.safetensors.index.json")) as f:
    idx = json.load(f)
# The index maps every tensor name to the shard file that stores it
mtp_keys = [k for k in idx["weight_map"] if "mtp" in k.lower()]
print(f"Found {len(mtp_keys)} MTP tensors")  # 785
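Once the MTP keys are known, it can help to group them by the shard that holds them, so each source file is opened only once during extraction. The following is a minimal sketch of that idea, not the script used for this release; the helper name `group_mtp_keys_by_shard` and the tensor names in the toy index are made up for illustration.

```python
from collections import defaultdict


def group_mtp_keys_by_shard(weight_map, needle="mtp"):
    """Map each shard filename to the sorted MTP tensor names it holds."""
    by_shard = defaultdict(list)
    for key, shard in weight_map.items():
        # Same case-insensitive substring filter as in Step 1
        if needle in key.lower():
            by_shard[shard].append(key)
    return {shard: sorted(keys) for shard, keys in by_shard.items()}


# Toy index with hypothetical tensor names, only to show the result shape.
toy_weight_map = {
    "model.layers.0.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
    "mtp.fc.weight": "model-00002-of-00002.safetensors",
    "mtp.layers.0.input_layernorm.weight": "model-00002-of-00002.safetensors",
}
grouped = group_mtp_keys_by_shard(toy_weight_map)
for shard, keys in grouped.items():
    print(shard, len(keys))
```

With the real index, passing `idx["weight_map"]` instead of the toy dictionary would show which source shards need to be read.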
Step 2: Add as a new safetensors shard (N+1)
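# Save the 785 MTP tensors as a new shard next to the Agents-A1 shards
from safetensors.torch import save_file
dst = r"J:\Models\Agents-A1"
new_shard = "model.safetensors-15-of-15.safetensors"
def get_tensor(key):
    # Read one tensor from whichever source shard the index maps it to
    with safe_open(os.path.join(src, idx["weight_map"][key]), framework="pt") as f:
        return f.get_tensor(key)
save_file({k: get_tensor(k) for k in mtp_keys}, os.path.join(dst, new_shard), metadata={"format": "pt"})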
# Update model.safetensors.index.json:
# - metadata.total_size += new_shard_size
# - weight_map: append new_shard path for each MTP key
# - DO NOT modify existing 14 shards (avoid touching original data)
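The index update described in the comments above could be sketched as follows. This is an illustrative helper, not the exact script used here: the function name `add_shard_to_index`, the toy index, and the byte count are assumptions. It returns a modified copy and refuses to overwrite keys that already exist, which matches the rule of leaving the original 14 shards untouched.

```python
import copy


def add_shard_to_index(index, new_shard, keys, added_bytes):
    """Return a copy of the index with `keys` mapped to `new_shard`."""
    updated = copy.deepcopy(index)
    # Existing entries must never be redirected to the new shard
    clashes = [k for k in keys if k in updated["weight_map"]]
    if clashes:
        raise ValueError(f"{len(clashes)} keys already present, e.g. {clashes[0]}")
    for k in keys:
        updated["weight_map"][k] = new_shard
    meta = updated.setdefault("metadata", {})
    meta["total_size"] = meta.get("total_size", 0) + added_bytes
    return updated


# Toy index with hypothetical names and sizes, only to show the update.
toy_index = {
    "metadata": {"total_size": 1000},
    "weight_map": {
        "model.layers.0.mlp.gate.weight": "model-00001-of-00014.safetensors"
    },
}
new_index = add_shard_to_index(
    toy_index, "model.safetensors-15-of-15.safetensors", ["mtp.fc.weight"], 24
)
print(new_index["metadata"]["total_size"])
```

In practice the result would be written back to `model.safetensors.index.json` with `json.dump`, after keeping a backup of the original index.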
Step 3: Convert HF to BF16 GGUF with master llama.cpp
python F:\llama.cpp\llama.cpp-master\convert_hf_to_gguf.py ^
J:\Models\Agents-A1 ^
--outfile J:\Models\Agents-A1-MTP-GGUF\Agents-A1-MTP-BF16.gguf ^
--outtype bf16
# The master version handles Qwen3.5MoE with MTP automatically:
# - Normal layers: blk.0-39
# - MTP layer: blk.40.nextn.* (785 tensors)
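A quick sanity check after conversion is to split the GGUF tensor names into main-layer and MTP-layer groups and confirm that the MTP group is not empty. The sketch below works on a plain list of names; the helper `split_by_block` and the toy names are illustrative assumptions, and the threshold of 40 comes from the layer layout described above.

```python
import re

# Matches the block index in names such as "blk.12.attn_norm.weight"
BLOCK_RE = re.compile(r"^blk\.(\d+)\.")


def split_by_block(names, n_main_layers=40):
    """Split tensor names into (main layers, MTP layer, non-block tensors)."""
    main, mtp, other = [], [], []
    for name in names:
        m = BLOCK_RE.match(name)
        if m is None:
            other.append(name)
        elif int(m.group(1)) < n_main_layers:
            main.append(name)
        else:
            mtp.append(name)
    return main, mtp, other


# Toy tensor names, only to show how the split behaves.
toy_names = [
    "token_embd.weight",
    "blk.0.attn_norm.weight",
    "blk.39.ffn_norm.weight",
    "blk.40.nextn.eh_proj.weight",
]
main, mtp, other = split_by_block(toy_names)
print(len(main), len(mtp), len(other))
```

For a real file, the names could be read with the `gguf` Python package, for example `[t.name for t in GGUFReader(path).tensors]`, assuming that package is installed; the MTP group should then contain the 785 tensors noted above.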
Step 4: Quantize with APEX (Q4_K_M default, MTP at Q8_0)
F:\llama.cpp\...\llama-quantize.exe ^
--imatrix J:\Models\Qwen3.5-35B-A3B.imatrix.gguf ^
--tensor-type-file E:\apex-quant\configs\qwen36_35b_mtp_<tier>.txt ^
J:\Models\Agents-A1-MTP-GGUF\Agents-A1-MTP-BF16.gguf ^
J:\Models\Agents-A1-MTP-GGUF\Agents-A1-MTP-APEX-I-<tier>.gguf ^
Q4_K_M
# APEX qwen36_35b_mtp_*.txt configs include blk.40 overrides
# (Q8_0 for MTP across all tiers), so no manual patching is needed.