APEX MTP Vision Apache-2.0

Agents-A1-MTP-APEX

English | 📖 中文文档

35B agentic MoE that reaches trillion-parameter performance · APEX-quantized GGUFs + BF16 + mmproj

🤖 About Agents-A1

Agents-A1 is a 35B-parameter Mixture-of-Experts agentic model from InternScience, post-trained on top of Qwen3.5-35B-A3B via a three-stage paradigm: full-domain SFT → domain-level teacher training → multi-teacher multi-domain on-policy distillation.

Despite operating in the ~35B model class, Agents-A1 delivers highly competitive performance against frontier-scale systems such as GPT-5.5, DeepSeek-V4-pro, and Kimi-K2.6 — achieving SOTA on Seal-0 (56.4), HiPhO (46.4), FrontierScience-Olympiad (79.0), IFBench (80.6), IFEval (94.8), and best-among-comparable on BrowseComp (75.5), XBench-DS-2510 (86.0), GAIA (96.0), SciCode (44.3), HLE (47.6), and MolBench-bind (56.8).

This GGUF package includes the mmproj-F16.gguf vision projector for multimodal (image + text) capabilities with llama.cpp. MTP layers are extracted from Qwen3.5-35B-A3B and injected into Agents-A1's safetensors (see MTP Extraction & Injection section). License: Apache-2.0.

🧠 Model Details
ArchitectureQwen3.5 MoE (Mixture of Experts)
Parameters35B total, 3B active per token
Experts256 routed experts, 8 active per token
Layers40 transformer layers + 1 MTP layer
Context262,144 tokens
MTP SourceQwen3.5-35B-A3B (1 layer, 785 tensors, injected)
Block Count41 (blk.0–39 + blk.40 MTP)
LicenseApache-2.0
🔧 MTP Extraction & Injection

The released InternScience/Agents-A1 checkpoint is a 40-layer Qwen3.5-35B-A3B MoE without MTP (Multi-Token Prediction) layers. To enable MTP acceleration in llama.cpp (which speeds up long-context generation by 10–30%), we extract the 1 MTP layer from Qwen3.5-35B-A3B and inject it into Agents-A1's safetensors before GGUF conversion.

Step 1 — Extract MTP tensors from Qwen3.5-35B-A3B

Source: J:\Models\Qwen3.5-35B-A3B-MTP (Qwen3.5-35B-A3B + native MTP)
from safetensors import safe_open
import json, os
·
src = r"J:\Models\Qwen3.5-35B-A3B-MTP"
with open(os.path.join(src, "model.safetensors.index.json")) as f:
    idx = json.load(f)
mtp_keys = [k for k in idx["weight_map"] if "mtp" in k.lower()]
print(f"Found {len(mtp_keys)} MTP tensors")  # 785

Step 2 — Add as a new safetensors shard (N+1)

Save 785 MTP tensors as a new shard
new_shard = "model.safetensors-15-of-15.safetensors"
save_file({k: get_tensor(k) for k in mtp_keys}, new_shard)
·
Update model.safetensors.index.json:
  · metadata.total_size += new_shard_size
  · weight_map: append new_shard path for each MTP key
  · DO NOT modify existing 14 shards (avoid touching original data)

Step 3 — Convert HF → BF16 GGUF with master llama.cpp

F:\llama.cpp\llama.cpp-master\convert_hf_to_gguf.py ^
  J:\Models\Agents-A1 ^
  --outfile J:\Models\Agents-A1-MTP-GGUF\Agents-A1-MTP-BF16.gguf ^
  --outtype f16
·
Master version handles Qwen3.5MoE with MTP auto:
  · Normal layers: blk.0–39
  · MTP layer: blk.40.nextn.*  (785 tensors)

Step 4 — Quantize with APEX (Q4_K_M default, MTP at Q8_0)

F:\llama.cpp\...\llama-quantize.exe ^
  --imatrix J:\Models\Qwen3.5-35B-A3B.imatrix.gguf ^
  --tensor-type-file E:\apex-quant\configs\qwen36_35b_mtp_<tier>.txt ^
  J:\Models\Agents-A1-MTP-GGUF\Agents-A1-MTP-BF16.gguf ^
  J:\Models\Agents-A1-MTP-GGUF\Agents-A1-MTP-APEX-I-<tier>.gguf ^
  Q4_K_M
·
APEX qwen36_35b_mtp_*.txt configs include blk.40 overrides
(Q8_0 for MTP across all tiers) — no manual patching needed.
📊 BenchLocal Results (APEX-I-Compact, 16.14 GB)
ModeToolCall-15BugFind-15HermesAgent-20MaxEff.
Thinking100888791.271.2
No Thinking971008593.157.1

RTX 5070 Ti 16GB + 128GB RAM · No-thinking mode achieves higher ceiling (BugFind +12) but suffers more retries on complex agent scenarios.

🚀 Usage

llama.cpp (text only)

hf download SC117/Agents-A1-MTP-APEX-GGUF --include "*.gguf" --local-dir ./models
./llama-server -m ./models/Agents-A1-MTP-APEX-I-Compact.gguf -ngl 99 -c 131072

llama.cpp (vision + text)

./llama-server -m ./models/Agents-A1-MTP-APEX-I-Compact.gguf --mmproj ./models/mmproj-F16.gguf -ngl 99 -c 131072

vLLM

vllm serve SC117/Agents-A1-MTP-APEX-GGUF --port 8000 --tensor-parallel-size 1 --max-model-len 262144 --reasoning-parser qwen3
·
Tool-call variant
vllm serve SC117/Agents-A1-MTP-APEX-GGUF --port 8000 --tensor-parallel-size 1 --max-model-len 262144 --reasoning-parser qwen3 --enable-auto-tool-choice --tool-call-parser qwen3_coder

SGLang

python3 -m sglang.launch_server --model-path "SC117/Agents-A1-MTP-APEX-GGUF" --host 0.0.0.0 --port 30000
🎛️ Recommended Sampling Parameters

From the official Agents-A1 model card:

temperature0.85
top_p0.95
top_k20
min_p0.0
presence_penalty1.1
repetition_penalty1.0
💡 What is APEX?

These GGUF files are quantized using APEX, an MoE-aware mixed-precision quantization technique. APEX classifies every tensor by its role — routed expert, shared expert, SSM, or attention — and applies a layer-wise precision gradient, giving sensitive edge layers (including the MTP layer) higher precision and compressing redundant middle layers more aggressively.

APEX beats Q8_0 perplexity at half the size — and even beats F16 in some cases.

The qwen36_35b_mtp_*.txt configs include overrides for blk.40 (the MTP layer), preserving it at Q8_0 across all four I- tiers. The same Qwen3.5-35B-A3B.imatrix.gguf is reused (same architecture, compatible MoE expert layout).

📦 APEX Quantization Tiers
FileSizeProfileBest For
*-APEX-I-Quality.gguf21.75 GBI-QualityHigh quality (Q6_K + iq4_xs attention)
*-APEX-I-Balanced.gguf24.21 GBI-BalancedBest all-rounder (Q6_K + Q5_K experts)
*-APEX-I-Compact.gguf16.14 GBI-CompactBest quality/size ratio (Q4_K default)
*-APEX-I-Mini.gguf13.36 GBI-MiniMost compact, fits in 16GB VRAM (Q3_K + iq2_s)

BF16 source: Agents-A1-MTP-BF16.gguf (66.19 GB). imatrix: Qwen3.5-35B-A3B.imatrix.gguf (reused from base model).

Links

Citation

@misc{bai2026scalinghorizonparametersreaching,
      title={Scaling the Horizon, Not the Parameters: Reaching Trillion-Parameter Performance with a 35B Agent},
      author={Lei Bai and Zongsheng Cao and Yang Chen and Zhiyao Cui and Shangheng Du and Yue Fan and Shiyang Feng and Zijie Guo and Haonan He and Liang He and Xiaohan He and Shuyue Hu and Yusong Hu and Songtao Huang and Yichen Jiang and Hao Li and Xin Li and Dahua Lin and Weihao Lin and Fenghua Ling and Dongrui Liu and Zhuo Liu and Runmin Ma and Chunjiang Mu and others},
      year={2026},
      eprint={2606.30616},
      archivePrefix={arXiv}
}
Downloads last month
-
GGUF
Model size
36B params
Architecture
qwen35moe
Hardware compatibility
Log In to add your hardware

16-bit

Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for SC117/Agents-A1-MTP-APEX-GGUF

Quantized
(47)
this model

Paper for SC117/Agents-A1-MTP-APEX-GGUF