🤖 About Agents-A1

Agents-A1 is a 35B-parameter Mixture-of-Experts agentic model from InternScience, post-trained on top of Qwen3.5-35B-A3B via a three-stage paradigm: full-domain SFT → domain-level teacher training → multi-teacher multi-domain on-policy distillation.

Despite operating in the ~35B model class, Agents-A1 delivers highly competitive performance against frontier-scale systems such as GPT-5.5, DeepSeek-V4-pro, and Kimi-K2.6 — achieving SOTA on Seal-0 (56.4), HiPhO (46.4), FrontierScience-Olympiad (79.0), IFBench (80.6), IFEval (94.8), and best-among-comparable on BrowseComp (75.5), XBench-DS-2510 (86.0), GAIA (96.0), SciCode (44.3), HLE (47.6), and MolBench-bind (56.8).

This GGUF package includes the mmproj-F16.gguf vision projector for multimodal (image + text) capabilities with llama.cpp. MTP layers are extracted from Qwen3.5-35B-A3B and injected into Agents-A1's safetensors (see MTP Extraction & Injection section). License: Apache-2.0.

🧠 Model Details

Architecture	Qwen3.5 MoE (Mixture of Experts)
Parameters	35B total, 3B active per token
Experts	256 routed experts, 8 active per token
Layers	40 transformer layers + 1 MTP layer
Context	262,144 tokens
MTP Source	Qwen3.5-35B-A3B (1 layer, 785 tensors, injected)
Block Count	41 (blk.0–39 + blk.40 MTP)
License	Apache-2.0

🔧 MTP Extraction & Injection

The released InternScience/Agents-A1 checkpoint is a 40-layer Qwen3.5-35B-A3B MoE without MTP (Multi-Token Prediction) layers. To enable MTP acceleration in llama.cpp (which speeds up long-context generation by 10–30%), we extract the 1 MTP layer from Qwen3.5-35B-A3B and inject it into Agents-A1's safetensors before GGUF conversion.

Step 1 — Extract MTP tensors from Qwen3.5-35B-A3B

# Source: J:\Models\Qwen3.5-35B-A3B-MTP (Qwen3.5-35B-A3B + native MTP) from safetensors import safe_open import json, os src = r"J:\Models\Qwen3.5-35B-A3B-MTP" with open(os.path.join(src, "model.safetensors.index.json")) as f: idx = json.load(f) mtp_keys = [k for k in idx["weight_map"] if "mtp" in k.lower()] print(f"Found {len(mtp_keys)} MTP tensors") # 785

Step 2 — Add as a new safetensors shard (N+1)

# Save 785 MTP tensors as a new shard new_shard = "model.safetensors-15-of-15.safetensors" save_file({k: get_tensor(k) for k in mtp_keys}, new_shard) # Update model.safetensors.index.json: # - metadata.total_size += new_shard_size # - weight_map: append new_shard path for each MTP key # - DO NOT modify existing 14 shards (avoid touching original data)

Step 3 — Convert HF → BF16 GGUF with master llama.cpp

F:\llama.cpp\llama.cpp-master\convert_hf_to_gguf.py ^ J:\Models\Agents-A1 ^ --outfile J:\Models\Agents-A1-MTP-GGUF\Agents-A1-MTP-BF16.gguf ^ --outtype f16 # Master version handles Qwen3.5MoE with MTP auto: # - Normal layers: blk.0–39 # - MTP layer: blk.40.nextn.* (785 tensors)

Step 4 — Quantize with APEX (Q4_K_M default, MTP at Q8_0)

F:\llama.cpp\...\llama-quantize.exe ^ --imatrix J:\Models\Qwen3.5-35B-A3B.imatrix.gguf ^ --tensor-type-file E:\apex-quant\configs\qwen36_35b_mtp_<tier>.txt ^ J:\Models\Agents-A1-MTP-GGUF\Agents-A1-MTP-BF16.gguf ^ J:\Models\Agents-A1-MTP-GGUF\Agents-A1-MTP-APEX-I-<tier>.gguf ^ Q4_K_M # APEX qwen36_35b_mtp_*.txt configs include blk.40 overrides # (Q8_0 for MTP across all tiers) — no manual patching needed.

📊 BenchLocal Results (APEX-I-Compact, 16.14 GB)

Mode	ToolCall-15	BugFind-15	HermesAgent-20	Max	Eff.
Thinking	100	88	87	91.2	71.2
No Thinking	97	100	85	93.1	57.1

RTX 5070 Ti 16GB + 128GB RAM · No-thinking mode achieves higher ceiling (BugFind +12) but suffers more retries on complex agent scenarios.

🚀 Usage

llama.cpp (text only)

hf download SC117/Agents-A1-MTP-APEX-GGUF --include "*.gguf" --local-dir ./models ./llama-server -m ./models/Agents-A1-MTP-APEX-I-Compact.gguf -ngl 99 -c 131072

llama.cpp (vision + text)

./llama-server -m ./models/Agents-A1-MTP-APEX-I-Compact.gguf --mmproj ./models/mmproj-F16.gguf -ngl 99 -c 131072

vLLM

vllm serve SC117/Agents-A1-MTP-APEX-GGUF --port 8000 --tensor-parallel-size 1 --max-model-len 262144 --reasoning-parser qwen3 # Tool-call variant vllm serve SC117/Agents-A1-MTP-APEX-GGUF --port 8000 --tensor-parallel-size 1 --max-model-len 262144 --reasoning-parser qwen3 --enable-auto-tool-choice --tool-call-parser qwen3_coder

SGLang

python3 -m sglang.launch_server --model-path "SC117/Agents-A1-MTP-APEX-GGUF" --host 0.0.0.0 --port 30000

🎛️ Recommended Sampling Parameters

From the official Agents-A1 model card:

temperature	0.85
top_p	0.95
top_k	20
min_p	0.0
presence_penalty	1.1
repetition_penalty	1.0

💡 What is APEX?

These GGUF files are quantized using APEX, an MoE-aware mixed-precision quantization technique. APEX classifies every tensor by its role — routed expert, shared expert, SSM, or attention — and applies a layer-wise precision gradient, giving sensitive edge layers (including the MTP layer) higher precision and compressing redundant middle layers more aggressively.

APEX beats Q8_0 perplexity at half the size — and even beats F16 in some cases.

The qwen36_35b_mtp_*.txt configs include overrides for blk.40 (the MTP layer), preserving it at Q8_0 across all four I- tiers. The same Qwen3.5-35B-A3B.imatrix.gguf is reused (same architecture, compatible MoE expert layout).

📦 APEX Quantization Tiers

File	Size	Profile	Best For
`*-APEX-I-Quality.gguf`	21.75 GB	I-Quality	High quality (Q6_K + iq4_xs attention)
`*-APEX-I-Balanced.gguf`	24.21 GB	I-Balanced	Best all-rounder (Q6_K + Q5_K experts)
`*-APEX-I-Compact.gguf`	16.14 GB	I-Compact	Best quality/size ratio (Q4_K default)
`*-APEX-I-Mini.gguf`	13.36 GB	I-Mini	Most compact, fits in 16GB VRAM (Q3_K + iq2_s)

BF16 source: Agents-A1-MTP-BF16.gguf (66.19 GB). imatrix: Qwen3.5-35B-A3B.imatrix.gguf (reused from base model).