LordNeel committed 2 days ago

Commit 798ad35 (verified) · 1 parent: 7d16500

docs: rewrite model card + regenerate metric charts

- replace broken dual-axis size/speed combo with clean throughput chart
- fix ppl-delta label collision; add MTP speedup/acceptance chart
- add model summary, text-only callout, download, prompt format, citation

Files changed (6)

README.md +160 -104
metrics/chart-kld-mean.png +0 -0
metrics/{chart-size-vs-generation.png → chart-mtp-speedup.png} +2 -2
metrics/chart-ppl-delta.png +0 -0
metrics/chart-quality-vs-size.png +0 -0
metrics/chart-throughput.png +3 -0

README.md CHANGED

@@ -1,7 +1,11 @@
 ---
 license: apache-2.0
 base_model:
 - InternScience/Agents-A1
 library_name: llama.cpp
 pipeline_tag: text-generation
 tags:
@@ -11,28 +15,49 @@ tags:
 - qwen3.5-moe
 - mixture-of-experts
 - agents-a1
 - nvfp4
 - mtp
 - speculative-decoding
 ---
-# Agents-A1 GGUF Quants
-High quality GGUF quantizations of [InternScience/Agents-A1](https://huggingface.co/InternScience/Agents-A1), a 35B Qwen3.5-MoE agent model.
-These files were produced from the BF16 Hugging Face checkpoint with a patched llama.cpp build that supports the `qwen35moe` architecture. The calibration pass used an importance matrix built from coding/instruction chat data, then each quant was benchmarked against the BF16 GGUF reference.
-## Recommended Files
-| Use case | File | Notes |
 |---|---|---|
-| Best small general-purpose quant | `agents-a1-IQ4_XS.gguf` | Strong quality for size, broad llama.cpp compatibility. |
-| Best single-user MTP throughput | `agents-a1-IQ4_XS-MTP-graft-headQ6.gguf` | IQ4_XS body with Q6_K MTP block; measured 1.22x over target-only in c1/128 chat serving. |
-| Highest MTP acceptance in this run | `agents-a1-Q4_K_M-MTP-graft-headQ6.gguf` with `SPEC_DRAFT_N_MAX=1` | 91.46% draft acceptance while still 1.15x over target-only. |
-| Fast Blackwell FP4 path | `agents-a1-NVFP4.gguf` | Tested on RTX PRO 6000 Blackwell. Requires runtime support for `GGML_TYPE_NVFP4`. |
-| Safer quality step up | `agents-a1-Q5_K_M.gguf` | Lower KLD than IQ4_XS with larger size. |
 | Closest to BF16 by KLD | `agents-a1-Q6_K.gguf` | Best KLD in this eval set. |
-| High precision archival quant | `agents-a1-Q8_0.gguf` | Largest quantized file. |
 ## Files
@@ -40,24 +65,106 @@ These files were produced from the BF16 Hugging Face checkpoint with a patched l
 |---|---:|---|
 | Q3_K_M | 16.76 GB | Smallest included quant. |
 | IQ4_XS | 18.73 GB | Recommended compact quant. |
-| IQ4_XS-MTP-graft-headQ6 | 19.42 GB | IQ4_XS body plus integrated Q6_K/F32 MTP block. |
-| NVFP4 | 19.72 GB | Blackwell-oriented FP4 GGUF, output head kept at Q6_K by quality rule. |
 | Q4_K_M | 21.17 GB | Standard K-quant. |
-| Q4_K_M-MTP-graft-headQ6 | 21.86 GB | Q4_K_M body plus integrated Q6_K/F32 MTP block. |
 | Q5_K_M | 24.73 GB | Strong quality/size tradeoff. |
 | Q6_K | 28.51 GB | Lowest mean KLD in this run. |
-| Q8_0 | 36.90 GB | Highest precision quant. |
 ## Metrics
 Hardware and runtime profile:
-- GPU: single NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition, full offload
-- llama.cpp flags: `-ngl 99 -sm none -fa on -p 512 -n 128 -b 4096 -ub 512 -r 3`
-- PPL: `llama-perplexity`, context 2048, 64 rendered eval conversations, 3 chunks
-- KLD: approximate `KL(P_BF16 || P_quant)` over top-64 next-token distributions on 32 prompts
-The PPL eval is intentionally small, so treat PPL deltas as directional. KLD and top-1 agreement are more useful here for quant-to-BF16 comparison.
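As a rough illustration of the KLD and top-1 metrics described above, the sketch below computes an approximate `KL(P_ref || P_quant)` restricted to the reference model's top-k tokens for a single position. It is an assumption about how such a metric can be computed, not the script used for this repository: the function names `topk_kl` and `top1_match`, the renormalization over the top-k set, and the epsilon floor are all hypothetical choices.

```python
import math


def topk_kl(ref_logprobs, quant_logprobs, k=64, eps=1e-10):
    """Approximate KL(P_ref || P_quant) over the reference model's top-k tokens.

    Both arguments map token id -> natural-log probability for one position.
    Hypothetical helper: restricting to top-k and renormalizing is an
    assumption, not necessarily what the repository's KLD reports do.
    """
    # Pick the k most likely tokens under the reference (BF16) distribution.
    top = sorted(ref_logprobs, key=ref_logprobs.get, reverse=True)[:k]

    # Renormalize both distributions over that token subset.
    p = [math.exp(ref_logprobs[t]) for t in top]
    q = [math.exp(quant_logprobs.get(t, math.log(eps))) for t in top]
    p_sum, q_sum = sum(p), sum(q)
    p = [x / p_sum for x in p]
    q = [max(x / q_sum, eps) for x in q]

    # KL divergence: sum of p * log(p / q), skipping zero-probability terms.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def top1_match(ref_logprobs, quant_logprobs):
    """True when both distributions agree on the most likely token."""
    best_ref = max(ref_logprobs, key=ref_logprobs.get)
    best_quant = max(quant_logprobs, key=quant_logprobs.get)
    return best_ref == best_quant


# Toy three-token distributions standing in for real model outputs.
ref = {0: math.log(0.7), 1: math.log(0.2), 2: math.log(0.1)}
quant = {0: math.log(0.6), 1: math.log(0.3), 2: math.log(0.1)}
print(topk_kl(ref, quant), top1_match(ref, quant))
```

Averaging `topk_kl` over all evaluated positions would give a mean KLD, and the fraction of positions where `top1_match` is true would give a top-1 agreement rate. Because the sum only covers the top-k tokens, the result is an approximation of the full-vocabulary divergence.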
 | Model | Size GB | Prompt tok/s | Gen tok/s | PPL | PPL delta | KLD mean | KLD p95 | Top-1 match |
 |---|---:|---:|---:|---:|---:|---:|---:|---:|
@@ -72,24 +179,19 @@ The PPL eval is intentionally small, so treat PPL deltas as directional. KLD and
 ### Charts
-![Size vs generation speed](metrics/chart-size-vs-generation.png)
 ![Mean KLD](metrics/chart-kld-mean.png)
 ![PPL delta](metrics/chart-ppl-delta.png)
-![Quality vs size](metrics/chart-quality-vs-size.png)
 Raw metric files are in `metrics/`; KLD reports, checksums, and the MTP audit are in `reports/`.
-## MTP Q4 Variants
-The upstream Agents-A1 checkpoint used for the first GGUF release advertises
-MTP in config but does not ship `mtp.*`/`blk.40.*` tensors. The two MTP Q4
-variants here graft in the Agents-A1 MTPLX MTP sidecar from
-`wang-yang/Agents-A1-MTPLX-Q4`, then convert it with llama.cpp's Qwen3.5-MoE
-MTP path. The dense MTP block is preserved at Q6_K while the model body is
-quantized to IQ4_XS or Q4_K_M.
 Structural checks for both MTP GGUFs:
@@ -101,92 +203,46 @@ Structural checks for both MTP GGUFs:
 | `blk.40.*` MTP tensors | 20 |
 | `blk.40.nextn.*` tensors | 4 |
-Single-user serving profile: one RTX PRO 6000 Blackwell Max-Q 96 GB GPU,
-`PARALLEL=1`, `CTX_SIZE=8192`, streaming chat completions, `12` requests,
-`128` max tokens, `temperature=0`, `top_p=1`.
 | Quant | Mode | Aggregate tok/s | Speedup vs target-only | Draft acceptance | Mean accepted length | Acceptance by position |
 |---|---:|---:|---:|---:|---:|---|
-| IQ4_XS-MTP | target-only | 224.59 | 1.00x | n/a | n/a | n/a |
-| IQ4_XS-MTP | `draft-mtp`, `n_max=2` | 275.03 | 1.22x | 76.51% | 2.52 | `(0.830, 0.692)` |
-| IQ4_XS-MTP | `draft-mtp`, `n_max=1` | 259.58 | 1.16x | 86.47% | 1.86 | `(0.865)` |
-| Q4_K_M-MTP | target-only | 230.48 | 1.00x | n/a | n/a | n/a |
-| Q4_K_M-MTP | `draft-mtp`, `n_max=2` | 273.80 | 1.19x | 77.18% | 2.53 | `(0.847, 0.687)` |
-| Q4_K_M-MTP | `draft-mtp`, `n_max=1` | 264.88 | 1.15x | 91.46% | 1.91 | `(0.915)` |
-Recommended low-latency/single-user throughput profile: `SPEC_DRAFT_N_MAX=2`.
-Recommended high-acceptance fallback: `SPEC_DRAFT_N_MAX=1`.
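The speedup column can be reproduced from the aggregate tok/s values, and the mean accepted length appears to equal one target token plus the per-position acceptance rates. The sketch below checks both against the table. The second relation is an inference from the numbers, not something the reports state, and the names `ROWS`, `speedup`, and `expected_tokens_per_step` are illustrative.

```python
# Rows copied from the MTP table:
# (quant, mode, aggregate tok/s, acceptance by position).
ROWS = [
    ("IQ4_XS-MTP", "target-only", 224.59, ()),
    ("IQ4_XS-MTP", "n_max=2", 275.03, (0.830, 0.692)),
    ("IQ4_XS-MTP", "n_max=1", 259.58, (0.865,)),
    ("Q4_K_M-MTP", "target-only", 230.48, ()),
    ("Q4_K_M-MTP", "n_max=2", 273.80, (0.847, 0.687)),
    ("Q4_K_M-MTP", "n_max=1", 264.88, (0.915,)),
]

# Baseline throughput per quant, taken from the target-only rows.
BASELINE = {quant: tps for quant, mode, tps, _ in ROWS if mode == "target-only"}


def speedup(quant, tps):
    """Throughput relative to the same quant without speculative decoding."""
    return tps / BASELINE[quant]


def expected_tokens_per_step(acceptance_by_position):
    """Assumed model: 1 target token plus each draft position's acceptance rate."""
    return 1.0 + sum(acceptance_by_position)


for quant, mode, tps, acceptance in ROWS:
    if mode == "target-only":
        continue
    print(f"{quant} {mode}: {speedup(quant, tps):.2f}x, "
          f"~{expected_tokens_per_step(acceptance):.2f} tokens/step")
```

Under that assumption the computed values agree with the table's speedup and mean accepted length columns to within 0.01, which is consistent with the per-position rates being rounded to three decimals.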
-Detailed MTP evidence is in:
 - `reports/agents-a1-mtp-q4-profile-summary.md`
 - `reports/agents-a1-mtp-q4-profile-summary.json`
 - `configs/mtp_profiles.yaml`
-## Usage
-Example with the recommended compact quant:
-```bash
-llama-server \
-  -m agents-a1-IQ4_XS.gguf \
-  -ngl 99 \
-  -c 8192 \
-  -b 4096 \
-  -ub 512 \
-  --flash-attn on
-```
-NVFP4 example:
-```bash
-llama-server \
-  -m agents-a1-NVFP4.gguf \
-  -ngl 99 \
-  -c 8192 \
-  -b 4096 \
-  -ub 512 \
-  --flash-attn on
 ```
-The NVFP4 artifact is a standard GGUF using the `NVFP4` tensor type, but runtime support is still newer and less universal than K-quants or IQ4_XS. It was tested on a Blackwell GPU with a llama.cpp build reporting `BLACKWELL_NATIVE_FP4 = 1`.
-MTP example:
-```bash
-LLAMA_SPEC_MAX_DRAFTING_SLOTS=1 \
-LLAMA_MTP_FAST_BACKEND_SAMPLE=1 \
-LLAMA_MTP_DRAFT_TOP_K=1 \
-LLAMA_MTP_DRAFT_TOP_P=1 \
-LLAMA_MTP_DRAFT_TEMP=1 \
-llama-server \
-  -m agents-a1-IQ4_XS-MTP-graft-headQ6.gguf \
-  -ngl 99 \
-  -c 8192 \
-  -b 4096 \
-  -ub 512 \
-  --flash-attn on \
-  --reasoning off \
-  --spec-type draft-mtp \
-  --spec-draft-n-max 2 \
-  --spec-draft-n-min 0 \
-  --spec-draft-backend-sampling
-```
-For the high-acceptance profile, change `--spec-draft-n-max 2` to
-`--spec-draft-n-max 1`.
-## MTP Status
-The original upstream snapshot remains config-only for MTP; see
-`reports/mtp-weights-audit.json`. The new `*-MTP-graft-headQ6.gguf` files are
-true integrated MTP GGUFs built from the Agents-A1 MTPLX MTP sidecar.
-## Provenance
-- Base model: `InternScience/Agents-A1`
-- License: Apache-2.0, inherited from the base model
-- Quantization source: BF16 GGUF converted from the Hugging Face checkpoint
-- MTP source: `wang-yang/Agents-A1-MTPLX-Q4` sidecar grafted onto the base Agents-A1 checkpoint
-- Calibration: coding/instruction chat data rendered with the model chat template
-- Quantizer: patched llama.cpp with Qwen3.5-MoE and NVFP4 support

 ---
 license: apache-2.0
+language:
+- en
 base_model:
 - InternScience/Agents-A1
+base_model_relation: quantized
+quantized_by: LordNeel
 library_name: llama.cpp
 pipeline_tag: text-generation
 tags:
 - qwen3.5-moe
 - mixture-of-experts
 - agents-a1
+- agent
 - nvfp4
 - mtp
 - speculative-decoding
+- imatrix
 ---
+# Agents-A1 GGUF
+GGUF quantizations of [**InternScience/Agents-A1**](https://huggingface.co/InternScience/Agents-A1) — a 35B Mixture-of-Experts **agentic** model (Qwen3.5-MoE architecture) built for long-horizon search, engineering, scientific research, instruction-following, and tool-calling.
+Files were produced from the BF16 Hugging Face checkpoint with a patched `llama.cpp` build that supports the `qwen35moe` architecture. Each quant uses an importance matrix (imatrix) built from coding/instruction-chat calibration data, and every file was benchmarked against the BF16 GGUF reference (PPL, KL-divergence, top-1 agreement).
+> [!IMPORTANT]
+> **These are text-only GGUFs.** The base model is multimodal (vision + video), but no `mmproj` projector is shipped here, so image/video input is not available with these files. Use them for text and agentic/tool-calling workloads.
+## Model summary
+| | |
+|---|---|
+| Base model | [InternScience/Agents-A1](https://huggingface.co/InternScience/Agents-A1) ([paper](https://arxiv.org/abs/2606.30616) · [homepage](https://internscience.github.io/Agents-A1/) · [GitHub](https://github.com/InternScience/Agents-A1)) |
+| Architecture | Qwen3.5-MoE, hybrid linear/full attention (full attention every 4th layer) |
+| Parameters | ~35B total, ~3B active per token (A3B-class) |
+| Experts | 256 experts, 8 active + 1 shared per token |
+| Layers | 40 transformer layers + 1 MTP layer |
+| Context length | 262,144 (256K) native |
+| Language | English |
+| License | Apache-2.0 (inherited from base) |
+| Quantized by | [LordNeel](https://huggingface.co/LordNeel) |
+## Which file should I pick?
+| Goal | File | Notes |
 |---|---|---|
+| Best small general-purpose quant | `agents-a1-IQ4_XS.gguf` | Strong quality for size, broad `llama.cpp` compatibility. |
+| Best single-user MTP throughput | `agents-a1-IQ4_XS-MTP-graft-headQ6.gguf` | IQ4_XS body + Q6_K MTP block; **1.22×** over target-only at `n_max=2`. |
+| Highest MTP draft acceptance | `agents-a1-Q4_K_M-MTP-graft-headQ6.gguf` (`SPEC_DRAFT_N_MAX=1`) | **91.46%** acceptance, still 1.15× over target-only. |
+| Fast Blackwell FP4 path | `agents-a1-NVFP4.gguf` | Tested on RTX PRO 6000 Blackwell. Needs runtime support for `GGML_TYPE_NVFP4`. |
+| Safer quality step up | `agents-a1-Q5_K_M.gguf` | Lower KLD than IQ4_XS, larger size. |
 | Closest to BF16 by KLD | `agents-a1-Q6_K.gguf` | Best KLD in this eval set. |
+| High-precision archival | `agents-a1-Q8_0.gguf` | Largest quant. |
+**Sizing:** for full GPU offload, give yourself roughly `file size + KV cache` of VRAM. K-quants (`Q4_K_M`, `Q5_K_M`, `Q6_K`) are the most portable. `IQ4_XS` is an I-quant and benefits from the bundled imatrix. `NVFP4` is the fastest prefill path but needs a Blackwell-class GPU and a recent FP4-capable `llama.cpp` build.
 ## Files
 |---|---:|---|
 | Q3_K_M | 16.76 GB | Smallest included quant. |
 | IQ4_XS | 18.73 GB | Recommended compact quant. |
+| IQ4_XS-MTP-graft-headQ6 | 19.42 GB | IQ4_XS body + integrated Q6_K/F32 MTP block. |
+| NVFP4 | 19.72 GB | Blackwell-oriented FP4 GGUF; output head kept at Q6_K by quality rule. |
 | Q4_K_M | 21.17 GB | Standard K-quant. |
+| Q4_K_M-MTP-graft-headQ6 | 21.86 GB | Q4_K_M body + integrated Q6_K/F32 MTP block. |
 | Q5_K_M | 24.73 GB | Strong quality/size tradeoff. |
 | Q6_K | 28.51 GB | Lowest mean KLD in this run. |
+| Q8_0 | 36.90 GB | Highest-precision quant. |
+## Download
+```bash
+pip install -U "huggingface_hub[cli]"
+# download a single quant into ./agents-a1
+hf download LordNeel/Agents-A1-GGUF agents-a1-IQ4_XS.gguf --local-dir ./agents-a1
+```
+You generally want a **recent `llama.cpp` build with `qwen35moe` support**; the NVFP4 and MTP files need newer builds still (see the relevant sections below).
+## Usage
+Standard inference with the recommended compact quant:
+```bash
+llama-server \
+  -m agents-a1-IQ4_XS.gguf \
+  -ngl 99 \
+  -c 8192 \
+  -b 4096 \
+  -ub 512 \
+  --flash-attn on
+```
+`-c 8192` is just a starting point — the model's native context is 256K, so raise `-c` as your VRAM allows.
+**NVFP4** (Blackwell):
+```bash
+llama-server \
+  -m agents-a1-NVFP4.gguf \
+  -ngl 99 -c 8192 -b 4096 -ub 512 --flash-attn on
+```
+The NVFP4 artifact is a standard GGUF using the `NVFP4` tensor type, but runtime support is newer and less universal than K-quants or IQ4_XS. It was tested on a Blackwell GPU with a `llama.cpp` build reporting `BLACKWELL_NATIVE_FP4 = 1`.
+**MTP / speculative decoding** (single-user throughput):
+```bash
+LLAMA_SPEC_MAX_DRAFTING_SLOTS=1 \
+LLAMA_MTP_FAST_BACKEND_SAMPLE=1 \
+LLAMA_MTP_DRAFT_TOP_K=1 \
+LLAMA_MTP_DRAFT_TOP_P=1 \
+LLAMA_MTP_DRAFT_TEMP=1 \
+llama-server \
+  -m agents-a1-IQ4_XS-MTP-graft-headQ6.gguf \
+  -ngl 99 -c 8192 -b 4096 -ub 512 --flash-attn on \
+  --reasoning off \
+  --spec-type draft-mtp \
+  --spec-draft-n-max 2 \
+  --spec-draft-n-min 0 \
+  --spec-draft-backend-sampling
+```
+For the high-acceptance profile, change `--spec-draft-n-max 2` to `--spec-draft-n-max 1`.
+Python with `llama-cpp-python`:
+```python
+from llama_cpp import Llama
+llm = Llama.from_pretrained(
+    repo_id="LordNeel/Agents-A1-GGUF",
+    filename="agents-a1-IQ4_XS.gguf",
+)
+```
+### Prompt format
+Agents-A1 uses a Qwen-style ChatML template (embedded in the GGUF, so `llama-server`/`llama-cli` chat endpoints apply it automatically):
+```
+<|im_start|>system
+{system_prompt}<|im_end|>
+<|im_start|>user
+{user_message}<|im_end|>
+<|im_start|>assistant
+```
+The model natively supports function calling / tool use — see the [base model card](https://huggingface.co/InternScience/Agents-A1) for agentic and tool-calling details.
 ## Metrics
 Hardware and runtime profile:
+- **GPU:** single NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition, full offload
+- **`llama.cpp` flags:** `-ngl 99 -sm none -fa on -p 512 -n 128 -b 4096 -ub 512 -r 3`
+- **PPL:** `llama-perplexity`, context 2048, 64 rendered eval conversations, 3 chunks
+- **KLD:** approximate `KL(P_BF16 || P_quant)` over top-64 next-token distributions on 32 prompts
+The PPL eval is intentionally small, so treat PPL deltas as directional. KLD and top-1 agreement are the more useful quant-to-BF16 quality signals here.
 | Model | Size GB | Prompt tok/s | Gen tok/s | PPL | PPL delta | KLD mean | KLD p95 | Top-1 match |
 |---|---:|---:|---:|---:|---:|---:|---:|---:|
 ### Charts
+![Throughput by quant](metrics/chart-throughput.png)
+![Quality vs size](metrics/chart-quality-vs-size.png)
 ![Mean KLD](metrics/chart-kld-mean.png)
 ![PPL delta](metrics/chart-ppl-delta.png)
 Raw metric files are in `metrics/`; KLD reports, checksums, and the MTP audit are in `reports/`.
+## MTP (Multi-Token Prediction) Q4 variants
+The upstream Agents-A1 checkpoint used for the first GGUF release advertises MTP in config but does not ship `mtp.*` / `blk.40.*` tensors. The two MTP Q4 variants here graft in the Agents-A1 MTPLX MTP sidecar from [`wang-yang/Agents-A1-MTPLX-Q4`](https://huggingface.co/wang-yang/Agents-A1-MTPLX-Q4), then convert it with `llama.cpp`'s Qwen3.5-MoE MTP path. The dense MTP block is preserved at Q6_K while the model body is quantized to IQ4_XS or Q4_K_M.
 Structural checks for both MTP GGUFs:
 | `blk.40.*` MTP tensors | 20 |
 | `blk.40.nextn.*` tensors | 4 |
+Single-user serving profile: one RTX PRO 6000 Blackwell Max-Q 96 GB GPU, `PARALLEL=1`, `CTX_SIZE=8192`, streaming chat completions, 12 requests, 128 max tokens, `temperature=0`, `top_p=1`.
 | Quant | Mode | Aggregate tok/s | Speedup vs target-only | Draft acceptance | Mean accepted length | Acceptance by position |
 |---|---:|---:|---:|---:|---:|---|
+| IQ4_XS-MTP | target-only | 224.59 | 1.00× | n/a | n/a | n/a |
+| IQ4_XS-MTP | `draft-mtp`, `n_max=2` | 275.03 | 1.22× | 76.51% | 2.52 | `(0.830, 0.692)` |
+| IQ4_XS-MTP | `draft-mtp`, `n_max=1` | 259.58 | 1.16× | 86.47% | 1.86 | `(0.865)` |
+| Q4_K_M-MTP | target-only | 230.48 | 1.00× | n/a | n/a | n/a |
+| Q4_K_M-MTP | `draft-mtp`, `n_max=2` | 273.80 | 1.19× | 77.18% | 2.53 | `(0.847, 0.687)` |
+| Q4_K_M-MTP | `draft-mtp`, `n_max=1` | 264.88 | 1.15× | 91.46% | 1.91 | `(0.915)` |
+![MTP speedup and acceptance](metrics/chart-mtp-speedup.png)
+Recommended low-latency / single-user throughput profile: `SPEC_DRAFT_N_MAX=2`. Recommended high-acceptance fallback: `SPEC_DRAFT_N_MAX=1`.
+Detailed MTP evidence:
 - `reports/agents-a1-mtp-q4-profile-summary.md`
 - `reports/agents-a1-mtp-q4-profile-summary.json`
+- `reports/mtp-weights-audit.json` (audit of the config-only upstream snapshot)
 - `configs/mtp_profiles.yaml`
+## Provenance & credits
+- **Base model:** [`InternScience/Agents-A1`](https://huggingface.co/InternScience/Agents-A1) — *Scaling the Horizon, Not the Parameters: Reaching Trillion-Parameter Performance with a 35B Agent* ([arXiv:2606.30616](https://arxiv.org/abs/2606.30616))
+- **MTP source:** [`wang-yang/Agents-A1-MTPLX-Q4`](https://huggingface.co/wang-yang/Agents-A1-MTPLX-Q4) sidecar, grafted onto the base checkpoint
+- **Quantization source:** BF16 GGUF converted from the Hugging Face checkpoint
+- **Calibration:** coding/instruction-chat data rendered with the model chat template (imatrix)
+- **Quantizer:** patched `llama.cpp` with Qwen3.5-MoE and NVFP4 support
+- **License:** Apache-2.0, inherited from the base model
+## Citation
+If you use these quantizations, please cite the base model:
+```bibtex
+@article{agentsa1_2026,
+  title   = {Agents-A1: Scaling the Horizon, Not the Parameters: Reaching Trillion-Parameter Performance with a 35B Agent},
+  author  = {InternScience},
+  journal = {arXiv preprint arXiv:2606.30616},
+  year    = {2026}
+}
 ```
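
For clients that bypass the chat endpoints and send raw text to a completion endpoint, the ChatML layout shown in the Prompt format section can be rendered by hand. This is a minimal sketch assuming the plain system/user/assistant layout shown there. The helper name `render_chatml` is hypothetical, and the template embedded in the GGUF remains the authoritative source, especially for tool calls.

```python
def render_chatml(messages, add_generation_prompt=True):
    """Render [{'role': ..., 'content': ...}] into a ChatML prompt string."""
    parts = []
    for message in messages:
        # Each turn is wrapped in <|im_start|>{role} ... <|im_end|>.
        parts.append(
            f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n"
        )
    if add_generation_prompt:
        # Leave the assistant turn open so the model writes the reply.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


prompt = render_chatml([
    {"role": "system", "content": "You are a helpful agent."},
    {"role": "user", "content": "List the files in the current directory."},
])
print(prompt)
```

When using the chat endpoints of `llama-server` or `llama-cli`, this step is unnecessary because the embedded template is applied automatically.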