---
license: apache-2.0
base_model:
- InternScience/Agents-A1
library_name: llama.cpp
pipeline_tag: text-generation
tags:
- gguf
- quantized
- llama-cpp
- qwen3.5-moe
- mixture-of-experts
- agents-a1
- nvfp4
- mtp
- speculative-decoding
- mmproj
- multimodal
- vision
- qwen3vl
---

# Agents-A1 GGUF Quants

High quality GGUF quantizations of [InternScience/Agents-A1](https://huggingface.co/InternScience/Agents-A1), a 35B Qwen3.5-MoE agent model.

These files were produced from the BF16 Hugging Face checkpoint with a patched llama.cpp build that supports the `qwen35moe` architecture. The calibration pass used an importance matrix built from coding/instruction chat data, then each quant was benchmarked against the BF16 GGUF reference.

## Recommended Files

| Use case | File | Notes |
|---|---|---|
| Best small general-purpose quant | `agents-a1-IQ4_XS.gguf` | Strong quality for size, broad llama.cpp compatibility. |
| Best single-user MTP throughput | `agents-a1-IQ4_XS-MTP-graft-headQ6.gguf` | IQ4_XS body with Q6_K MTP block; measured 1.22x over target-only in c1/128 chat serving. |
| Highest MTP acceptance in this run | `agents-a1-Q4_K_M-MTP-graft-headQ6.gguf` with `SPEC_DRAFT_N_MAX=1` | 91.46% draft acceptance while still 1.15x over target-only. |
| Vision / image input for Q4+ quants | `mmproj-agents-a1-bf16.gguf` | Shared BF16 Qwen3VL mmproj for IQ4_XS, Q4_K_M, Q5_K_M, Q6_K, Q8_0, NVFP4, and the Q4 MTP variants. |
| Fast Blackwell FP4 path | `agents-a1-NVFP4.gguf` | Tested on RTX PRO 6000 Blackwell. Requires runtime support for `GGML_TYPE_NVFP4`. |
| Safer quality step up | `agents-a1-Q5_K_M.gguf` | Lower KLD than IQ4_XS with larger size. |
| Closest to BF16 by KLD | `agents-a1-Q6_K.gguf` | Best KLD in this eval set. |
| High precision archival quant | `agents-a1-Q8_0.gguf` | Largest quantized file. |

## Files

| Quant | File size | Notes |
|---|---:|---|
| Q3_K_M | 16.76 GB | Smallest included quant. |
| IQ4_XS | 18.73 GB | Recommended compact quant. |
| IQ4_XS-MTP-graft-headQ6 | 19.42 GB | IQ4_XS body plus integrated Q6_K/F32 MTP block. |
| NVFP4 | 19.72 GB | Blackwell-oriented FP4 GGUF, output head kept at Q6_K by quality rule. |
| Q4_K_M | 21.17 GB | Standard K-quant. |
| Q4_K_M-MTP-graft-headQ6 | 21.86 GB | Q4_K_M body plus integrated Q6_K/F32 MTP block. |
| Q5_K_M | 24.73 GB | Strong quality/size tradeoff. |
| Q6_K | 28.51 GB | Lowest mean KLD in this run. |
| Q8_0 | 36.90 GB | Highest precision quant. |
| mmproj BF16 | 0.90 GB | Shared Qwen3VL vision encoder/projector for Q4-class and higher text GGUFs. |

## Metrics

Hardware and runtime profile:

- GPU: single NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition, full offload
- llama.cpp flags: `-ngl 99 -sm none -fa on -p 512 -n 128 -b 4096 -ub 512 -r 3`
- PPL: `llama-perplexity`, context 2048, 64 rendered eval conversations, 3 chunks
- KLD: approximate `KL(P_BF16 || P_quant)` over top-64 next-token distributions on 32 prompts

The PPL eval is intentionally small, so treat PPL deltas as directional. KLD and top-1 agreement are more useful here for quant-to-BF16 comparison.

| Model | Size GB | Prompt tok/s | Gen tok/s | PPL | PPL delta | KLD mean | KLD p95 | Top-1 match |
|---|---:|---:|---:|---:|---:|---:|---:|---:|
| BF16 reference | 69.38 | 3418.9 | 161.8 | 1.3031 | 0.0000 | 0.0000 | 0.0000 | 32/32 |
| Q3_K_M | 16.76 | 6779.5 | 269.0 | 1.3101 | +0.0070 | 0.0655 | 0.2155 | 28/32 |
| IQ4_XS | 18.73 | 7719.5 | 258.1 | 1.3038 | +0.0007 | 0.0151 | 0.0654 | 29/32 |
| NVFP4 | 19.72 | 9064.0 | 265.1 | 1.3063 | +0.0032 | 0.0420 | 0.1473 | 31/32 |
| Q4_K_M | 21.17 | 7230.8 | 262.6 | 1.3016 | -0.0015 | 0.1225 | 0.3349 | 27/32 |
| Q5_K_M | 24.73 | 7021.4 | 257.9 | 1.3041 | +0.0010 | 0.0091 | 0.0335 | 30/32 |
| Q6_K | 28.51 | 6294.0 | 244.6 | 1.3040 | +0.0009 | 0.0049 | 0.0178 | 32/32 |
| Q8_0 | 36.90 | 7431.3 | 222.7 | 1.3036 | +0.0005 | 0.0053 | 0.0063 | 30/32 |

### Charts

![Size vs generation speed](metrics/chart-size-vs-generation.png)

![Mean KLD](metrics/chart-kld-mean.png)

![PPL delta](metrics/chart-ppl-delta.png)

![Quality vs size](metrics/chart-quality-vs-size.png)

Raw metric files are in `metrics/`; KLD reports, checksums, and the MTP audit are in `reports/`.

## MTP Q4 Variants

The upstream Agents-A1 checkpoint used for the first GGUF release advertises
MTP in config but does not ship `mtp.*`/`blk.40.*` tensors. The two MTP Q4
variants here graft in the Agents-A1 MTPLX MTP sidecar from
`wang-yang/Agents-A1-MTPLX-Q4`, then convert it with llama.cpp's Qwen3.5-MoE
MTP path. The dense MTP block is preserved at Q6_K while the model body is
quantized to IQ4_XS or Q4_K_M.

Structural checks for both MTP GGUFs:

| Check | Value |
|---|---:|
| GGUF tensors | 753 |
| `qwen35moe.block_count` | 41 |
| `qwen35moe.nextn_predict_layers` | 1 |
| `blk.40.*` MTP tensors | 20 |
| `blk.40.nextn.*` tensors | 4 |

Single-user serving profile: one RTX PRO 6000 Blackwell Max-Q 96 GB GPU,
`PARALLEL=1`, `CTX_SIZE=8192`, streaming chat completions, `12` requests,
`128` max tokens, `temperature=0`, `top_p=1`.

| Quant | Mode | Aggregate tok/s | Speedup vs target-only | Draft acceptance | Mean accepted length | Acceptance by position |
|---|---:|---:|---:|---:|---:|---|
| IQ4_XS-MTP | target-only | 224.59 | 1.00x | n/a | n/a | n/a |
| IQ4_XS-MTP | `draft-mtp`, `n_max=2` | 275.03 | 1.22x | 76.51% | 2.52 | `(0.830, 0.692)` |
| IQ4_XS-MTP | `draft-mtp`, `n_max=1` | 259.58 | 1.16x | 86.47% | 1.86 | `(0.865)` |
| Q4_K_M-MTP | target-only | 230.48 | 1.00x | n/a | n/a | n/a |
| Q4_K_M-MTP | `draft-mtp`, `n_max=2` | 273.80 | 1.19x | 77.18% | 2.53 | `(0.847, 0.687)` |
| Q4_K_M-MTP | `draft-mtp`, `n_max=1` | 264.88 | 1.15x | 91.46% | 1.91 | `(0.915)` |

Recommended low-latency/single-user throughput profile: `SPEC_DRAFT_N_MAX=2`.
Recommended high-acceptance fallback: `SPEC_DRAFT_N_MAX=1`.

Detailed MTP evidence is in:

- `reports/agents-a1-mtp-q4-profile-summary.md`
- `reports/agents-a1-mtp-q4-profile-summary.json`
- `configs/mtp_profiles.yaml`

## Usage

Example with the recommended compact quant:

```bash
llama-server \
  -m agents-a1-IQ4_XS.gguf \
  -ngl 99 \
  -c 8192 \
  -b 4096 \
  -ub 512 \
  --flash-attn on
```

NVFP4 example:

```bash
llama-server \
  -m agents-a1-NVFP4.gguf \
  -ngl 99 \
  -c 8192 \
  -b 4096 \
  -ub 512 \
  --flash-attn on
```

The NVFP4 artifact is a standard GGUF using the `NVFP4` tensor type, but runtime support is still newer and less universal than K-quants or IQ4_XS. It was tested on a Blackwell GPU with a llama.cpp build reporting `BLACKWELL_NATIVE_FP4 = 1`.

MTP example:

```bash
LLAMA_SPEC_MAX_DRAFTING_SLOTS=1 \
LLAMA_MTP_FAST_BACKEND_SAMPLE=1 \
LLAMA_MTP_DRAFT_TOP_K=1 \
LLAMA_MTP_DRAFT_TOP_P=1 \
LLAMA_MTP_DRAFT_TEMP=1 \
llama-server \
  -m agents-a1-IQ4_XS-MTP-graft-headQ6.gguf \
  -ngl 99 \
  -c 8192 \
  -b 4096 \
  -ub 512 \
  --flash-attn on \
  --reasoning off \
  --spec-type draft-mtp \
  --spec-draft-n-max 2 \
  --spec-draft-n-min 0 \
  --spec-draft-backend-sampling
```

For the high-acceptance profile, change `--spec-draft-n-max 2` to
`--spec-draft-n-max 1`.

## Vision / mmproj

The release includes one shared multimodal projector:

- `mmproj-agents-a1-bf16.gguf`
- `processor_config.json`
- `preprocessor_config.json`
- `video_preprocessor_config.json`

The mmproj was converted from the original `InternScience/Agents-A1` Hugging
Face checkpoint with llama.cpp `convert_hf_to_gguf.py --mmproj --outtype bf16`.
It contains the Qwen3VL vision tower/projector and is independent of the text
quantization level, so the same file is intended for Q4-class and higher text
GGUFs:

- `agents-a1-IQ4_XS.gguf`
- `agents-a1-IQ4_XS-MTP-graft-headQ6.gguf`
- `agents-a1-NVFP4.gguf`
- `agents-a1-Q4_K_M.gguf`
- `agents-a1-Q4_K_M-MTP-graft-headQ6.gguf`
- `agents-a1-Q5_K_M.gguf`
- `agents-a1-Q6_K.gguf`
- `agents-a1-Q8_0.gguf`

`Q3_K_M` may load with the same mmproj, but it is not the recommended vision
profile because image tasks are more sensitive to text-model quantization.

Example with llama.cpp's multimodal CLI:

```bash
llama-mtmd-cli \
  -m agents-a1-Q4_K_M.gguf \
  --mmproj mmproj-agents-a1-bf16.gguf \
  --image image.jpg \
  -p "Describe the image." \
  -ngl 99 \
  -c 4096 \
  -b 1024 \
  -ub 256 \
  --chat-template chatml \
  --image-min-tokens 1024 \
  --flash-attn on
```

If your llama.cpp `llama-server` build has multimodal support enabled, the same
mmproj can be passed with `--mmproj mmproj-agents-a1-bf16.gguf`.

Local smoke test:

| Text GGUF | Image | Prompt | Expected | Answer | Verified |
|---|---|---|---|---|---:|
| `agents-a1-Q4_K_M.gguf` | llama.cpp `tools/mtmd/test-1.jpeg` | `Look at the newspaper image. What is the main headline? Answer only with the headline text.` | `MEN WALK ON MOON` | `MEN WALK ON MOON` | true |

Verification report: `reports/mmproj-q4km-actual-image-verify.json`.

## MTP Status

The original upstream snapshot remains config-only for MTP; see
`reports/mtp-weights-audit.json`. The new `*-MTP-graft-headQ6.gguf` files are
true integrated MTP GGUFs built from the Agents-A1 MTPLX MTP sidecar.

## Provenance

- Base model: `InternScience/Agents-A1`
- License: Apache-2.0, inherited from the base model
- Quantization source: BF16 GGUF converted from the Hugging Face checkpoint
- MTP source: `wang-yang/Agents-A1-MTPLX-Q4` sidecar grafted onto the base Agents-A1 checkpoint
- Calibration: coding/instruction chat data rendered with the model chat template
- Quantizer: patched llama.cpp with Qwen3.5-MoE and NVFP4 support