--- license: apache-2.0 base_model: - InternScience/Agents-A1 library_name: llama.cpp pipeline_tag: text-generation tags: - gguf - quantized - llama-cpp - qwen3.5-moe - mixture-of-experts - agents-a1 - nvfp4 - mtp - speculative-decoding - mmproj - multimodal - vision - qwen3vl --- # Agents-A1 GGUF Quants High quality GGUF quantizations of [InternScience/Agents-A1](https://huggingface.co/InternScience/Agents-A1), a 35B Qwen3.5-MoE agent model. These files were produced from the BF16 Hugging Face checkpoint with a patched llama.cpp build that supports the `qwen35moe` architecture. The calibration pass used an importance matrix built from coding/instruction chat data, then each quant was benchmarked against the BF16 GGUF reference. ## Recommended Files | Use case | File | Notes | |---|---|---| | Best small general-purpose quant | `agents-a1-IQ4_XS.gguf` | Strong quality for size, broad llama.cpp compatibility. | | Best single-user MTP throughput | `agents-a1-IQ4_XS-MTP-graft-headQ6.gguf` | IQ4_XS body with Q6_K MTP block; measured 1.22x over target-only in c1/128 chat serving. | | Highest MTP acceptance in this run | `agents-a1-Q4_K_M-MTP-graft-headQ6.gguf` with `SPEC_DRAFT_N_MAX=1` | 91.46% draft acceptance while still 1.15x over target-only. | | Vision / image input for Q4+ quants | `mmproj-agents-a1-bf16.gguf` | Shared BF16 Qwen3VL mmproj for IQ4_XS, Q4_K_M, Q5_K_M, Q6_K, Q8_0, NVFP4, and the Q4 MTP variants. | | Fast Blackwell FP4 path | `agents-a1-NVFP4.gguf` | Tested on RTX PRO 6000 Blackwell. Requires runtime support for `GGML_TYPE_NVFP4`. | | Safer quality step up | `agents-a1-Q5_K_M.gguf` | Lower KLD than IQ4_XS with larger size. | | Closest to BF16 by KLD | `agents-a1-Q6_K.gguf` | Best KLD in this eval set. | | High precision archival quant | `agents-a1-Q8_0.gguf` | Largest quantized file. | ## Files | Quant | File size | Notes | |---|---:|---| | Q3_K_M | 16.76 GB | Smallest included quant. | | IQ4_XS | 18.73 GB | Recommended compact quant. | | IQ4_XS-MTP-graft-headQ6 | 19.42 GB | IQ4_XS body plus integrated Q6_K/F32 MTP block. | | NVFP4 | 19.72 GB | Blackwell-oriented FP4 GGUF, output head kept at Q6_K by quality rule. | | Q4_K_M | 21.17 GB | Standard K-quant. | | Q4_K_M-MTP-graft-headQ6 | 21.86 GB | Q4_K_M body plus integrated Q6_K/F32 MTP block. | | Q5_K_M | 24.73 GB | Strong quality/size tradeoff. | | Q6_K | 28.51 GB | Lowest mean KLD in this run. | | Q8_0 | 36.90 GB | Highest precision quant. | | mmproj BF16 | 0.90 GB | Shared Qwen3VL vision encoder/projector for Q4-class and higher text GGUFs. | ## Metrics Hardware and runtime profile: - GPU: single NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition, full offload - llama.cpp flags: `-ngl 99 -sm none -fa on -p 512 -n 128 -b 4096 -ub 512 -r 3` - PPL: `llama-perplexity`, context 2048, 64 rendered eval conversations, 3 chunks - KLD: approximate `KL(P_BF16 || P_quant)` over top-64 next-token distributions on 32 prompts The PPL eval is intentionally small, so treat PPL deltas as directional. KLD and top-1 agreement are more useful here for quant-to-BF16 comparison. | Model | Size GB | Prompt tok/s | Gen tok/s | PPL | PPL delta | KLD mean | KLD p95 | Top-1 match | |---|---:|---:|---:|---:|---:|---:|---:|---:| | BF16 reference | 69.38 | 3418.9 | 161.8 | 1.3031 | 0.0000 | 0.0000 | 0.0000 | 32/32 | | Q3_K_M | 16.76 | 6779.5 | 269.0 | 1.3101 | +0.0070 | 0.0655 | 0.2155 | 28/32 | | IQ4_XS | 18.73 | 7719.5 | 258.1 | 1.3038 | +0.0007 | 0.0151 | 0.0654 | 29/32 | | NVFP4 | 19.72 | 9064.0 | 265.1 | 1.3063 | +0.0032 | 0.0420 | 0.1473 | 31/32 | | Q4_K_M | 21.17 | 7230.8 | 262.6 | 1.3016 | -0.0015 | 0.1225 | 0.3349 | 27/32 | | Q5_K_M | 24.73 | 7021.4 | 257.9 | 1.3041 | +0.0010 | 0.0091 | 0.0335 | 30/32 | | Q6_K | 28.51 | 6294.0 | 244.6 | 1.3040 | +0.0009 | 0.0049 | 0.0178 | 32/32 | | Q8_0 | 36.90 | 7431.3 | 222.7 | 1.3036 | +0.0005 | 0.0053 | 0.0063 | 30/32 | ### Charts ![Size vs generation speed](metrics/chart-size-vs-generation.png) ![Mean KLD](metrics/chart-kld-mean.png) ![PPL delta](metrics/chart-ppl-delta.png) ![Quality vs size](metrics/chart-quality-vs-size.png) Raw metric files are in `metrics/`; KLD reports, checksums, and the MTP audit are in `reports/`. ## MTP Q4 Variants The upstream Agents-A1 checkpoint used for the first GGUF release advertises MTP in config but does not ship `mtp.*`/`blk.40.*` tensors. The two MTP Q4 variants here graft in the Agents-A1 MTPLX MTP sidecar from `wang-yang/Agents-A1-MTPLX-Q4`, then convert it with llama.cpp's Qwen3.5-MoE MTP path. The dense MTP block is preserved at Q6_K while the model body is quantized to IQ4_XS or Q4_K_M. Structural checks for both MTP GGUFs: | Check | Value | |---|---:| | GGUF tensors | 753 | | `qwen35moe.block_count` | 41 | | `qwen35moe.nextn_predict_layers` | 1 | | `blk.40.*` MTP tensors | 20 | | `blk.40.nextn.*` tensors | 4 | Single-user serving profile: one RTX PRO 6000 Blackwell Max-Q 96 GB GPU, `PARALLEL=1`, `CTX_SIZE=8192`, streaming chat completions, `12` requests, `128` max tokens, `temperature=0`, `top_p=1`. | Quant | Mode | Aggregate tok/s | Speedup vs target-only | Draft acceptance | Mean accepted length | Acceptance by position | |---|---:|---:|---:|---:|---:|---| | IQ4_XS-MTP | target-only | 224.59 | 1.00x | n/a | n/a | n/a | | IQ4_XS-MTP | `draft-mtp`, `n_max=2` | 275.03 | 1.22x | 76.51% | 2.52 | `(0.830, 0.692)` | | IQ4_XS-MTP | `draft-mtp`, `n_max=1` | 259.58 | 1.16x | 86.47% | 1.86 | `(0.865)` | | Q4_K_M-MTP | target-only | 230.48 | 1.00x | n/a | n/a | n/a | | Q4_K_M-MTP | `draft-mtp`, `n_max=2` | 273.80 | 1.19x | 77.18% | 2.53 | `(0.847, 0.687)` | | Q4_K_M-MTP | `draft-mtp`, `n_max=1` | 264.88 | 1.15x | 91.46% | 1.91 | `(0.915)` | Recommended low-latency/single-user throughput profile: `SPEC_DRAFT_N_MAX=2`. Recommended high-acceptance fallback: `SPEC_DRAFT_N_MAX=1`. Detailed MTP evidence is in: - `reports/agents-a1-mtp-q4-profile-summary.md` - `reports/agents-a1-mtp-q4-profile-summary.json` - `configs/mtp_profiles.yaml` ## Usage Example with the recommended compact quant: ```bash llama-server \ -m agents-a1-IQ4_XS.gguf \ -ngl 99 \ -c 8192 \ -b 4096 \ -ub 512 \ --flash-attn on ``` NVFP4 example: ```bash llama-server \ -m agents-a1-NVFP4.gguf \ -ngl 99 \ -c 8192 \ -b 4096 \ -ub 512 \ --flash-attn on ``` The NVFP4 artifact is a standard GGUF using the `NVFP4` tensor type, but runtime support is still newer and less universal than K-quants or IQ4_XS. It was tested on a Blackwell GPU with a llama.cpp build reporting `BLACKWELL_NATIVE_FP4 = 1`. MTP example: ```bash LLAMA_SPEC_MAX_DRAFTING_SLOTS=1 \ LLAMA_MTP_FAST_BACKEND_SAMPLE=1 \ LLAMA_MTP_DRAFT_TOP_K=1 \ LLAMA_MTP_DRAFT_TOP_P=1 \ LLAMA_MTP_DRAFT_TEMP=1 \ llama-server \ -m agents-a1-IQ4_XS-MTP-graft-headQ6.gguf \ -ngl 99 \ -c 8192 \ -b 4096 \ -ub 512 \ --flash-attn on \ --reasoning off \ --spec-type draft-mtp \ --spec-draft-n-max 2 \ --spec-draft-n-min 0 \ --spec-draft-backend-sampling ``` For the high-acceptance profile, change `--spec-draft-n-max 2` to `--spec-draft-n-max 1`. ## Vision / mmproj The release includes one shared multimodal projector: - `mmproj-agents-a1-bf16.gguf` - `processor_config.json` - `preprocessor_config.json` - `video_preprocessor_config.json` The mmproj was converted from the original `InternScience/Agents-A1` Hugging Face checkpoint with llama.cpp `convert_hf_to_gguf.py --mmproj --outtype bf16`. It contains the Qwen3VL vision tower/projector and is independent of the text quantization level, so the same file is intended for Q4-class and higher text GGUFs: - `agents-a1-IQ4_XS.gguf` - `agents-a1-IQ4_XS-MTP-graft-headQ6.gguf` - `agents-a1-NVFP4.gguf` - `agents-a1-Q4_K_M.gguf` - `agents-a1-Q4_K_M-MTP-graft-headQ6.gguf` - `agents-a1-Q5_K_M.gguf` - `agents-a1-Q6_K.gguf` - `agents-a1-Q8_0.gguf` `Q3_K_M` may load with the same mmproj, but it is not the recommended vision profile because image tasks are more sensitive to text-model quantization. Example with llama.cpp's multimodal CLI: ```bash llama-mtmd-cli \ -m agents-a1-Q4_K_M.gguf \ --mmproj mmproj-agents-a1-bf16.gguf \ --image image.jpg \ -p "Describe the image." \ -ngl 99 \ -c 4096 \ -b 1024 \ -ub 256 \ --chat-template chatml \ --image-min-tokens 1024 \ --flash-attn on ``` If your llama.cpp `llama-server` build has multimodal support enabled, the same mmproj can be passed with `--mmproj mmproj-agents-a1-bf16.gguf`. Local smoke test: | Text GGUF | Image | Prompt | Expected | Answer | Verified | |---|---|---|---|---|---:| | `agents-a1-Q4_K_M.gguf` | llama.cpp `tools/mtmd/test-1.jpeg` | `Look at the newspaper image. What is the main headline? Answer only with the headline text.` | `MEN WALK ON MOON` | `MEN WALK ON MOON` | true | Verification report: `reports/mmproj-q4km-actual-image-verify.json`. ## MTP Status The original upstream snapshot remains config-only for MTP; see `reports/mtp-weights-audit.json`. The new `*-MTP-graft-headQ6.gguf` files are true integrated MTP GGUFs built from the Agents-A1 MTPLX MTP sidecar. ## Provenance - Base model: `InternScience/Agents-A1` - License: Apache-2.0, inherited from the base model - Quantization source: BF16 GGUF converted from the Hugging Face checkpoint - MTP source: `wang-yang/Agents-A1-MTPLX-Q4` sidecar grafted onto the base Agents-A1 checkpoint - Calibration: coding/instruction chat data rendered with the model chat template - Quantizer: patched llama.cpp with Qwen3.5-MoE and NVFP4 support