How to use from
Docker Model Runner
docker model run hf.co/r0b0tlab/Agents-A1-NVFP4
Quick Links

Agents-A1 NVFP4

This repository contains an NVIDIA ModelOpt NVFP4 quantization of InternScience/Agents-A1, a 35B Qwen3.5 MoE agentic model.

Credits and Attribution

This NVFP4 checkpoint is derived from InternScience/Agents-A1.

  • Base model: InternScience, for the Agents-A1 model, training recipe, technical report, and original BF16 Hugging Face release.
  • Quantization tooling: NVIDIA, for NVIDIA TensorRT Model Optimizer / NVIDIA ModelOpt, used to produce the NVFP4 ModelOpt checkpoint.
  • Model architecture and runtime ecosystem: Hugging Face Transformers, Safetensors, Accelerate, and the Hugging Face Hub.
  • Calibration data: CNN/DailyMail via Hugging Face Datasets, used for text-path post-training calibration.
  • Inference ecosystem: vLLM/SGLang compatibility is inherited from the Qwen3.5 MoE / ModelOpt NVFP4 ecosystem, subject to runtime support and validation.

Quantization Summary

Field Value
Base model InternScience/Agents-A1
Quantization tool NVIDIA ModelOpt 0.44.0
Quantization format NVFP4 / ModelOpt FP4
ModelOpt config mtq.NVFP4_MLP_ONLY_CFG
Calibration data abisee/cnn_dailymail, text-only calibration
Calibration sequence length 1024
Architecture Qwen3_5MoeForConditionalGeneration
License Apache-2.0, following the base model

Quantization Policy

Agents-A1 is a hybrid Qwen3.5 MoE model with 30 linear_attention layers, 10 full-attention layers, 256 experts per layer, and a vision tower. This checkpoint uses an MLP/MoE-only NVFP4 policy for the first release.

The following module families were explicitly excluded from NVFP4 quantization and preserved in BF16:

[
  "*visual*",
  "*vision*",
  "*patch_embed*",
  "*pos_embed*",
  "*merger*",
  "*linear_attn*",
  "*linear_attention*",
  "*self_attn*",
  "*attn*",
  "*embed_tokens*",
  "*lm_head*",
  "*mtp*"
]

Rationale:

  • The MoE/MLP expert layers are the largest parameter family and are the correct target for NVFP4 compression.
  • The GDN/linear_attn path is not standard dense transformer attention and is excluded for compatibility.
  • Vision modules are preserved to avoid multimodal degradation from text-only calibration.
  • Embeddings, lm_head, and MTP-sensitive modules are preserved in BF16.

Files

  • hf_quant_config.json — ModelOpt quantization metadata used by compatible inference engines.
  • modelopt_exclusions.json — exact exclusion list used during quantization.
  • config.json, tokenizer, and processor files are copied from the base model and patched only as required for export consistency.

Validation Status

This release is a quantized checkpoint, not a new fine-tune. It does not claim quality improvement over BF16.

Runtime smoke testing on NVIDIA GB10 / SM121 completed with the companion container recipe at r0b0tlab/agents-a1-nvfp4-sm121-vllm.

Validated evidence includes:

  1. Container audit on NVIDIA GB10 with CUDA capability [12, 1].
  2. vLLM extension imports: vllm._C, vllm._C_stable_libtorch, vllm._moe_C.
  3. Native FP4 support checks: cutlass_scaled_mm_supports_fp4(121) and (120) return true.
  4. Runtime log selection of FlashInferCutlassNvFp4LinearKernel and FLASHINFER_CUTLASS for NVFP4/MoE.
  5. OpenAI-compatible /v1/models and /v1/chat/completions probes against the running container.
  6. Lightweight live-container benchmark evidence in the companion repo: GSM8K 50-question lm-eval run at 98.00% exact match, direct HumanEval 50-question run at 48/50 (96.00%), c1/c2/c4/c8 concurrency sweep with 100% request success, and GPU telemetry including power draw.

Benchmark snapshot

The benchmark run agents-a1-nvfp4-gsm8k50-humaneval50-20260701T194211Z used the live OpenAI-compatible endpoint at http://127.0.0.1:18080/v1 with chat_template_kwargs.enable_thinking=false for scored requests.

Suite Harness Samples Result Notes
GSM8K lm-eval gsm8k 50 strict 98.00%, flexible 98.00% num_concurrent=2
HumanEval direct OpenAI-compatible evaluator 50 48/50 (96.00%) code extracted/evaluated locally
HumanEval stock lm-eval humaneval 50 0.00% preserved as harness-interference evidence; stock stop rules truncate chat-model output

Combined telemetry across GSM8K, HumanEval, and direct HumanEval averaged 27.88 W GPU power draw, 70.75% GPU utilization, and 58.83°C, with maxima of 36.00 W, 96.00%, and 65.00°C over 166 telemetry samples. The c8 concurrency sweep completed 24/24 requests successfully. See the companion repo's benchmarks/agents-a1-nvfp4-gsm8k50-humaneval50-20260701T194211Z/ directory for raw logs, samples, summaries, telemetry CSVs, and MANIFEST.sha256.

SM121 container quick start

docker run --rm --gpus all --ipc=host \
  --name agents-a1-nvfp4-vllm \
  -p 18080:8000 \
  -e MODEL_ID=r0b0tlab/Agents-A1-NVFP4 \
  ghcr.io/r0b0tlab/agents-a1-nvfp4-sm121-vllm:latest

For fully pinned local reproduction, clone/download this model and mount it read-only:

docker run --rm --gpus all --ipc=host \
  --name agents-a1-nvfp4-vllm \
  -p 18080:8000 \
  -v /path/to/Agents-A1-NVFP4:/models/Agents-A1-NVFP4:ro \
  ghcr.io/r0b0tlab/agents-a1-nvfp4-sm121-vllm:latest

Limitations

  • Calibration is text-only; vision components are preserved in BF16 rather than calibrated.
  • This card does not claim benchmark parity until BF16-vs-NVFP4 evaluations are published.
  • Runtime support depends on the inference engine's ModelOpt/NVFP4 implementation.

Citation

@misc{internscience_agents_a1_2026,
  title = {Agents-A1: Scaling the Horizon, Not the Parameters: Reaching Trillion-Parameter Performance with a 35B Agent},
  author = {InternScience},
  year = {2026},
  url = {https://huggingface.co/InternScience/Agents-A1}
}

License

This quantized checkpoint follows the base model license, Apache-2.0. Users must also comply with the licenses and terms for the base model, calibration data, NVIDIA ModelOpt, Hugging Face libraries, and any inference runtime used.

Downloads last month
286
Safetensors
Model size
19B params
Tensor type
BF16
·
F8_E4M3
·
U8
·
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for r0b0tlab/Agents-A1-NVFP4

Quantized
(47)
this model