How to use from the MLX library
# Make sure mlx-vlm is installed
# pip install --upgrade mlx-vlm

from mlx_vlm import load, generate
from mlx_vlm.prompt_utils import apply_chat_template
from mlx_vlm.utils import load_config

# Load the model
model, processor = load("mlx-community/Agents-A1-8bit")
config = load_config("mlx-community/Agents-A1-8bit")

# Prepare input
image = ["http://images.cocodataset.org/val2017/000000039769.jpg"]
prompt = "Describe this image."

# Apply chat template
formatted_prompt = apply_chat_template(
    processor, config, prompt, num_images=1
)

# Generate output
output = generate(model, processor, formatted_prompt, image)
print(output)

Agents-A1 — MLX (8-bit)

MLX 8-bit quantization of InternScience/Agents-A1 (affine, group size 64). The source is bf16; this is a uniform MLX quantization.
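
For readers unfamiliar with the term, here is a minimal pure-Python sketch of what "affine, group size 64" means: each run of 64 weights gets its own scale and bias, and each weight is stored as an 8-bit code. It only illustrates the idea and is not MLX's actual kernel; the function names and the random stand-in weights are made up for the example.

```python
import random

def quantize_group(weights, bits=8):
    """Affine-quantize one group so that w ~= scale * code + bias."""
    levels = (1 << bits) - 1              # 255 distinct steps for 8-bit codes
    w_min, w_max = min(weights), max(weights)
    scale = (w_max - w_min) / levels or 1.0   # fall back to 1.0 for a constant group
    codes = [round((w - w_min) / scale) for w in weights]
    return codes, scale, w_min

def dequantize_group(codes, scale, bias):
    """Recover approximate weights from the integer codes."""
    return [scale * q + bias for q in codes]

def quantize_row(row, group_size=64, bits=8):
    """Split a row into consecutive groups and quantize each independently."""
    return [
        quantize_group(row[start:start + group_size], bits)
        for start in range(0, len(row), group_size)
    ]

random.seed(0)
row = [random.gauss(0.0, 0.02) for _ in range(256)]   # hypothetical weights, 4 groups of 64
groups = quantize_row(row)
restored = [w for g in groups for w in dequantize_group(*g)]

max_error = max(abs(a - b) for a, b in zip(row, restored))
max_half_step = max(scale for _, scale, _ in groups) / 2
print(f"max round-trip error {max_error:.2e} (half a quantization step is {max_half_step:.2e})")
```

The round-trip error of each weight is bounded by half a quantization step of its group, which is why smaller groups and more bits both reduce error at the cost of size.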

Agents-A1 is a Qwen3.5-MoE vision-language agent model (qwen3_5_moe, Qwen3_5MoeForConditionalGeneration): 40 decoder layers, 256 routed experts per layer + a shared expert, hidden size 2048, with a vision tower and video preprocessing.

Running it

Multimodal (VLM) — load with mlx-vlm (mlx-lm can't load multimodal architectures):

pip install mlx-vlm
python -m mlx_vlm.generate --model mlx-community/Agents-A1-8bit \
  --prompt "What is 17 * 24? Think step by step." --max-tokens 512
# with an image:
python -m mlx_vlm.generate --model mlx-community/Agents-A1-8bit --image img.jpg --prompt "Describe this image."

Loads and runs in stock mlx-vlm — no patched code needed at inference.

Conversion notes

I first tried oMLX's data-driven oQ quantization, but it doesn't work for this checkpoint: oQ writes the MoE experts in a per-expert layout that oMLX's own loader can't read back (parameters not in model), so the quantized model fails to load. This build therefore uses standard mlx-vlm quantization instead (uniform 8-bit, group size 64), which loads cleanly in both stock mlx-vlm and oMLX.

Throughput

Measured with oMLX's benchmark harness on a MacBook Pro (M5 Max, 128 GB, 40-core GPU): 128 generated tokens per request, cold prefill (unique prompt prefix per request, no cache reuse).

Single request (batch 1) — decode tok/s by context

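| Context (tokens) | bf16 | 8-bit | 6-bit | 5-bit | 4-bit | 3-bit |
|---|---|---|---|---|---|---|
| 1,024 | 67.6 | 95.4 | 95.2 | 98.2 | 117.4 | 133.0 |
| 4,096 | 67.6 | 94.0 | 97.3 | 102.8 | 119.5 | 130.4 |
| 8,192 | 66.8 | 91.7 | 95.3 | 103.1 | 115.7 | 126.9 |
| 16,384 | 64.7 | 88.0 | 91.5 | 80.5 | 105.8 | 119.8 |
| 32,768 | 60.9 | 80.6 | 88.6 | 80.2 | 95.6 | 104.2 |
| 65,536 | 53.5 | 68.4 | 67.6 | 66.6 | 75.4 | 83.5 |
| 131,072 | 40.7 | 48.7 | 50.9 | 48.2 | 50.3 | 52.5 |
| Peak RAM (GB) | 66–69 | 35–39 | 27–31 | 23–26 | 19–22 | 15–18 |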
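
To make the single-request numbers easier to compare, this short sketch computes the decode speedup of this 8-bit build over bf16 at each context length. The values are copied from the bf16 and 8-bit columns above; nothing else is measured here.

```python
# Decode tok/s copied from the single-request table (bf16 and 8-bit columns).
contexts = [1024, 4096, 8192, 16384, 32768, 65536, 131072]
bf16 = [67.6, 67.6, 66.8, 64.7, 60.9, 53.5, 40.7]
q8 = [95.4, 94.0, 91.7, 88.0, 80.6, 68.4, 48.7]

# Ratio of 8-bit to bf16 decode speed at each context length.
speedups = [round(q / b, 2) for q, b in zip(q8, bf16)]

for ctx, s in zip(contexts, speedups):
    print(f"{ctx:>7,} tokens: 8-bit decodes {s:.2f}x as fast as bf16")
```

The ratio shrinks steadily as context grows. One plausible reading, not verified here, is that per-token attention and KV-cache work grows with context and is not reduced by weight quantization.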

TTFT (cold prefill) is roughly independent of precision: ≈0.3 s @1k, 3 s @8k, 21 s @32k, 63 s @64k, ~225 s @128k. Prefill is compute-bound, not weight-bound.
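
The TTFT figures can be turned into an approximate prefill rate by dividing prompt length by time to first token. This sketch assumes "@1k", "@8k" and so on refer to the 1,024, 8,192, ... token contexts used in the table, and that TTFT is dominated by prefill.

```python
# Approximate TTFT in seconds, keyed by context length in tokens (assumed mapping).
ttft_seconds = {1024: 0.3, 8192: 3, 32768: 21, 65536: 63, 131072: 225}

# Prompt tokens processed per second during cold prefill.
prefill_rate = {ctx: ctx / t for ctx, t in ttft_seconds.items()}

for ctx, rate in prefill_rate.items():
    print(f"{ctx:>7,} tokens: ~{rate:,.0f} prompt tok/s")
```

Under these assumptions the prefill rate falls as the prompt gets longer, which is what you would expect if attention cost grows faster than linearly with context.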

Continuous batching (1k context) — aggregate decode tok/s

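| Batch | bf16 | 8-bit | 6-bit | 5-bit | 4-bit | 3-bit |
|---|---|---|---|---|---|---|
| 1 | 67.6 | 95.4 | 95.2 | 98.2 | 117.4 | 133.0 |
| 2 | 62.5 | 151.0 | 156.5 | 160.6 | 190.9 | 188.7 |
| 4 | 107.1 | 202.0 | 185.1 | 195.7 | 239.9 | 230.2 |
| 8 | 129.6 | 252.4 | 223.4 | 238.7 | 289.0 | 276.1 |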

Aggregate across the batch; per-request rate is that value divided by the batch size.
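
As a worked example of that division, the sketch below takes the 8-bit column from the batching table and prints the per-request rate, along with how much aggregate throughput grows relative to batch 1.

```python
# Aggregate decode tok/s for the 8-bit build, copied from the batching table.
batch_sizes = [1, 2, 4, 8]
aggregate_8bit = [95.4, 151.0, 202.0, 252.4]

# Per-request rate is the aggregate divided by the number of concurrent requests.
per_request = [agg / b for agg, b in zip(aggregate_8bit, batch_sizes)]

# Aggregate throughput relative to a single request.
scaling = [agg / aggregate_8bit[0] for agg in aggregate_8bit]

for b, rate, s in zip(batch_sizes, per_request, scaling):
    print(f"batch {b}: {rate:.1f} tok/s per request, {s:.2f}x aggregate vs batch 1")
```

So batching trades per-request speed for total throughput: at batch 8 each request decodes at roughly a third of the single-request rate, while the machine as a whole produces about 2.6x as many tokens per second.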

Smoke test

17 × 24: answered correctly (408); output was coherent, with no repetition.

Other precisions

| Precision | Repo | Size on disk |
|---|---|---|
| bf16 (full) | Agents-A1-bf16 | ~65 GB |
| 8-bit | Agents-A1-8bit | ~35 GB |
| 6-bit | Agents-A1-6bit | ~27 GB |
| 5-bit | Agents-A1-5bit | ~23 GB |
| 4-bit | Agents-A1-4bit | ~19 GB |
| 3-bit | Agents-A1-3bit | ~15 GB |
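
The listed sizes line up with a simple back-of-the-envelope estimate. The sketch assumes that every weight is quantized at the stated bit width with group size 64, and that each group carries one 16-bit scale and one 16-bit bias (0.5 extra bits per weight). Both are assumptions for illustration, not details confirmed by this card.

```python
GROUP_SIZE = 64
# Assumed per-group overhead: one 16-bit scale plus one 16-bit bias.
OVERHEAD_BITS = 2 * 16 / GROUP_SIZE      # 0.5 bits per weight

BF16_SIZE_GB = 65                        # from the table above
# bf16 stores 16 bits per weight, so this is the implied weight count in billions.
weights_billions = BF16_SIZE_GB * 8 / 16

listed_gb = {8: 35, 6: 27, 5: 23, 4: 19, 3: 15}   # from the table above
estimates = {}
for bits, listed in listed_gb.items():
    estimates[bits] = weights_billions * (bits + OVERHEAD_BITS) / 8
    print(f"{bits}-bit: estimated ~{estimates[bits]:.1f} GB, listed ~{listed} GB")
```

Each estimate lands within 1 GB below the listed size. A plausible explanation for the small gap is that some tensors are kept at higher precision, though that is not checked here.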

License

apache-2.0, inherited from the base model.

Safetensors model size: 10B params. Tensor types: BF16, U32.