Instructions to use mlx-community/Agents-A1-8bit with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- MLX
How to use mlx-community/Agents-A1-8bit with MLX:
# Make sure mlx-vlm is installed # pip install --upgrade mlx-vlm from mlx_vlm import load, generate from mlx_vlm.prompt_utils import apply_chat_template from mlx_vlm.utils import load_config # Load the model model, processor = load("mlx-community/Agents-A1-8bit") config = load_config("mlx-community/Agents-A1-8bit") # Prepare input image = ["http://images.cocodataset.org/val2017/000000039769.jpg"] prompt = "Describe this image." # Apply chat template formatted_prompt = apply_chat_template( processor, config, prompt, num_images=1 ) # Generate output output = generate(model, processor, formatted_prompt, image) print(output) - Notebooks
- Google Colab
- Kaggle
- Local Apps Settings
- LM Studio
- Pi
How to use mlx-community/Agents-A1-8bit with Pi:
Start the MLX server
# Install MLX LM: uv tool install mlx-lm # Start a local OpenAI-compatible server: mlx_lm.server --model "mlx-community/Agents-A1-8bit"
Configure the model in Pi
# Install Pi: npm install -g @mariozechner/pi-coding-agent # Add to ~/.pi/agent/models.json: { "providers": { "mlx-lm": { "baseUrl": "http://localhost:8080/v1", "api": "openai-completions", "apiKey": "none", "models": [ { "id": "mlx-community/Agents-A1-8bit" } ] } } }Run Pi
# Start Pi in your project directory: pi
- Hermes Agent new
How to use mlx-community/Agents-A1-8bit with Hermes Agent:
Start the MLX server
# Install MLX LM: uv tool install mlx-lm # Start a local OpenAI-compatible server: mlx_lm.server --model "mlx-community/Agents-A1-8bit"
Configure Hermes
# Install Hermes: curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash hermes setup # Point Hermes at the local server: hermes config set model.provider custom hermes config set model.base_url http://127.0.0.1:8080/v1 hermes config set model.default mlx-community/Agents-A1-8bit
Run Hermes
hermes
Configure the model in Pi
# Install Pi:
npm install -g @mariozechner/pi-coding-agent# Add to ~/.pi/agent/models.json:
{
"providers": {
"mlx-lm": {
"baseUrl": "http://localhost:8080/v1",
"api": "openai-completions",
"apiKey": "none",
"models": [
{
"id": "mlx-community/Agents-A1-8bit"
}
]
}
}
}Run Pi
# Start Pi in your project directory:
piAgents-A1 — MLX (8-bit)
MLX 8-bit quantization of InternScience/Agents-A1 (affine, group size 64). The source is bf16; this is a uniform mlx quantization.
Agents-A1 is a Qwen3.5-MoE vision-language agent model (qwen3_5_moe, Qwen3_5MoeForConditionalGeneration): 40 decoder layers, 256 routed experts per layer + a shared expert, hidden size 2048, with a vision tower and video preprocessing.
Running it
Multimodal (VLM) — load with mlx-vlm (mlx-lm can't load multimodal architectures):
pip install mlx-vlm
python -m mlx_vlm.generate --model mlx-community/Agents-A1-8bit \
--prompt "What is 17 * 24? Think step by step." --max-tokens 512
# with an image:
python -m mlx_vlm.generate --model mlx-community/Agents-A1-8bit --image img.jpg --prompt "Describe this image."
Loads and runs in stock mlx-vlm — no patched code needed at inference.
Conversion notes
I first tried oMLX's data-driven oQ quantization, but it doesn't work for this checkpoint: oQ writes the MoE experts in a per-expert layout that omlx's own loader can't read back (parameters not in model), so the quantized model fails to load. This build therefore uses standard mlx-vlm quantization instead — uniform 8-bit, group size 64 — which loads cleanly in both stock mlx-vlm & oMLX.
Throughput
Measured with oMLX's benchmark harness on a Macbook Pro M5 Max 128GB 40 GPU — gen 128 tokens, cold prefill (unique prompt prefix per request, no cache reuse).
Single request (batch 1) — decode tok/s by context
| Context | bf16 | 8-bit | 6-bit | 5-bit | 4-bit | 3-bit |
|---|---|---|---|---|---|---|
| 1,024 | 67.6 | 95.4 | 95.2 | 98.2 | 117.4 | 133.0 |
| 4,096 | 67.6 | 94.0 | 97.3 | 102.8 | 119.5 | 130.4 |
| 8,192 | 66.8 | 91.7 | 95.3 | 103.1 | 115.7 | 126.9 |
| 16,384 | 64.7 | 88.0 | 91.5 | 80.5 | 105.8 | 119.8 |
| 32,768 | 60.9 | 80.6 | 88.6 | 80.2 | 95.6 | 104.2 |
| 65,536 | 53.5 | 68.4 | 67.6 | 66.6 | 75.4 | 83.5 |
| 131,072 | 40.7 | 48.7 | 50.9 | 48.2 | 50.3 | 52.5 |
| Peak RAM (GB) | 66–69 | 35–39 | 27–31 | 23–26 | 19–22 | 15–18 |
TTFT (cold prefill) is ~precision-independent — ≈0.3 s @1k, 3 s @8k, 21 s @32k, 63 s @64k, ~225 s @128k — prefill is compute-bound, not weight-bound.
Continuous batching (1k context) — aggregate decode tok/s
| Batch | bf16 | 8-bit | 6-bit | 5-bit | 4-bit | 3-bit |
|---|---|---|---|---|---|---|
| 1 | 67.6 | 95.4 | 95.2 | 98.2 | 117.4 | 133.0 |
| 2 | 62.5 | 151.0 | 156.5 | 160.6 | 190.9 | 188.7 |
| 4 | 107.1 | 202.0 | 185.1 | 195.7 | 239.9 | 230.2 |
| 8 | 129.6 | 252.4 | 223.4 | 238.7 | 289.0 | 276.1 |
Aggregate across the batch; per-request rate is that value divided by the batch size.
Smoke test
17 x 24 -> correct (408), coherent, no repetition.
Other precisions
| Precision | Repo | Size on disk |
|---|---|---|
| bf16 (full) | Agents-A1-bf16 | ~65 GB |
| 8-bit | Agents-A1-8bit | ~35 GB |
| 6-bit | Agents-A1-6bit | ~27 GB |
| 5-bit | Agents-A1-5bit | ~23 GB |
| 4-bit | Agents-A1-4bit | ~19 GB |
| 3-bit | Agents-A1-3bit | ~15 GB |
License
apache-2.0, inherited from the base model.
- Downloads last month
- -
8-bit
Model tree for mlx-community/Agents-A1-8bit
Base model
InternScience/Agents-A1
Start the MLX server
# Install MLX LM: uv tool install mlx-lm# Start a local OpenAI-compatible server: mlx_lm.server --model "mlx-community/Agents-A1-8bit"