How to use from
Pi
Start the MLX server
# Install MLX LM:
uv tool install mlx-lm
# Start a local OpenAI-compatible server:
mlx_lm.server --model "leonsarmiento/Agents-A1-5bit-XL-mlx"
Configure the model in Pi
# Install Pi:
npm install -g @mariozechner/pi-coding-agent
# Add to ~/.pi/agent/models.json:
{
  "providers": {
    "mlx-lm": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        {
          "id": "leonsarmiento/Agents-A1-5bit-XL-mlx"
        }
      ]
    }
  }
}
Run Pi
# Start Pi in your project directory:
pi
Quick Links

leonsarmiento/Agents-A1-5bit-XL-mlx

This model was converted to MLX format from InternScience/Agents-A1 using BaseQuant_XL 5/8-bit mixed quantization optimized for Apple Silicon. The vision encoder is preserved and quantized at 5-bit, making this a full multimodal model.

BaseQuant_XL keeps the most routing-critical layers in full bf16 precision — the MoE router gate, shared expert gate, shared expert, and lm_head — while applying aggressive quantization to the bulk parameters. This preserves routing accuracy and output quality where it matters most.

Agents-A1 is a 35B Mixture-of-Experts agentic model built to scale heterogeneous agentic abilities across multiple domains including Long-horizon Search, Engineering, Scientific Research, Instruction Following, and Tool-calling. It features 256 experts (8 active per token + 1 shared expert), hybrid full + linear (Gated DeltaNet) attention, a vision encoder, and an extended 262K context window. Despite 35B total parameters, only ~3B are activated per token.

Use with mlx

pip install -U mlx-vlm
python -m mlx_vlm.generate --model leonsarmiento/Agents-A1-5bit-XL-mlx --max-tokens 256 --temperature 0.85 --top-p 0.95 --top-k 20 --min-p 0.01 --repeat-penalty 1.05 --prompt "Hello"

BaseQuant_XL Quantization Strategy

Bit Depth Layers Rationale
bf16 (unquantized) mlp.gate (router), shared_expert_gate, lm_head, shared_expert Routing decisions and shared computation path — errors here are qualitatively different from precision loss
8-bit embed_tokens, self_attn (full attention), linear_attn (DeltaNet) Every-token layers with moderate sensitivity — 8-bit is near-lossless
5-bit vision_tower, switch_mlp (routed experts) Bulk of parameters, only 8 of 256 experts active per token — natural redundancy tolerates lower precision

Quantization Details

Layer Bits Group Size
mlp.gate (router) bf16
shared_expert_gate bf16
lm_head bf16
shared_expert bf16
embed_tokens 8 64
self_attn (full attention) 8 64
linear_attn (DeltaNet) 8 64
vision_tower 5 64
switch_mlp (routed experts) 5 64
Default fallback 8 64
  • Quantization type: BaseQuant_XL mixed (multimodal, vision preserved)
  • Bits per weight: 5.881
  • Total size: ~24 GB (5 shards)
  • Group size: 64
  • Method: Custom quant_predicate via mlx_vlm

Recommended Inference Parameters

Parameter Value
temperature 0.85
top_p 0.95
top_k 20
min_p 0.01
repeat_penalty 1.05
presence_penalty 1.1

Reasoning and Tool-Call Parsing

Parser Value
reasoning_parser qwen3
tool_call_parser qwen3_coder
Downloads last month
132
Safetensors
Model size
7B params
Tensor type
BF16
·
U32
·
MLX
Hardware compatibility
Log In to add your hardware

5-bit

Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for leonsarmiento/Agents-A1-5bit-XL-mlx

Quantized
(47)
this model

Collection including leonsarmiento/Agents-A1-5bit-XL-mlx