How to use with MLX LM
Generate or start a chat session
# Install MLX LM
uv tool install mlx-lm
# Interactive chat REPL
mlx_lm.chat --model "wang-yang/Agents-A1-MTPLX-Q4"
Run an OpenAI-compatible server
# Start the server
mlx_lm.server --model "wang-yang/Agents-A1-MTPLX-Q4"
# Calling the OpenAI-compatible server with curl
curl -X POST "http://localhost:8080/v1/chat/completions" \
   -H "Content-Type: application/json" \
   --data '{
     "model": "wang-yang/Agents-A1-MTPLX-Q4",
     "messages": [
       {"role": "user", "content": "Hello"}
     ]
   }'
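The same request can be sent from Python. The following is a minimal sketch using only the standard library; the helper names `build_chat_request` and `chat` are illustrative and not part of mlx-lm, and the response layout (`choices[0].message.content`) is assumed to follow the OpenAI chat completions format that the server emulates.

```python
import json
import urllib.request

MODEL_ID = "wang-yang/Agents-A1-MTPLX-Q4"


def build_chat_request(user_text, base_url, model=MODEL_ID):
    """Build a POST request for the /v1/chat/completions endpoint (hypothetical helper)."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": user_text}],
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(user_text, base_url):
    """Send one user message and return the assistant's reply text.

    Assumes an OpenAI-style response body; requires the server to be running.
    """
    with urllib.request.urlopen(build_chat_request(user_text, base_url)) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

With the server running, call `chat("Hello", base_url=...)` and pass the same scheme, host, and port used in the curl command.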

Agents-A1-MTPLX-Q4

A 4-bit quantized MLX version of InternScience/Agents-A1 with a grafted MTP (Multi-Token Prediction) head for speculative decoding on Apple Silicon.

Model Details

  • Base model: InternScience/Agents-A1 (Qwen3.5-MoE architecture, 35B total / 3B active parameters)
  • Quantization: 4-bit affine (group size 64), router gates at 8-bit
  • MTP head: Grafted from Qwen3.5-35B-A3B (4-bit quantized, 1 layer)
  • Format: MLX safetensors
  • Disk size: ~18 GB (model) + 1.6 GB (MTP sidecar)
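As a rough sanity check on the listed disk size, the sketch below estimates the storage cost of 4-bit affine quantization with group size 64. It assumes each group stores one 16-bit scale and one 16-bit bias (an assumption about the storage layout, not something the model card states), and it ignores the 8-bit router gates and any tensors left unquantized, so treat the result as a ballpark only.

```python
def quantized_size_bytes(num_params, bits=4, group_size=64,
                         scale_bits=16, bias_bits=16):
    """Approximate size of affine-quantized weights in bytes.

    Each weight costs `bits` bits, plus a per-group scale and bias
    amortized over `group_size` weights (assumed 16 bits each).
    """
    bits_per_weight = bits + (scale_bits + bias_bits) / group_size
    return num_params * bits_per_weight / 8


size = quantized_size_bytes(35e9)  # 35B total parameters, from the model card
print(f"{size / 1e9:.1f} GB ({size / 2**30:.1f} GiB)")
```

Under these assumptions the effective cost is 4.5 bits per weight, which lands in the same range as the ~18 GB figure above.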

Architecture

  • Hidden size: 2048
  • Layers: 40 (hybrid linear + full attention)
  • Experts: 256 total, 8 active per token
  • Vocab: 248,320
  • Context: 262,144 tokens
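To illustrate what "256 total, 8 active per token" means, here is a generic top-k routing sketch in plain Python. It is not taken from the model's implementation: the function name `route` is hypothetical, and normalizing with a softmax over only the selected logits is one common convention that is assumed here.

```python
import math
import random

NUM_EXPERTS = 256  # from the model card
TOP_K = 8          # from the model card


def route(router_logits, top_k=TOP_K):
    """Return (expert_index, weight) pairs for the top_k experts.

    Weights come from a softmax over the selected logits only
    (an assumed convention), so they sum to 1.
    """
    top = sorted(range(len(router_logits)),
                 key=lambda i: router_logits[i], reverse=True)[:top_k]
    peak = max(router_logits[i] for i in top)  # subtract max for numerical stability
    exps = [math.exp(router_logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


random.seed(0)
logits = [random.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]  # stand-in router output
selected = route(logits)
print(f"{len(selected)} of {NUM_EXPERTS} experts active "
      f"({TOP_K / NUM_EXPERTS:.1%} of experts per token)")
```

Only the selected experts run for a given token, which is why the active parameter count (3B) is far below the total (35B).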

Usage with MTPLX

mtplx start --model wang-yang/Agents-A1-MTPLX-Q4

Usage with mlx-lm

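from mlx_lm import load, generate

# Download (if needed) and load the quantized weights and tokenizer
model, tokenizer = load("wang-yang/Agents-A1-MTPLX-Q4")
# ChatML-formatted prompt: one user turn, then the assistant header
prompt = "<|im_start|>user\nHello!<|im_end|>\n<|im_start|>assistant\n"
result = generate(model, tokenizer, prompt=prompt, max_tokens=200)
print(result)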
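For multi-turn conversations, writing the prompt string by hand becomes error-prone. The helper below is a minimal sketch that builds the same ChatML layout as the hand-written prompt from a list of messages; `build_chatml_prompt` is a hypothetical name, and the bundled chat_template.jinja remains the authoritative format (typically applied via `tokenizer.apply_chat_template`).

```python
def build_chatml_prompt(messages):
    """Format a list of {"role", "content"} dicts as a ChatML prompt.

    Mirrors the hand-written prompt: each turn is wrapped in
    <|im_start|>role ... <|im_end|>, followed by an open assistant header.
    """
    parts = [
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n"
        for m in messages
    ]
    parts.append("<|im_start|>assistant\n")  # cue the model to answer
    return "".join(parts)


prompt = build_chatml_prompt([{"role": "user", "content": "Hello!"}])
print(repr(prompt))
```

The resulting string can be passed as `prompt=` to `generate` in place of the literal.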

Notes

  • EOS token: <|im_end|> (id 248046)
  • MTP speculative decoding: ~1.33x speedup over plain autoregressive (AR) decoding (best setting D2: 101.8 tok/s vs. 76.6 tok/s for AR, measured on an M3 Max with 128 GB).
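The speedup figure follows directly from the two throughput numbers. The short sketch below reproduces the arithmetic and, as an illustration, estimates the time saved on a 200-token generation; it considers decode throughput only and ignores prompt processing time.

```python
AR_TOK_PER_S = 76.6    # autoregressive baseline, from the note above
MTP_TOK_PER_S = 101.8  # MTP speculative decoding, from the note above


def seconds_for(n_tokens, tok_per_s):
    """Decode time for n_tokens at a constant throughput."""
    return n_tokens / tok_per_s


ratio = MTP_TOK_PER_S / AR_TOK_PER_S
saved = seconds_for(200, AR_TOK_PER_S) - seconds_for(200, MTP_TOK_PER_S)
print(f"speedup: {ratio:.2f}x, time saved on 200 tokens: {saved:.2f} s")
```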

Files

File                               Description
model-0000X-of-00004.safetensors   Quantized model weights (4 shards)
mtp.safetensors                    MTP draft head weights (4-bit quantized)
config.json                        Model architecture + quantization config
tokenizer.json                     Tokenizer vocabulary
tokenizer_config.json              Tokenizer settings
chat_template.jinja                Chat template (no thinking mode)