Qwen-AgentWorld-35B-A3B-oQ3.5

oQ3.5 (data-driven mixed-precision, ≈3.5 bpw) MLX quantization of Qwen/Qwen-AgentWorld-35B-A3B, produced with oMLX's quantize_oq_streaming.

For Apple Silicon. Runs in mlx-lm, oMLX, or any MLX app. Siblings: oQ4 (higher-quality, ≈4.6 bpw) and bf16 (full precision).

Notes

  • Text-only. The base checkpoint declares a vision_config and MTP heads in config.json but ships no vision or mtp.* weights (693 tensors, 0 vision, 0 MTP); both entries are vestigial, inherited from the Qwen3.5 base. This quant contains the complete language model; nothing multimodal was dropped.
  • ≈16 GB on disk (down from ≈69 GB in bf16). Peak memory during generation: ≈17.6 GB.
  • Mixed-precision: per-layer bit allocation from oQ's sensitivity measurement; most weights are 4-bit-class with sensitive layers boosted.
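As a rough cross-check of the size figures in these notes, the average bits per weight can be estimated from the two on-disk sizes alone. The sketch below assumes the bf16 checkpoint stores 16 bits per weight and that both files hold the same set of weights; `effective_bpw` is an illustrative helper, not part of oMLX or mlx-lm.

```python
def effective_bpw(quant_gb: float, source_gb: float, source_bits: float = 16.0) -> float:
    """Estimate the average bits per weight of a quantized checkpoint.

    Assumes both checkpoints contain the same weights, so the size ratio
    equals the ratio of average bits per weight.
    """
    return source_bits * quant_gb / source_gb


if __name__ == "__main__":
    # Approximate, rounded sizes quoted in the notes above.
    bpw = effective_bpw(quant_gb=16.0, source_gb=69.0)
    print(f"effective bits per weight: {bpw:.2f}")
```

With the rounded sizes this lands slightly above the nominal ≈3.5 bpw. Because both inputs are rounded to the nearest GB and the files may hold some non-weight data, treat the result as indicative only.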

Performance

Measured with oMLX (Auto engine) on an M5 Max (40-core GPU, 128 GB RAM). Single request; context is given as prompt tokens / generated tokens (pp/tg):

Context (pp/tg)   TTFT      Decode      Prefill      Peak mem
1024 / 128        461 ms    148 tok/s   2223 tok/s   17.3 GB
4096 / 128        1.17 s    139 tok/s   3510 tok/s   18.1 GB
8192 / 128        2.21 s    135 tok/s   3707 tok/s   18.4 GB
32768 / 128       11.5 s    118 tok/s   2856 tok/s   20.4 GB
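The TTFT and prefill columns are related: if time to first token is dominated by prompt processing, TTFT is roughly the prompt length divided by the prefill rate. The sketch below checks that against the table and adds a rough end-to-end latency estimate. The function names are illustrative, and the model (prefill time plus decode time, nothing else) is a simplifying assumption.

```python
def ttft_seconds(prompt_tokens: int, prefill_tok_per_s: float) -> float:
    """Time to first token if prefill dominates: tokens / prefill rate."""
    return prompt_tokens / prefill_tok_per_s


def request_seconds(prompt_tokens: int, gen_tokens: int,
                    prefill_tok_per_s: float, decode_tok_per_s: float) -> float:
    """Rough end-to-end latency: prefill time plus decode time."""
    return ttft_seconds(prompt_tokens, prefill_tok_per_s) + gen_tokens / decode_tok_per_s


# (prompt tokens, prefill tok/s, decode tok/s) copied from the oQ3.5 table.
ROWS = [
    (1024, 2223, 148),
    (4096, 3510, 139),
    (8192, 3707, 135),
    (32768, 2856, 118),
]

if __name__ == "__main__":
    for pp, prefill, decode in ROWS:
        ttft = ttft_seconds(pp, prefill)
        total = request_seconds(pp, 128, prefill, decode)
        print(f"pp={pp:>5}: TTFT ~{ttft:.2f} s, 128-token request ~{total:.2f} s")
```

The estimates match the measured TTFT column to within rounding (for example, 1024 / 2223 ≈ 0.46 s against the measured 461 ms), so either column can be used to budget latency for other prompt lengths in this range.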

Continuous batching (pp1024/tg128): 1×→148 · 2×→193 · 4×→265 · 8×→352 tok/s aggregate decode (2.37× at 8 concurrent requests).

Reference: BF16 source

Same setup, full-precision Qwen/Qwen-AgentWorld-35B-A3B:

Context (pp/tg)   TTFT      Decode      Prefill      Peak mem
1024 / 128        644 ms    77 tok/s    1591 tok/s   65.6 GB
4096 / 128        1.68 s    76 tok/s    2434 tok/s   66.4 GB
8192 / 128        2.39 s    75 tok/s    3428 tok/s   66.7 GB
32768 / 128       12.0 s    67 tok/s    2730 tok/s   68.7 GB

Continuous batching (pp1024/tg128): 1×→77 · 2×→68 · 4×→114 · 8×→119 tok/s.

Takeaway: oQ3.5 gives ≈1.9× the single-request decode throughput at ≈¼ the memory (17 GB vs 66 GB), and scales better under batching (2.37× vs 1.55× at 8×).
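The ratios in the takeaway can be recomputed from the two tables. The sketch below does so with the rounded table values; the dictionary layout and function names are illustrative only.

```python
# Figures copied from the oQ3.5 and BF16 performance sections above.
OQ35 = {
    "decode": {1024: 148, 4096: 139, 8192: 135, 32768: 118},   # tok/s
    "peak_gb": {1024: 17.3, 4096: 18.1, 8192: 18.4, 32768: 20.4},
    "batch": {1: 148, 2: 193, 4: 265, 8: 352},                 # aggregate tok/s
}
BF16 = {
    "decode": {1024: 77, 4096: 76, 8192: 75, 32768: 67},
    "peak_gb": {1024: 65.6, 4096: 66.4, 8192: 66.7, 32768: 68.7},
    "batch": {1: 77, 2: 68, 4: 114, 8: 119},
}


def decode_speedup(ctx: int) -> float:
    """Single-request decode throughput of oQ3.5 relative to BF16."""
    return OQ35["decode"][ctx] / BF16["decode"][ctx]


def memory_ratio(ctx: int) -> float:
    """Peak memory of oQ3.5 as a fraction of BF16 peak memory."""
    return OQ35["peak_gb"][ctx] / BF16["peak_gb"][ctx]


def batch_scaling(model: dict, concurrency: int) -> float:
    """Aggregate decode throughput relative to a single request."""
    return model["batch"][concurrency] / model["batch"][1]


if __name__ == "__main__":
    for ctx in OQ35["decode"]:
        print(f"ctx={ctx:>5}: decode x{decode_speedup(ctx):.2f}, "
              f"memory {memory_ratio(ctx):.0%} of BF16")
    print(f"8x batching: oQ3.5 x{batch_scaling(OQ35, 8):.2f}, "
          f"BF16 x{batch_scaling(BF16, 8):.2f}")
```

From the rounded values, the decode speedup runs from about 1.92× at 1024 tokens down to about 1.76× at 32768 tokens, so the ≈1.9× figure describes the short-context case. The 8× batching ratio comes out as about 2.38× rather than the quoted 2.37×, presumably because the quoted figure was computed from unrounded throughputs.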

Accuracy (quick reference)

A quick, non-representative sanity check — 100-question samples per benchmark with thinking enabled, run via oMLX's accuracy bench. Not enough to draw firm conclusions, but it gives a rough idea of how much quality the quant retains vs the BF16 source.

Benchmark   BF16     oQ4      oQ3.5
MathQA      85.0%    84.0%    83.0%
MMLU-Pro    76.0%    77.0%    72.0%

oQ4 tracks BF16 within ≈1 pp; oQ3.5 trades a few more points (notably on MMLU-Pro) for smaller size and faster decode.
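To put the 100-question sample size in perspective, the sketch below computes a 95% confidence interval for each score using the normal (Wald) approximation for a binomial proportion. It treats each score as an independent random sample of 100 questions, which is an assumption; `proportion_ci` is an illustrative helper and not part of oMLX's accuracy bench.

```python
import math


def proportion_ci(accuracy: float, n: int, z: float = 1.96) -> tuple:
    """Normal-approximation (Wald) confidence interval for an accuracy.

    accuracy is a fraction in [0, 1]; n is the number of questions.
    """
    se = math.sqrt(accuracy * (1.0 - accuracy) / n)
    return accuracy - z * se, accuracy + z * se


# Scores copied from the accuracy table above, as fractions.
SCORES = {
    "MathQA": {"BF16": 0.85, "oQ4": 0.84, "oQ3.5": 0.83},
    "MMLU-Pro": {"BF16": 0.76, "oQ4": 0.77, "oQ3.5": 0.72},
}

if __name__ == "__main__":
    for bench, by_model in SCORES.items():
        for model, acc in by_model.items():
            lo, hi = proportion_ci(acc, n=100)
            print(f"{bench:<9} {model:<6} {acc:.0%}  95% CI {lo:.1%} to {hi:.1%}")
```

At n = 100 the interval half-width is roughly 7 to 9 percentage points, so the BF16 and oQ3.5 intervals overlap on both benchmarks. Under these assumptions, the 4-point MMLU-Pro gap is within sampling noise, and a larger sample would be needed to confirm it.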

Usage

mlx_lm.generate --model mlx-community/Qwen-AgentWorld-35B-A3B-oQ3.5 \
  --system-prompt "You are a language world model simulating a Linux terminal. Given the user's command, predict the terminal output." \
  --prompt $'Action: execute_bash\nCommand: ls -la /home/user/project/' \
  --max-tokens 512 --temp 0.6

The model uses thinking mode (<think>...</think>) by default. Recommended sampling: temperature=0.6, top_p=0.95, top_k=20. See the base model card for the seven agent domains and domain-specific system prompts.
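Because the output includes a thinking block by default, downstream code usually needs to separate the reasoning from the predicted result. The sketch below is one minimal way to do that on a completed string. `split_thinking` is a hypothetical helper, and the sample string is made up for illustration. The helper also accepts text that contains only a closing `</think>` tag, on the assumption that some chat templates place the opening tag in the prompt.

```python
def split_thinking(text: str) -> tuple:
    """Split a completion into (reasoning, answer).

    Everything before the first </think> is treated as reasoning, with a
    leading <think> tag removed if present. If there is no </think> tag,
    the reasoning is empty and the whole text is the answer.
    """
    head, sep, tail = text.partition("</think>")
    if not sep:
        return "", text.strip()
    reasoning = head.replace("<think>", "", 1).strip()
    return reasoning, tail.strip()


if __name__ == "__main__":
    # Made-up sample completion, not real model output.
    sample = "<think>The user lists a project directory.</think>\ntotal 0"
    reasoning, answer = split_thinking(sample)
    print("reasoning:", reasoning)
    print("answer:", answer)
```

Truncated generations that never reach `</think>` are returned whole as the answer, so callers with a small `--max-tokens` budget may want to check for an unclosed `<think>` tag separately.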
