How to use from the
Use from the
MLX library
# Make sure mlx-vlm is installed
# pip install --upgrade mlx-vlm

from mlx_vlm import load, generate
from mlx_vlm.prompt_utils import apply_chat_template
from mlx_vlm.utils import load_config

# Load the model
model, processor = load("spicyneuron/Qwen3.5-35B-A3B-MLX-4.9bit-vision")
config = load_config("spicyneuron/Qwen3.5-35B-A3B-MLX-4.9bit-vision")

# Prepare input
image = ["http://images.cocodataset.org/val2017/000000039769.jpg"]
prompt = "Describe this image."

# Apply chat template
formatted_prompt = apply_chat_template(
    processor, config, prompt, num_images=1
)

# Generate output
output = generate(model, processor, formatted_prompt, image)
print(output)

Qwen3.5-35B-A3B optimized for MLX. This quant supports image input and requires a vision-enabled MLX server.

For the non-vision model: https://huggingface.co/spicyneuron/Qwen3.5-35B-A3B-MLX-4.8bit

EDIT: Updated chat template to enable better prompt caching.

Usage

# Start server at http://localhost:8080/chat/completions
uvx --from mlx-vlm --with torchvision \
  mlx_vlm.server \
  --host 127.0.0.1 \
  --port 8080 \
  --model spicyneuron/Qwen3.5-35B-A3B-MLX-4.9bit-vision

Methodology

Quantized using a custom script inspired by Unsloth/AesSedai/ubergarm style mixed-precision GGUFs. MLX quantization options differ than llama.cpp, but the principles are the same:

  • Sensitive layers like MoE routing, attention, and output embeddings get higher precision
  • More tolerant layers like MoE experts get lower precision
Downloads last month
27
Safetensors
Model size
6B params
Tensor type
BF16
U32
F32
MLX
Hardware compatibility
Log In to add your hardware

Quantized

Inference Providers NEW
This model isn't deployed by any Inference Provider. 馃檵 Ask for provider support

Model tree for spicyneuron/Qwen3.5-35B-A3B-MLX-4.9bit-vision

Finetuned
(127)
this model