Instructions to use andrevp/Qwen3.5-9B-Distilled-OPUS-Heretic-MLX-VLM-4bit with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- MLX
How to use andrevp/Qwen3.5-9B-Distilled-OPUS-Heretic-MLX-VLM-4bit with MLX:
# Make sure mlx-vlm is installed # pip install --upgrade mlx-vlm from mlx_vlm import load, generate from mlx_vlm.prompt_utils import apply_chat_template from mlx_vlm.utils import load_config # Load the model model, processor = load("andrevp/Qwen3.5-9B-Distilled-OPUS-Heretic-MLX-VLM-4bit") config = load_config("andrevp/Qwen3.5-9B-Distilled-OPUS-Heretic-MLX-VLM-4bit") # Prepare input image = ["http://images.cocodataset.org/val2017/000000039769.jpg"] prompt = "Describe this image." # Apply chat template formatted_prompt = apply_chat_template( processor, config, prompt, num_images=1 ) # Generate output output = generate(model, processor, formatted_prompt, image) print(output) - Notebooks
- Google Colab
- Kaggle
- Local Apps Settings
- LM Studio
- Pi
How to use andrevp/Qwen3.5-9B-Distilled-OPUS-Heretic-MLX-VLM-4bit with Pi:
Start the MLX server
# Install MLX LM: uv tool install mlx-lm # Start a local OpenAI-compatible server: mlx_lm.server --model "andrevp/Qwen3.5-9B-Distilled-OPUS-Heretic-MLX-VLM-4bit"
Configure the model in Pi
# Install Pi: npm install -g @mariozechner/pi-coding-agent # Add to ~/.pi/agent/models.json: { "providers": { "mlx-lm": { "baseUrl": "http://localhost:8080/v1", "api": "openai-completions", "apiKey": "none", "models": [ { "id": "andrevp/Qwen3.5-9B-Distilled-OPUS-Heretic-MLX-VLM-4bit" } ] } } }Run Pi
# Start Pi in your project directory: pi
- Hermes Agent new
How to use andrevp/Qwen3.5-9B-Distilled-OPUS-Heretic-MLX-VLM-4bit with Hermes Agent:
Start the MLX server
# Install MLX LM: uv tool install mlx-lm # Start a local OpenAI-compatible server: mlx_lm.server --model "andrevp/Qwen3.5-9B-Distilled-OPUS-Heretic-MLX-VLM-4bit"
Configure Hermes
# Install Hermes: curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash hermes setup # Point Hermes at the local server: hermes config set model.provider custom hermes config set model.base_url http://127.0.0.1:8080/v1 hermes config set model.default andrevp/Qwen3.5-9B-Distilled-OPUS-Heretic-MLX-VLM-4bit
Run Hermes
hermes
Qwen3.5-9B Distilled OPUS Heretic - MLX-VLM 4bit
4-bit quantized MLX-VLM conversion of an abliterated Qwen3.5-9B model distilled from Claude Opus 4.6 reasoning, optimized for Apple Silicon.
Size: ~5.6 GB | Bits/weight: 5.059 | Quality: Reduced; fastest and smallest variant
Background
This model starts from Jackrong/Qwen3.5-9B-Claude-4.6-Opus-Reasoning-Distilled, a Qwen3.5-9B base fine-tuned via knowledge distillation from Claude Opus 4.6 to replicate its chain-of-thought reasoning style.
Abliteration was applied using the technique from Arditi et al. (2024), adapted with a custom script to handle the hybrid DeltaNet/full-attention architecture. The result is a model that retains strong reasoning and vision capabilities while removing refusal behavior.
The model was then converted to MLX-VLM format and quantized to 4-bit for Apple Silicon inference. This variant offers the smallest footprint at some cost to output quality; prefer the 8-bit or fp16 variants where memory allows.
Architecture
- Type: Qwen3_5ForConditionalGeneration (multimodal)
- Layers: 32 total — 24 linear attention (DeltaNet) + 8 full attention
- Hidden size: 4096 | Intermediate size: 12288
- Vision encoder: 27-layer ViT
- Inputs: Text, images, video
Confirmed Capabilities
- Vision: Correctly describes image content
- Reasoning: Step-by-step mathematical problem solving (e.g., integration by parts)
- Uncensored: Responds to sensitive prompts without refusal
Usage
from mlx_vlm import load, generate
from mlx_vlm.prompt_utils import apply_chat_template
from mlx_vlm.utils import load_config
model_path = "andrevp/Qwen3.5-9B-Distilled-OPUS-Heretic-MLX-VLM-4bit"
model, processor = load(model_path)
config = load_config(model_path)
# Text-only
prompt = apply_chat_template(processor, config, "Your question here", num_images=0)
result = generate(model, processor, prompt, max_tokens=500)
print(result.text)
# Vision
prompt = apply_chat_template(processor, config, "Describe this image", num_images=1)
result = generate(model, processor, prompt, max_tokens=500, image=["image.jpg"])
print(result.text)
Model Family
| Model | Size | Bits/Weight | Notes |
|---|---|---|---|
| andrevp/Qwen3.5-2B-Distilled-OPUS-Heretic-MLX-VLM-fp16 | ~4 GB | 16 | 2B, best quality |
| andrevp/Qwen3.5-2B-Distilled-OPUS-Heretic-MLX-VLM-8bit | ~2.1 GB | 8 | 2B, balanced |
| andrevp/Qwen3.5-2B-Distilled-OPUS-Heretic-MLX-VLM-4bit | ~1.2 GB | 4 | 2B, smallest |
| andrevp/Qwen3.5-9B-Distilled-OPUS-Heretic-MLX-VLM-fp16 | ~18 GB | 16 | 9B, best quality |
| andrevp/Qwen3.5-9B-Distilled-OPUS-Heretic-MLX-VLM-8bit | ~9.8 GB | 8.864 | 9B, balanced |
| andrevp/Qwen3.5-9B-Distilled-OPUS-Heretic-MLX-VLM-4bit | ~5.6 GB | 5.059 | This model |
Credits
- Base distillation: Jackrong/Qwen3.5-9B-Claude-4.6-Opus-Reasoning-Distilled
- Abliteration technique: Arditi et al., "Refusal in Language Models Is Mediated by a Single Direction" (2024)
- MLX-VLM framework: Apple MLX-VLM
- Downloads last month
- 218
4-bit
Model tree for andrevp/Qwen3.5-9B-Distilled-OPUS-Heretic-MLX-VLM-4bit
Base model
Qwen/Qwen3.5-9B-Base