---
library_name: mlx
license: apache-2.0
base_model: Jackrong/Qwen3.5-9B-Claude-4.6-Opus-Reasoning-Distilled
pipeline_tag: image-text-to-text
tags:
- mlx
- mlx-vlm
- qwen3.5
- heretic
- uncensored
- abliterated
- multimodal
- vision
---

# Qwen3.5-9B Distilled OPUS Heretic - MLX-VLM 8bit

8-bit quantized MLX-VLM conversion of an abliterated Qwen3.5-9B model distilled from Claude Opus 4.6 reasoning, optimized for Apple Silicon.

**Size:** ~9.8 GB | **Bits/weight:** 8.864 | **Quality:** Good balance of quality and size

## Background

This model starts from [Jackrong/Qwen3.5-9B-Claude-4.6-Opus-Reasoning-Distilled](https://huggingface.co/Jackrong/Qwen3.5-9B-Claude-4.6-Opus-Reasoning-Distilled), a Qwen3.5-9B base fine-tuned via knowledge distillation from Claude Opus 4.6 to replicate its chain-of-thought reasoning style.

Abliteration was applied using the technique from Arditi et al. (2024), adapted with a custom script to handle the hybrid DeltaNet/full-attention architecture. The result is a model that retains strong reasoning and vision capabilities while removing refusal behavior.

The model was then converted to MLX-VLM format and quantized to 8-bit for Apple Silicon inference.

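For readers unfamiliar with the technique, the core linear-algebra step described by Arditi et al. can be sketched as follows. This is a toy illustration in pure Python, not the custom script used for this model (which is not shown here and has to handle the DeltaNet layers). All function names and the toy data are hypothetical. The sketch estimates a direction as the difference of mean activations between two prompt sets, then removes that direction from a weight matrix that writes into the residual stream.

```python
import math

def mean_vector(rows):
    """Column-wise mean of a list of equal-length vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]

def refusal_direction(harmful_acts, harmless_acts):
    """Unit vector along mean(harmful_acts) - mean(harmless_acts)."""
    mu_harmful = mean_vector(harmful_acts)
    mu_harmless = mean_vector(harmless_acts)
    diff = [a - b for a, b in zip(mu_harmful, mu_harmless)]
    norm = math.sqrt(sum(x * x for x in diff))
    return [x / norm for x in diff]

def orthogonalize(W, r):
    """Return W - r (r^T W) for W of shape (d_model, d_in), as nested lists.

    After this step, the output of W has no component along r.
    """
    d_model, d_in = len(W), len(W[0])
    # r^T W: one dot product per input column
    r_dot_cols = [sum(r[i] * W[i][j] for i in range(d_model)) for j in range(d_in)]
    return [[W[i][j] - r[i] * r_dot_cols[j] for j in range(d_in)]
            for i in range(d_model)]

# Toy residual-stream activations (hypothetical numbers, d_model = 3)
harmful = [[2.0, 0.1, 0.0], [1.8, -0.1, 0.2]]
harmless = [[0.0, 0.1, 0.0], [-0.2, -0.1, 0.2]]
r = refusal_direction(harmful, harmless)

# Toy output projection of shape (d_model=3, d_in=2)
W = [[0.5, -1.0], [0.3, 0.7], [-0.2, 0.4]]
W_ablated = orthogonalize(W, r)

# Component of each output column along r; should be numerically zero
residual = [sum(r[i] * W_ablated[i][j] for i in range(3)) for j in range(2)]
print(residual)
```

In a real model, the same projection would be applied to every matrix that writes into the residual stream, and the direction would be estimated from actual hidden states rather than hand-written vectors.
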
## Architecture

- **Type:** Qwen3_5ForConditionalGeneration (multimodal)
- **Layers:** 32 total: 24 linear attention (DeltaNet) + 8 full attention
- **Hidden size:** 4096 | **Intermediate size:** 12288
- **Vision encoder:** 27-layer ViT
- **Inputs:** Text, images, video

## Confirmed Capabilities

- **Vision:** Correctly describes image content
- **Reasoning:** Step-by-step mathematical problem solving (e.g., integration by parts)
- **Uncensored:** Responds to sensitive prompts without refusal

## Usage

```python
from mlx_vlm import load, generate
from mlx_vlm.prompt_utils import apply_chat_template
from mlx_vlm.utils import load_config

model_path = "andrevp/Qwen3.5-9B-Distilled-OPUS-Heretic-MLX-VLM-8bit"

# Load the model weights, the processor, and the model config
model, processor = load(model_path)
config = load_config(model_path)

# Text-only: num_images=0 means no image placeholder is added to the prompt
prompt = apply_chat_template(processor, config, "Your question here", num_images=0)
result = generate(model, processor, prompt, max_tokens=500)
print(result.text)

# Vision: num_images must match the number of paths passed via image=
prompt = apply_chat_template(processor, config, "Describe this image", num_images=1)
result = generate(model, processor, prompt, max_tokens=500, image=["image.jpg"])
print(result.text)
```

## Model Family

| Model | Size | Bits/Weight | Notes |
|---|---|---|---|
| [andrevp/Qwen3.5-2B-Distilled-OPUS-Heretic-MLX-VLM-fp16](https://huggingface.co/andrevp/Qwen3.5-2B-Distilled-OPUS-Heretic-MLX-VLM-fp16) | ~4 GB | 16 | 2B, best quality |
| [andrevp/Qwen3.5-2B-Distilled-OPUS-Heretic-MLX-VLM-8bit](https://huggingface.co/andrevp/Qwen3.5-2B-Distilled-OPUS-Heretic-MLX-VLM-8bit) | ~2.1 GB | 8 | 2B, balanced |
| [andrevp/Qwen3.5-2B-Distilled-OPUS-Heretic-MLX-VLM-4bit](https://huggingface.co/andrevp/Qwen3.5-2B-Distilled-OPUS-Heretic-MLX-VLM-4bit) | ~1.2 GB | 4 | 2B, smallest |
| [andrevp/Qwen3.5-9B-Distilled-OPUS-Heretic-MLX-VLM-fp16](https://huggingface.co/andrevp/Qwen3.5-9B-Distilled-OPUS-Heretic-MLX-VLM-fp16) | ~18 GB | 16 | 9B, best quality |
| **andrevp/Qwen3.5-9B-Distilled-OPUS-Heretic-MLX-VLM-8bit** | ~9.8 GB | 8.864 | **This model** |
| [andrevp/Qwen3.5-9B-Distilled-OPUS-Heretic-MLX-VLM-4bit](https://huggingface.co/andrevp/Qwen3.5-9B-Distilled-OPUS-Heretic-MLX-VLM-4bit) | ~5.6 GB | 5.059 | 9B, smallest |

## Credits

- Base distillation: [Jackrong/Qwen3.5-9B-Claude-4.6-Opus-Reasoning-Distilled](https://huggingface.co/Jackrong/Qwen3.5-9B-Claude-4.6-Opus-Reasoning-Distilled)
- Abliteration technique: Arditi et al., "Refusal in Language Models Is Mediated by a Single Direction" (2024)
- MLX-VLM framework: [MLX-VLM](https://github.com/Blaizzy/mlx-vlm)

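## Note on Bits/Weight

The non-integer Bits/Weight values in the Model Family table (8.864 and 5.059 for the 9B variants) can be sanity-checked with a little arithmetic. The sketch below assumes group-wise quantization with a group size of 64 and a 16-bit scale plus a 16-bit bias stored per group, and it assumes that sizes are reported in units of 10^9 bytes. The quantization settings actually used for this conversion are not stated in this card, and the helper names are hypothetical.

```python
def effective_bits(bits, group_size=64, scale_bits=16, bias_bits=16):
    """Bits per weight of a quantized tensor, including per-group scale and bias."""
    return bits + (scale_bits + bias_bits) / group_size

def unquantized_fraction(reported_bpw, quantized_bpw, full_bpw=16.0):
    """Fraction f of weights left at full_bpw, solving
    reported = (1 - f) * quantized + f * full."""
    return (reported_bpw - quantized_bpw) / (full_bpw - quantized_bpw)

def implied_params_billions(size_gb, bpw):
    """Parameter count (in billions) implied by a file size and bits per weight."""
    return size_gb * 8 / bpw

# Values taken from the Model Family table (9B rows)
f_8bit = unquantized_fraction(8.864, effective_bits(8))
f_4bit = unquantized_fraction(5.059, effective_bits(4))
print(f"implied 16-bit fraction: 8bit={f_8bit:.4f}, 4bit={f_4bit:.4f}")

for name, size_gb, bpw in [("fp16", 18.0, 16.0), ("8bit", 9.8, 8.864), ("4bit", 5.6, 5.059)]:
    print(name, round(implied_params_billions(size_gb, bpw), 2), "B params")
```

Under these assumptions, both quantized 9B rows imply that roughly 4.9% of the weights remain at 16 bits. That would be consistent with some tensors being left unquantized, but this card does not confirm which ones. The three 9B sizes also imply a parameter count between about 8.8 and 9.0 billion, so the Size and Bits/Weight columns agree with each other.
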