metadata
library_name: mlx
license: apache-2.0
base_model: Jackrong/Qwen3.5-9B-Claude-4.6-Opus-Reasoning-Distilled
pipeline_tag: image-text-to-text
tags:
  - mlx
  - mlx-vlm
  - qwen3.5
  - heretic
  - uncensored
  - abliterated
  - multimodal
  - vision

Qwen3.5-9B Distilled OPUS Heretic - MLX-VLM 8bit

8-bit quantized MLX-VLM conversion of an abliterated Qwen3.5-9B model distilled from Claude Opus 4.6 reasoning, optimized for Apple Silicon.

Size: ~9.8 GB | Bits/weight: 8.864 | Quality: balanced; roughly half the size of the fp16 variant (~18 GB)
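The reported 8.864 bits/weight is above the nominal 8 bits. A rough sketch of why this can happen is below. It assumes MLX-style affine group quantization with a group size of 64 and a 16-bit scale and bias stored per group; the actual conversion settings are not stated in this card, so those values and the helper names are illustrative assumptions.

```python
def effective_bits(bits: int, group_size: int = 64, meta_bits: int = 16) -> float:
    """Average bits per weight for affine group quantization.

    Each group of `group_size` weights also stores one scale and one bias,
    each `meta_bits` wide (assumed values, not taken from this model card).
    """
    return bits + 2 * meta_bits / group_size


def implied_params_billions(size_gb: float, bits_per_weight: float) -> float:
    """Parameter count (billions) implied by a file size and an average
    bits/weight figure, taking 1 GB = 1e9 bytes."""
    return size_gb * 8 / bits_per_weight


# Overhead of the quantized layers alone under the assumed settings
print(effective_bits(8))

# Parameter count implied by the figures quoted above (~9.8 GB at 8.864 bits/weight)
print(round(implied_params_billions(9.8, 8.864), 2))
```

Under these assumptions the quantized layers alone would average 8.5 bits/weight; any remaining gap to 8.864 would come from tensors kept at higher precision.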

Background

This model starts from Jackrong/Qwen3.5-9B-Claude-4.6-Opus-Reasoning-Distilled, a Qwen3.5-9B base fine-tuned via knowledge distillation from Claude Opus 4.6 to replicate its chain-of-thought reasoning style.

Abliteration was applied using the technique from Arditi et al. (2024), adapted with a custom script to handle the hybrid DeltaNet/full-attention architecture. The result is a model that retains strong reasoning and vision capabilities while removing refusal behavior.
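The custom script is not reproduced here. As a minimal sketch of the general idea in Arditi et al. (2024), the following estimates a "refusal direction" as the normalized difference of mean activations on harmful versus harmless prompts, then removes that direction from a weight matrix that writes into the residual stream (W' = W − r rᵀ W). The function names and the toy data are hypothetical, and plain Python lists stand in for real tensors.

```python
import math


def mean_vector(rows):
    """Column-wise mean of a list of equal-length activation vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def refusal_direction(harmful_acts, harmless_acts):
    """Unit-norm difference of mean activations (harmful minus harmless)."""
    diff = [a - b for a, b in zip(mean_vector(harmful_acts), mean_vector(harmless_acts))]
    norm = math.sqrt(sum(x * x for x in diff))
    return [x / norm for x in diff]


def ablate_output_weights(W, r):
    """Return W - r r^T W, so that W' x has no component along r.

    W is a d_model x d_in matrix (rows index the residual-stream dimension)
    and r is a unit vector of length d_model.
    """
    d_model, d_in = len(W), len(W[0])
    r_t_w = [sum(r[i] * W[i][j] for i in range(d_model)) for j in range(d_in)]
    return [[W[i][j] - r[i] * r_t_w[j] for j in range(d_in)] for i in range(d_model)]


# Toy example with 2-dimensional "activations" (made-up numbers)
r = refusal_direction([[2.0, 0.0], [4.0, 0.0]], [[1.0, 0.0], [1.0, 0.0]])
W_ablated = ablate_output_weights([[1.0, 2.0], [3.0, 4.0]], r)
print(r, W_ablated)
```

In a real model the same projection would be applied to every matrix that writes to the residual stream; how the hybrid DeltaNet layers are handled is specific to the author's script and is not shown here.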

The model was then converted to MLX-VLM format and quantized to 8-bit for Apple Silicon inference.

Architecture

  • Type: Qwen3_5ForConditionalGeneration (multimodal)
  • Layers: 32 total — 24 linear attention (DeltaNet) + 8 full attention
  • Hidden size: 4096 | Intermediate size: 12288
  • Vision encoder: 27-layer ViT
  • Inputs: Text, images, video
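The list above gives the layer counts but not their ordering. One interleaving consistent with the 24 + 8 split places a full-attention layer after every three linear-attention layers. The sketch below builds that layout; the interval of 4 and the type labels are assumptions, and the model's own config file is authoritative.

```python
NUM_LAYERS = 32
FULL_ATTENTION_INTERVAL = 4  # assumption: every 4th layer uses full attention

# Label each layer by its assumed attention type
layer_types = [
    "full_attention" if (i + 1) % FULL_ATTENTION_INTERVAL == 0 else "linear_attention"
    for i in range(NUM_LAYERS)
]

# Counts match the figures listed above: 24 linear (DeltaNet) + 8 full
print(layer_types.count("linear_attention"), layer_types.count("full_attention"))
```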

Confirmed Capabilities

  • Vision: Correctly describes image content
  • Reasoning: Step-by-step mathematical problem solving (e.g., integration by parts)
  • Uncensored: Responds to sensitive prompts without refusal

Usage

from mlx_vlm import load, generate
from mlx_vlm.prompt_utils import apply_chat_template
from mlx_vlm.utils import load_config

# Load the quantized weights, the processor, and the model config
model_path = "andrevp/Qwen3.5-9B-Distilled-OPUS-Heretic-MLX-VLM-8bit"
model, processor = load(model_path)
config = load_config(model_path)

# Text-only: format the prompt with no image placeholders
prompt = apply_chat_template(processor, config, "Your question here", num_images=0)
result = generate(model, processor, prompt, max_tokens=500)
print(result.text)

# Vision: reserve one image slot in the prompt and pass the image path
prompt = apply_chat_template(processor, config, "Describe this image", num_images=1)
result = generate(model, processor, prompt, max_tokens=500, image=["image.jpg"])
print(result.text)

Model Family

| Model | Size | Bits/Weight | Notes |
|---|---|---|---|
| andrevp/Qwen3.5-2B-Distilled-OPUS-Heretic-MLX-VLM-fp16 | ~4 GB | 16 | 2B, best quality |
| andrevp/Qwen3.5-2B-Distilled-OPUS-Heretic-MLX-VLM-8bit | ~2.1 GB | 8 | 2B, balanced |
| andrevp/Qwen3.5-2B-Distilled-OPUS-Heretic-MLX-VLM-4bit | ~1.2 GB | 4 | 2B, smallest |
| andrevp/Qwen3.5-9B-Distilled-OPUS-Heretic-MLX-VLM-fp16 | ~18 GB | 16 | 9B, best quality |
| andrevp/Qwen3.5-9B-Distilled-OPUS-Heretic-MLX-VLM-8bit | ~9.8 GB | 8.864 | This model |
| andrevp/Qwen3.5-9B-Distilled-OPUS-Heretic-MLX-VLM-4bit | ~5.6 GB | 5.059 | 9B, smallest |
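To pick a variant for a given machine, one simple approach is to take the largest listed model that fits in unified memory after reserving some headroom. The sketch below uses the approximate sizes listed above; the 4 GB headroom default and the helper name are hypothetical, and file size is used only as a rough proxy for quality.

```python
# (repo id, approximate size in GB), copied from the model family list
VARIANTS = [
    ("andrevp/Qwen3.5-2B-Distilled-OPUS-Heretic-MLX-VLM-fp16", 4.0),
    ("andrevp/Qwen3.5-2B-Distilled-OPUS-Heretic-MLX-VLM-8bit", 2.1),
    ("andrevp/Qwen3.5-2B-Distilled-OPUS-Heretic-MLX-VLM-4bit", 1.2),
    ("andrevp/Qwen3.5-9B-Distilled-OPUS-Heretic-MLX-VLM-fp16", 18.0),
    ("andrevp/Qwen3.5-9B-Distilled-OPUS-Heretic-MLX-VLM-8bit", 9.8),
    ("andrevp/Qwen3.5-9B-Distilled-OPUS-Heretic-MLX-VLM-4bit", 5.6),
]


def largest_fitting(memory_gb: float, headroom_gb: float = 4.0):
    """Return the repo id of the largest variant whose weights fit within
    memory_gb minus headroom_gb, or None if nothing fits.

    The headroom (for the OS, KV cache, and image inputs) is an assumed
    value; adjust it for the actual workload.
    """
    budget = memory_gb - headroom_gb
    fitting = [(name, size) for name, size in VARIANTS if size <= budget]
    if not fitting:
        return None
    return max(fitting, key=lambda v: v[1])[0]


print(largest_fitting(16))  # weights budget of 12 GB
print(largest_fitting(8))   # weights budget of 4 GB
```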

Credits