library_name: mlx
license: apache-2.0
base_model: Jackrong/Qwen3.5-9B-Claude-4.6-Opus-Reasoning-Distilled
pipeline_tag: image-text-to-text
tags:
  - mlx
  - mlx-vlm
  - qwen3.5
  - heretic
  - uncensored
  - abliterated
  - multimodal
  - vision

Qwen3.5-9B Distilled OPUS Heretic - MLX-VLM 8-bit

8-bit quantized MLX-VLM conversion of an abliterated Qwen3.5-9B model distilled from Claude Opus 4.6 reasoning, optimized for Apple Silicon.

Size: ~9.8 GB | Bits/weight: 8.864 | Trade-off: middle option between fp16 quality and 4-bit size
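
The reported 8.864 bits/weight is above the nominal 8 because group-wise quantization stores extra parameters for each group of weights. The following is a rough sketch of that arithmetic. It assumes MLX-style affine quantization with a group size of 64 and one 16-bit scale plus one 16-bit bias per group; those settings are assumptions, since this card does not state the conversion options, and the helper names are illustrative.

```python
def effective_bits(bits: int, group_size: int = 64, meta_bits: int = 16) -> float:
    """Bits per weight for group-wise affine quantization.

    Each group of `group_size` weights shares one scale and one bias,
    each stored in `meta_bits` bits (assumed values, not from this card).
    """
    return bits + 2 * meta_bits / group_size


def approx_param_count(size_gb: float, bits_per_weight: float) -> float:
    """Rough parameter count implied by a file size (1 GB = 1e9 bytes assumed)."""
    return size_gb * 1e9 * 8 / bits_per_weight


print(effective_bits(8))  # 8.5
print(effective_bits(4))  # 4.5
# Parameter count implied by this model's size and bits/weight, in billions
print(round(approx_param_count(9.8, 8.864) / 1e9, 2))
```

Under these assumptions a fully quantized 8-bit model would come out at 8.5 bits/weight. The remaining gap to 8.864 would be consistent with some tensors being kept at higher precision, but the card does not say which, so treat that as a guess.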

Background

This model starts from Jackrong/Qwen3.5-9B-Claude-4.6-Opus-Reasoning-Distilled, a Qwen3.5-9B base fine-tuned via knowledge distillation from Claude Opus 4.6 to replicate its chain-of-thought reasoning style.

Abliteration was applied using the technique from Arditi et al. (2024), adapted with a custom script to handle the hybrid DeltaNet/full-attention architecture. The result is a model that retains strong reasoning and vision capabilities while removing refusal behavior.
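
For readers unfamiliar with the technique, the core idea in Arditi et al. can be sketched in a few lines: estimate a "refusal direction" as the difference between mean activations on harmful and harmless prompts, then remove each activation's component along that direction. The sketch below uses dependency-free Python on toy vectors. It is not the custom script used for this model, and the function names and sample data are hypothetical.

```python
import math

Vector = list[float]


def mean(vectors: list[Vector]) -> Vector:
    """Element-wise mean of equally sized vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def refusal_direction(harmful: list[Vector], harmless: list[Vector]) -> Vector:
    """Unit-norm difference of means between the two activation sets."""
    diff = [a - b for a, b in zip(mean(harmful), mean(harmless))]
    norm = math.sqrt(sum(d * d for d in diff))
    return [d / norm for d in diff]


def ablate(x: Vector, r_hat: Vector) -> Vector:
    """Remove the component of x along r_hat: x - (x . r_hat) * r_hat."""
    projection = sum(a * b for a, b in zip(x, r_hat))
    return [a - projection * b for a, b in zip(x, r_hat)]


# Toy activations (made up): the two sets differ only along the first axis.
harmful = [[2.0, 1.0, 0.0], [4.0, 1.0, 0.0]]
harmless = [[0.0, 1.0, 0.0], [0.0, 1.0, 0.0]]

r_hat = refusal_direction(harmful, harmless)
x_ablated = ablate([5.0, 2.0, -1.0], r_hat)
print(r_hat)      # [1.0, 0.0, 0.0]
print(x_ablated)  # [0.0, 2.0, -1.0]
```

In practice the same projection is usually folded into the weight matrices that write to the residual stream, so no runtime hook is needed. How that was adapted to the DeltaNet layers here is not described in this card.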

The model was then converted to MLX-VLM format and quantized to 8-bit for Apple Silicon inference.

Architecture

Type: Qwen3_5ForConditionalGeneration (multimodal)
Layers: 32 total — 24 linear attention (DeltaNet) + 8 full attention
Hidden size: 4096 | Intermediate size: 12288
Vision encoder: 27-layer ViT
Inputs: Text, images, video
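
The 24 + 8 split is what you would get if every fourth layer used full attention. The sketch below builds such a layout. The interleaving interval is an assumption: this card gives only the counts, so check the model's config for the actual per-layer types.

```python
NUM_LAYERS = 32
FULL_ATTENTION_INTERVAL = 4  # assumption: every 4th layer is full attention


def layer_types(num_layers: int, interval: int) -> list[str]:
    """Label each layer, making every `interval`-th one full attention."""
    return [
        "full_attention" if (i + 1) % interval == 0 else "linear_attention"
        for i in range(num_layers)
    ]


layout = layer_types(NUM_LAYERS, FULL_ATTENTION_INTERVAL)
print(layout.count("linear_attention"), layout.count("full_attention"))  # 24 8
print(layout[:4])
```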

Confirmed Capabilities

Vision: Correctly describes image content
Reasoning: Step-by-step mathematical problem solving (e.g., integration by parts)
Uncensored: Responds to sensitive prompts without refusal

Usage

from mlx_vlm import load, generate
from mlx_vlm.prompt_utils import apply_chat_template
from mlx_vlm.utils import load_config

model_path = "andrevp/Qwen3.5-9B-Distilled-OPUS-Heretic-MLX-VLM-8bit"
model, processor = load(model_path)
config = load_config(model_path)

# Text-only: num_images=0 keeps image placeholders out of the prompt
prompt = apply_chat_template(processor, config, "Your question here", num_images=0)
result = generate(model, processor, prompt, max_tokens=500)
print(result.text)

# Vision: num_images must match the number of paths passed via image=
prompt = apply_chat_template(processor, config, "Describe this image", num_images=1)
result = generate(model, processor, prompt, max_tokens=500, image=["image.jpg"])
print(result.text)

Model Family

Model	Size	Bits/Weight	Notes
andrevp/Qwen3.5-2B-Distilled-OPUS-Heretic-MLX-VLM-fp16	~4 GB	16	2B, best quality
andrevp/Qwen3.5-2B-Distilled-OPUS-Heretic-MLX-VLM-8bit	~2.1 GB	8	2B, balanced
andrevp/Qwen3.5-2B-Distilled-OPUS-Heretic-MLX-VLM-4bit	~1.2 GB	4	2B, smallest
andrevp/Qwen3.5-9B-Distilled-OPUS-Heretic-MLX-VLM-fp16	~18 GB	16	9B, best quality
andrevp/Qwen3.5-9B-Distilled-OPUS-Heretic-MLX-VLM-8bit	~9.8 GB	8.864	9B, balanced (this model)
andrevp/Qwen3.5-9B-Distilled-OPUS-Heretic-MLX-VLM-4bit	~5.6 GB	5.059	9B, smallest

Credits

Base distillation: Jackrong/Qwen3.5-9B-Claude-4.6-Opus-Reasoning-Distilled
Abliteration technique: Arditi et al., "Refusal in Language Models Is Mediated by a Single Direction" (2024)
MLX-VLM framework: MLX-VLM (Blaizzy/mlx-vlm), built on Apple's MLX