Qwen3-VL-Embedding-8B — 8-bit MLX

Qwen3-VL-Embedding-8B converted to 8-bit quantized MLX format for Apple Silicon. Original model: Qwen/Qwen3-VL-Embedding-8B

| Property | Value |
|---|---|
| Base model | Qwen/Qwen3-VL-Embedding-8B |
| Quantization | 8-bit affine, group_size=64 |
| Model size | ~9.2 GB |
| Embedding dim | 4096 |
| License | Apache 2.0 |
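
The quantization settings and the model size can be related with a rough back-of-the-envelope estimate. The sketch below assumes that each group of 64 weights carries one 16-bit scale and one 16-bit bias in addition to the 8-bit codes, and it uses a round 8 billion parameters as a hypothetical count (the exact count is not stated here). The helper names are illustrative, not part of any library.

```python
def bits_per_weight(bits: int, group_size: int, meta_bits: int = 16) -> float:
    """Effective storage per weight: the code itself plus a share of the
    per-group scale and bias (assumed to be 16-bit values each)."""
    return bits + 2 * meta_bits / group_size


def quantized_size_gb(num_params: int, bits: int = 8, group_size: int = 64) -> float:
    """Approximate size in GB (10^9 bytes) if every parameter were quantized."""
    return num_params * bits_per_weight(bits, group_size) / 8 / 1e9


# Hypothetical round figure; the real parameter count may differ.
ASSUMED_PARAMS = 8_000_000_000

print(bits_per_weight(8, 64))             # effective bits per weight
print(quantized_size_gb(ASSUMED_PARAMS))  # estimated size in GB
```

Under these assumptions the estimate lands somewhat below the reported ~9.2 GB. Any tensors kept at higher precision would add to it, so treat the number as an order-of-magnitude check only.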

Usage

Install dependencies:

pip install mlx-embeddings mlx-vlm torch torchvision

Note: torch and torchvision are required only for image preprocessing (via transformers.AutoImageProcessor). The model itself runs entirely on MLX.
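
Before running the example, it can help to confirm which of these packages are actually installed, since the second patch described below depends on the transformers version. A minimal sketch using only the standard library follows; the helper name installed_versions is illustrative, and the distribution names are taken from the pip command above (plus transformers).

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if absent."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


report = installed_versions(
    ["mlx-embeddings", "mlx-vlm", "torch", "torchvision", "transformers"]
)
for name, ver in report.items():
    print(f"{name}: {ver or 'not installed'}")
```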

Known issues with mlx-embeddings / mlx-vlm

Two patches are currently required:

  1. model.load_weights() raises ValueError ("Missing 1 parameters: language_model.lm_head.weight") because embedding models do not have an lm_head. Fix: load with strict=False.
  2. transformers >= 4.50 expects the processor to have image_ids, audio_ids, and video_ids attributes, which mlx-embeddings does not set. Fix: set them manually after loading.
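
One way to keep the first patch from leaking into unrelated code is to apply it only for the duration of the load. The sketch below is one possible approach, not something mlx-embeddings provides: non_strict_load_weights is a hypothetical helper, and FakeModule is a stand-in for mlx.nn.Module so the snippet runs without MLX installed. Note that it only changes the default value of strict, so a caller that passes strict=True explicitly would still get strict loading.

```python
import functools
from contextlib import contextmanager


@contextmanager
def non_strict_load_weights(cls):
    """Temporarily make cls.load_weights default to strict=False."""
    original = cls.load_weights

    @functools.wraps(original)
    def relaxed(self, weights, strict=False):
        return original(self, weights, strict=strict)

    cls.load_weights = relaxed
    try:
        yield
    finally:
        # Always restore the original method, even if loading fails.
        cls.load_weights = original


class FakeModule:
    """Stand-in for mlx.nn.Module, used only to demonstrate the helper."""

    def load_weights(self, weights, strict=True):
        if strict and "lm_head.weight" not in dict(weights):
            raise ValueError("Missing 1 parameters: lm_head.weight")
        return self


module = FakeModule()
with non_strict_load_weights(FakeModule):
    module.load_weights([("embed.weight", 0)])  # succeeds inside the block
```

With MLX installed, the same helper would presumably be used as `with non_strict_load_weights(nn.Module): model, processor = load(...)`.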

Full working example

import mlx.core as mx
import mlx.nn as nn
import numpy as np
from mlx_embeddings import load

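# Patch 1: allow the missing lm_head weight by defaulting strict to False
_orig_lw = nn.Module.load_weights
def _lw(self, w, strict=False):
    return _orig_lw(self, w, strict=strict)
nn.Module.load_weights = _lw

model, processor = load("nkamiy/Qwen3-VL-Embedding-8B-8bit-mlx")

# Restore the original method so later loads are strict again
nn.Module.load_weights = _orig_lw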

# Patch 2: missing processor attributes (transformers >= 4.50)
inner = getattr(processor, "processor", processor)
if not hasattr(inner, "image_ids"):
    inner.image_ids = [getattr(inner, "image_token_id", None)]
if not hasattr(inner, "video_ids"):
    inner.video_ids = [getattr(inner, "video_token_id", None)]
if not hasattr(inner, "audio_ids"):
    inner.audio_ids = [None]

# Embed text, image, or both
inputs = [
    {"text": "a man arguing with a plant",
     "instruction": "Retrieve images or text relevant to the user's query."},
    {"text": "a comedic scene in a flower shop"},
    {"image": "/path/to/thumbnail.png", "text": "dialogue here"},
]

embeddings = model.process(inputs, processor=processor)
mx.eval(embeddings)

arr = np.array(embeddings.astype(mx.float32))
# arr.shape == (3, 4096), L2-normalized
# Rows are unit length, so the dot product is the cosine similarity
similarity = arr @ arr.T
print(similarity)
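
For retrieval, the similarity matrix is usually reduced to a ranking of candidates against one query. The sketch below shows the idea in plain Python on toy 3-dimensional vectors; l2_normalize and rank_by_similarity are illustrative helpers, and the vectors are made up rather than real model outputs. With the arr array from the example above, the scores for the first input against the others would be arr[1:] @ arr[0].

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def rank_by_similarity(query, candidates):
    """Return (index, score) pairs sorted by descending dot product.

    For unit-length vectors the dot product equals the cosine similarity.
    """
    scores = [sum(q * c for q, c in zip(query, cand)) for cand in candidates]
    order = sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)
    return [(i, scores[i]) for i in order]


# Toy data standing in for real 4096-dimensional embeddings.
query = l2_normalize([1.0, 0.0, 0.0])
candidates = [
    l2_normalize(v) for v in ([0.0, 1.0, 0.0], [1.0, 1.0, 0.0], [2.0, 0.1, 0.0])
]

ranking = rank_by_similarity(query, candidates)
for index, score in ranking:
    print(index, round(score, 3))
```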

Conversion

Converted from the original Hugging Face weights using mlx_vlm.convert, with load_weights patched to strict=False (see patch 1 above):

python -m mlx_vlm convert \
  --hf-path Qwen/Qwen3-VL-Embedding-8B \
  --mlx-path ./Qwen3-VL-Embedding-8B-8bit-mlx \
  --quantize --q-bits 8 --q-group-size 64 --q-mode affine
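
To give a sense of what the 8-bit affine setting does to a group of weights, here is a simplified pure-Python round trip. It assumes each group stores one scale and one bias (the group minimum) and maps weights to integer codes in 0..255. The actual MLX kernels, storage layout, and rounding details may differ, and quantize_group and dequantize_group are illustrative names only.

```python
def quantize_group(weights, bits=8):
    """Map one group of floats to integer codes plus a (scale, bias) pair."""
    lo, hi = min(weights), max(weights)
    levels = (1 << bits) - 1            # 255 for 8 bits
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = [round((w - lo) / scale) for w in weights]
    return codes, scale, lo


def dequantize_group(codes, scale, bias):
    """Reconstruct approximate weights as scale * code + bias."""
    return [scale * q + bias for q in codes]


# A toy group of 64 evenly spaced weights (group_size=64 as in the command above).
group = [-1.0 + 2.0 * i / 63 for i in range(64)]

codes, scale, bias = quantize_group(group)
restored = dequantize_group(codes, scale, bias)
max_error = max(abs(a - b) for a, b in zip(group, restored))
print(max_error <= scale / 2 + 1e-12)
```

The reconstruction error of each weight is bounded by half a quantization step, which is why a smaller group size (fewer weights sharing one scale) tends to reduce the error at the cost of more metadata.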

License

Apache 2.0, the same license as the original Qwen/Qwen3-VL-Embedding-8B.
