library_name: mlx-embeddings
tags:
  - mlx
  - mlx-embeddings
  - embeddings
  - sentence-similarity
  - feature-extraction
  - quantized
  - 4bit
  - qwen
  - qwen3
  - qwen3-embedding
base_model: Qwen/Qwen3-Embedding-4B
license: apache-2.0
pipeline_tag: feature-extraction
language:
  - en
  - zh
  - multilingual

Qwen3-Embedding-4B MLX 4-bit

MLX 4-bit quantization of Qwen/Qwen3-Embedding-4B, produced with mlx-embeddings on Apple Silicon.

What is this?

Qwen3-Embedding is a text embedding model from the Qwen3 family. It is built on a decoder-only LLM backbone and uses last-token pooling: the final hidden state of the last non-padding token becomes the dense vector for the whole input. It scores near the top of the MMTEB multilingual benchmark and is released under the Apache-2.0 license.
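To make last-token pooling concrete, here is a minimal sketch in plain Python. It is not the mlx-embeddings implementation: the helper names `last_token_pool` and `l2_normalize` and the toy values are assumptions chosen for illustration, and the real model does this on tensors of size 2560 rather than on nested lists.

```python
import math

def last_token_pool(hidden_states, attention_mask):
    """Return the hidden state of the last non-padding token of each sequence."""
    pooled = []
    for states, mask in zip(hidden_states, attention_mask):
        # Last position whose mask is 1; this works for left or right padding.
        last = max(i for i, m in enumerate(mask) if m == 1)
        pooled.append(states[last])
    return pooled

def l2_normalize(vec):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

# Toy batch (hypothetical values): 2 sequences, 3 positions, hidden size 2.
hidden_states = [
    [[1.0, 0.0], [0.0, 2.0], [3.0, 4.0]],  # no padding
    [[0.0, 5.0], [6.0, 8.0], [9.0, 9.0]],  # last position is padding
]
attention_mask = [[1, 1, 1], [1, 1, 0]]

embeddings = [l2_normalize(v) for v in last_token_pool(hidden_states, attention_mask)]
print(embeddings)
```

The second sequence ignores its padded position and pools from index 1, so padding does not leak into the embedding.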

Quantization

  • Method: MLX affine quantization (mlx_embeddings.convert), group_size=64
  • Bits per weight: 4
  • Output size: 2.1 GB (vs ~7.5 GB for bf16 source)
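The sizes above can be sanity-checked with a rough sketch of group-wise affine quantization. Several things here are assumptions rather than facts from this card: each group of 64 weights is taken to store one 16-bit scale and one 16-bit bias, every one of a nominal 4.0e9 parameters is treated as quantized, scale and bias are derived from the group's min and max, and "GB" is read as GiB. The function names are hypothetical and the real MLX kernels pack values differently.

```python
def quantize_group(weights, bits=4):
    """Affine-quantize one group so that w ~= scale * q + bias, q in [0, 2**bits - 1]."""
    levels = 2**bits - 1
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / levels or 1.0  # fall back to 1.0 for a constant group
    q = [round((w - lo) / scale) for w in weights]
    return q, scale, lo

def dequantize_group(q, scale, bias):
    """Reconstruct approximate weights from integer codes."""
    return [scale * v + bias for v in q]

def quantized_size_gib(n_params, bits=4, group_size=64, scale_bits=16, bias_bits=16):
    """Estimate storage when each group shares one scale and one bias."""
    bits_per_weight = bits + (scale_bits + bias_bits) / group_size
    return n_params * bits_per_weight / 8 / 2**30

group = [-1.0, -0.5, 0.0, 0.5]  # toy group (real groups hold 64 weights)
q, scale, bias = quantize_group(group)
print("codes:", q)
print("restored:", dequantize_group(q, scale, bias))

n_params = 4.0e9  # nominal parameter count
print(f"4-bit estimate: {quantized_size_gib(n_params):.2f} GiB")
print(f"bf16 estimate:  {n_params * 16 / 8 / 2**30:.2f} GiB")
```

Under these assumptions the per-group scale and bias add 32/64 = 0.5 bits per weight, so the effective cost is about 4.5 bits per weight rather than 4.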

Quickstart

from mlx_embeddings import load

model, tokenizer = load("majentik/Qwen3-Embedding-4B-MLX-4bit")

inputs = tokenizer(
    ["What is the capital of France?", "Paris is the capital of France."],
    padding=True, truncation=True, return_tensors="mlx"
)
outputs = model(inputs["input_ids"], attention_mask=inputs["attention_mask"])
embeddings = outputs.text_embeds  # already L2-normalised, shape [batch, dim]

For sentence similarity:

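# Continues from the Quickstart above: `embeddings` has shape [batch, dim].
# The rows are L2-normalised, so a dot product is the cosine similarity.
query = embeddings[0]
others = embeddings[1:]

# One score per remaining input, each in [-1, 1].
scores = (query @ others.T).tolist()
print(scores)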
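The same idea extends to ranking several documents against one query. The sketch below uses plain Python lists and made-up unit vectors so it runs without MLX; the `dot` and `rank` helpers are hypothetical names, not part of mlx-embeddings.

```python
def dot(a, b):
    """Dot product; equals cosine similarity when both vectors have unit length."""
    return sum(x * y for x, y in zip(a, b))

def rank(query_vec, doc_vecs):
    """Return (document index, score) pairs sorted from most to least similar."""
    scores = [dot(query_vec, d) for d in doc_vecs]
    order = sorted(range(len(doc_vecs)), key=lambda i: scores[i], reverse=True)
    return [(i, scores[i]) for i in order]

# Toy unit vectors (hypothetical values standing in for real embeddings).
query = [0.6, 0.8]
docs = [
    [0.8, 0.6],    # close to the query
    [0.6, 0.8],    # identical direction
    [-0.6, -0.8],  # opposite direction
]

ranking = rank(query, docs)
for index, score in ranking:
    print(index, round(score, 3))
```

With real embeddings, pass `embeddings.tolist()` rows in place of the toy vectors, or keep everything in MLX and sort the score array directly.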

Model Specifications

| Property | Value |
|---|---|
| Base Model | Qwen/Qwen3-Embedding-4B |
| Architecture | Decoder-only (Qwen3ForCausalLM) with last-token pooling |
| Parameters | 4B (pre-quantization) |
| Context Length | 32K tokens |
| Embedding Dim | 2560 |
| BF16 Size | ~7.5 GB |
| License | apache-2.0 |
| Languages | 100+ (multilingual) |

License

Apache 2.0, inherited from the upstream Qwen3-Embedding model. Free for research and commercial use.

See also