Qwen3-Embedding-4B MLX 4-bit

MLX 4-bit quantization of Qwen/Qwen3-Embedding-4B, produced with mlx-embeddings on Apple Silicon.

What is this?

Qwen3-Embedding is a decoder-only, LLM-style text embedding model from the Qwen3 family. It uses last-token pooling to produce dense vector representations, ranks near the top of the MMTEB multilingual benchmark, and is released under the Apache-2.0 license.
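
As a rough illustration of last-token pooling, the sketch below picks each sequence's final non-padding hidden state and L2-normalises it. It is a plain-Python toy, not the model's actual implementation: the helper names `last_token_pool` and `l2_normalize` and the tiny 2-dimensional hidden states are assumptions made for the example.

```python
import math

def last_token_pool(hidden_states, attention_mask):
    """Return the hidden state of the last non-padding token per sequence.

    hidden_states: list[batch] of list[seq] of list[dim] floats
    attention_mask: list[batch] of list[seq] of 0/1 ints
    """
    pooled = []
    for states, mask in zip(hidden_states, attention_mask):
        # Highest position whose mask is 1; this handles both
        # left-padded and right-padded sequences.
        last = max(i for i, m in enumerate(mask) if m == 1)
        pooled.append(states[last])
    return pooled

def l2_normalize(vec):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

# Toy batch: 2 sequences, 3 positions, 2-dim hidden states (made-up values).
hidden = [
    [[1.0, 0.0], [3.0, 4.0], [9.0, 9.0]],  # right-padded, last real token at index 1
    [[9.0, 9.0], [0.0, 1.0], [0.0, 2.0]],  # left-padded, last real token at index 2
]
mask = [[1, 1, 0], [0, 1, 1]]

embeddings = [l2_normalize(v) for v in last_token_pool(hidden, mask)]
print(embeddings)
```

Because a decoder-only model attends causally, the last real token is the only position that has seen the whole input, which is why its hidden state serves as the sentence representation.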

Quantization

  • Method: MLX affine quantization (mlx_embeddings.convert), group_size=64
  • Bits per weight: 4
  • Output size: 2.1 GB (vs ~7.5 GB for bf16 source)
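
To make the settings above concrete, here is a minimal plain-Python sketch of group-wise affine quantization, where each group of weights is stored as small integers plus one scale and one bias. It uses a simplified min/max scheme, so MLX's own kernels may choose scales and biases differently. The function names are hypothetical, and the 16-bit width for each scale and bias is an assumption.

```python
def quantize_group(weights, bits=4):
    """Affine-quantize one group so that w ~= scale * q + bias, q in [0, 2**bits - 1]."""
    levels = (1 << bits) - 1
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / levels
    if scale == 0.0:
        scale = 1.0  # constant group: any non-zero scale works
    q = [round((w - lo) / scale) for w in weights]
    return q, scale, lo

def dequantize_group(q, scale, bias):
    """Reconstruct approximate weights from integer codes."""
    return [scale * v + bias for v in q]

def effective_bits(bits=4, group_size=64, meta_bits=16):
    """Bits per weight including one scale and one bias per group.

    meta_bits=16 is an assumption about how scales and biases are stored.
    """
    return bits + 2 * meta_bits / group_size

# One group of 64 evenly spaced toy weights in [-1, 1].
group = [-1.0 + 2.0 * i / 63 for i in range(64)]
q, scale, bias = quantize_group(group)
restored = dequantize_group(q, scale, bias)
max_err = max(abs(a - b) for a, b in zip(group, restored))

# Estimated size if every bf16 weight (16 bits) were quantized this way.
estimated_gb = 7.5 * effective_bits() / 16

print(f"max abs error: {max_err:.4f} (bound: scale / 2 = {scale / 2:.4f})")
print(f"effective bits per weight: {effective_bits()}")
print(f"estimated size: {estimated_gb:.2f} GB")
```

Under these assumptions each weight costs 4.5 bits rather than 4, and 7.5 GB × 4.5 / 16 comes to roughly 2.1 GB, which is consistent with the output size listed above.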

Quickstart

from mlx_embeddings import load

model, tokenizer = load("majentik/Qwen3-Embedding-4B-MLX-4bit")

inputs = tokenizer(
    ["What is the capital of France?", "Paris is the capital of France."],
    padding=True, truncation=True, return_tensors="mlx"
)
outputs = model(inputs["input_ids"], attention_mask=inputs["attention_mask"])
embeddings = outputs.text_embeds  # already L2-normalised, shape [batch, dim]

For sentence similarity:

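# Embeddings are L2-normalised, so a dot product equals cosine similarity.
query, docs = embeddings[0], embeddings[1:]
scores = (query @ docs.T).tolist()  # one score per remaining sentence
print(scores)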
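
The same idea extends to ranking several candidates against one query. The sketch below is a plain-Python stand-in that uses made-up unit vectors instead of real model output. `rank_by_similarity` is a hypothetical helper, not part of mlx-embeddings.

```python
def rank_by_similarity(query_vec, doc_vecs):
    """Return (doc_index, score) pairs sorted by descending dot product.

    Assumes all vectors are already L2-normalised, so the dot product
    is the cosine similarity.
    """
    scores = [sum(q * d for q, d in zip(query_vec, doc)) for doc in doc_vecs]
    return sorted(enumerate(scores), key=lambda pair: pair[1], reverse=True)

# Toy 2-dim unit vectors standing in for real embeddings.
query = [0.6, 0.8]
docs = [[1.0, 0.0], [0.6, 0.8], [0.0, 1.0]]

ranking = rank_by_similarity(query, docs)
for index, score in ranking:
    print(f"doc {index}: {score:.3f}")
```

With real output, the rows of `embeddings.tolist()` could be passed in place of the toy vectors.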

Model Specifications

| Property | Value |
|---|---|
| Base Model | Qwen/Qwen3-Embedding-4B |
| Architecture | Decoder-only (Qwen3ForCausalLM) with last-token pooling |
| Parameters | 4B (pre-quantization) |
| Context Length | 32K tokens |
| Embedding Dim | 2560 |
| BF16 Size | ~7.5 GB |
| License | Apache-2.0 |
| Languages | 100+ (multilingual) |

License

Apache 2.0, inherited from the upstream Qwen3-Embedding model. Free for research and commercial use.
