library_name: mlx-embeddings
tags:
  - mlx
  - mlx-embeddings
  - embeddings
  - sentence-similarity
  - feature-extraction
  - quantized
  - 4bit
  - qwen
  - qwen3
  - qwen3-embedding
base_model: Qwen/Qwen3-Embedding-4B
license: apache-2.0
pipeline_tag: feature-extraction
language:
  - en
  - zh
  - multilingual

Qwen3-Embedding-4B MLX 4-bit

MLX 4-bit quantization of Qwen/Qwen3-Embedding-4B, produced with mlx-embeddings on Apple Silicon.

What is this?

Qwen3-Embedding-4B is a decoder-only, LLM-style text embedding model from the Qwen3 family. It uses last-token pooling, taking the hidden state of the final non-padding token as the sentence vector, to produce dense vector representations. The upstream model scores near the top of the MMTEB multilingual benchmark and is released under Apache-2.0.
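To make last-token pooling concrete, here is a minimal pure-Python sketch on a toy batch. The helper names (`last_token_pool`, `l2_normalize`) and the numbers are illustrative assumptions, not the model's or mlx-embeddings' actual code. The idea is to pick, for each sequence, the hidden state at the last position where the attention mask is 1, then scale it to unit length.

```python
import math

def last_token_pool(hidden_states, attention_mask):
    """Return the hidden state of the last non-padding token of each sequence."""
    pooled = []
    for states, mask in zip(hidden_states, attention_mask):
        # Highest index where the mask is 1; works for left or right padding.
        last = max(i for i, m in enumerate(mask) if m == 1)
        pooled.append(states[last])
    return pooled

def l2_normalize(vec):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

# Toy batch: 2 sequences, 3 positions, hidden size 2 (made-up values).
hidden_states = [
    [[1.0, 0.0], [3.0, 4.0], [9.0, 9.0]],  # right-padded: last real token at index 1
    [[0.5, 0.5], [2.0, 0.0], [0.0, 5.0]],  # no padding: last real token at index 2
]
attention_mask = [[1, 1, 0], [1, 1, 1]]

toy_embeddings = [l2_normalize(v) for v in last_token_pool(hidden_states, attention_mask)]
print(toy_embeddings)
```

The padded position in the first sequence is ignored even though its values are large, which is the point of consulting the attention mask.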

Quantization

Method: MLX affine quantization (mlx_embeddings.convert), group_size=64
Bits per weight: 4
Output size: 2.1 GB (vs ~7.5 GB for bf16 source)
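As a rough sanity check on these sizes, the sketch below estimates the on-disk footprint from the parameter count. It assumes a round 4.0e9 parameters, that every weight is quantized, and that each group of 64 weights also stores a 16-bit scale and a 16-bit bias (32 extra bits per group). These are simplifying assumptions for illustration, and the function name is hypothetical. Real checkpoints can differ, for example if some layers are left unquantized.

```python
GIB = 2 ** 30  # bytes per GiB

def quantized_bytes(n_params, bits=4, group_size=64, overhead_bits=32):
    """Estimate bytes for group-wise affine quantization.

    overhead_bits is the assumed per-group cost of one scale and one bias.
    """
    bits_per_weight = bits + overhead_bits / group_size  # 4 + 32/64 = 4.5
    return n_params * bits_per_weight / 8

n_params = 4.0e9  # assumed round parameter count

bf16_gib = n_params * 16 / 8 / GIB
q4_gib = quantized_bytes(n_params) / GIB

print(f"bf16 estimate:  {bf16_gib:.2f} GiB")
print(f"4-bit estimate: {q4_gib:.2f} GiB")
```

Under these assumptions the estimates land close to the figures quoted above when those are read as GiB, and the effective cost is about 4.5 bits per weight rather than exactly 4.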

Quickstart

from mlx_embeddings import load

model, tokenizer = load("majentik/Qwen3-Embedding-4B-MLX-4bit")

inputs = tokenizer(
    ["What is the capital of France?", "Paris is the capital of France."],
    padding=True, truncation=True, return_tensors="mlx"
)
outputs = model(inputs["input_ids"], attention_mask=inputs["attention_mask"])
embeddings = outputs.text_embeds  # already L2-normalised, shape [batch, dim]

For sentence similarity:

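# `embeddings` comes from the Quickstart above. Rows are L2-normalised,
# so the dot product of two rows is their cosine similarity.
e = embeddings
scores = (e[0] @ e[1:].T).tolist()  # first text scored against each remaining text
print(scores)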
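The same scoring idea extends to ranking several candidates against one query. The following is a small pure-Python sketch with made-up 3-dimensional vectors standing in for real embeddings. The names `normalize`, `dot`, and the `docs` dictionary are hypothetical and only illustrate that, for unit-length vectors, sorting by dot product is sorting by cosine similarity.

```python
import math

def normalize(vec):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def dot(a, b):
    """Dot product; equals cosine similarity when both inputs are unit length."""
    return sum(x * y for x, y in zip(a, b))

# Made-up vectors standing in for model embeddings.
query = normalize([1.0, 2.0, 0.0])
docs = {
    "doc_a": normalize([2.0, 4.0, 0.0]),  # same direction as the query
    "doc_b": normalize([0.0, 0.0, 3.0]),  # orthogonal to the query
    "doc_c": normalize([1.0, 0.0, 0.0]),  # partially aligned
}

# Highest similarity first.
ranked = sorted(
    ((name, dot(query, vec)) for name, vec in docs.items()),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, score in ranked:
    print(f"{name}: {score:.3f}")
```

With real embeddings the vectors would come from the model output instead of hand-written lists; the ranking step is unchanged.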

Model Specifications

Property	Value
Base Model	Qwen/Qwen3-Embedding-4B
Architecture	Decoder-only (Qwen3ForCausalLM) with last-token pooling
Parameters	4B (pre-quantization)
Context Length	32K
Embedding Dim	2560
BF16 Size	~7.5 GB
License	apache-2.0
Languages	100+ (multilingual)

License

Apache 2.0, inherited from the upstream Qwen3-Embedding model. Free for research and commercial use.