	---
	library_name: mlx-embeddings
	tags:
	- mlx
	- mlx-embeddings
	- embeddings
	- sentence-similarity
	- feature-extraction
	- quantized
	- 4bit
	- qwen
	- qwen3
	- qwen3-embedding
	base_model: Qwen/Qwen3-Embedding-4B
	license: apache-2.0
	pipeline_tag: feature-extraction
	language:
	- en
	- zh
	- multilingual
	---

	# Qwen3-Embedding-4B MLX 4-bit

	MLX 4-bit quantization of [Qwen/Qwen3-Embedding-4B](https://huggingface.co/Qwen/Qwen3-Embedding-4B), produced with [mlx-embeddings](https://github.com/Blaizzy/mlx-embeddings) on Apple Silicon.

	## What is this?

	Qwen3-Embedding is a decoder-only, LLM-based text embedding model from the Qwen3 family. It uses last-token pooling: the hidden state of the final non-padding token becomes the dense vector for the whole input. It ranks near the top of the MMTEB multilingual benchmark and keeps the Apache 2.0 license.
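
Last-token pooling can be illustrated without loading any model. The sketch below is plain Python on made-up toy values, not the mlx-embeddings implementation; `last_token_pool` and `l2_normalize` are hypothetical helper names introduced only for this example.

```python
import math


def last_token_pool(hidden_states, attention_mask):
    """Pick, per sequence, the hidden state of the last non-padding token.

    hidden_states: nested lists shaped [batch][seq][dim]
    attention_mask: nested lists shaped [batch][seq], 1 = real token, 0 = padding
    """
    pooled = []
    for states, mask in zip(hidden_states, attention_mask):
        # Index of the last real token; works for left or right padding.
        last = max(i for i, m in enumerate(mask) if m == 1)
        pooled.append(states[last])
    return pooled


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


# Toy data (assumption): batch of 2, sequence length 3, hidden size 2.
hidden_states = [
    [[1.0, 0.0], [3.0, 4.0], [9.0, 9.0]],   # third position is padding
    [[0.0, 2.0], [5.0, 12.0], [8.0, 6.0]],  # no padding
]
attention_mask = [[1, 1, 0], [1, 1, 1]]

pooled = last_token_pool(hidden_states, attention_mask)
embeddings = [l2_normalize(v) for v in pooled]
print(embeddings)
```

In the real model the same idea is applied to arrays of shape `[batch, seq, 2560]`; the padded position in the first toy sequence is ignored because its mask entry is 0.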

	## Quantization

	- Method: MLX affine quantization (`mlx_embeddings.convert`), group_size=64
	- Bits per weight: 4
	- Output size: 2.1 GB (vs ~7.5 GB for bf16 source)
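
The sizes above can be sanity-checked with a little arithmetic. This is a rough estimate under stated assumptions, not an exact accounting of the files: it assumes every weight is quantized, that each group of 64 weights stores one 16-bit scale and one 16-bit bias, and that the model has exactly 4.0e9 parameters. The function names are hypothetical.

```python
def effective_bits_per_weight(bits: int, group_size: int, meta_bits: int = 16) -> float:
    """Bits per weight once the per-group scale and bias are amortized.

    Assumption: each group stores one scale and one bias of `meta_bits` each.
    """
    return bits + 2 * meta_bits / group_size


def size_gib(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GiB (2**30 bytes)."""
    return n_params * bits_per_weight / 8 / 2**30


N_PARAMS = 4.0e9  # assumption: nominal parameter count from the model name

bpw = effective_bits_per_weight(bits=4, group_size=64)
quantized_gib = size_gib(N_PARAMS, bpw)
bf16_gib = size_gib(N_PARAMS, 16)

print(f"effective bits per weight: {bpw}")
print(f"4-bit estimate: {quantized_gib:.2f} GiB")
print(f"bf16 estimate:  {bf16_gib:.2f} GiB")
```

Under these assumptions the estimate is about 4.5 bits per weight, which gives roughly 2.1 GiB for the quantized weights and about 7.5 GiB for bf16, in line with the figures listed above.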

	## Quickstart

	```python
	from mlx_embeddings import load

	model, tokenizer = load("majentik/Qwen3-Embedding-4B-MLX-4bit")

	inputs = tokenizer(
	    ["What is the capital of France?", "Paris is the capital of France."],
	    padding=True, truncation=True, return_tensors="mlx",
	)
	outputs = model(inputs["input_ids"], attention_mask=inputs["attention_mask"])
	embeddings = outputs.text_embeds  # already L2-normalised, shape [batch, dim]
	```

	For sentence similarity:

	```python
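	# The embeddings are L2-normalised, so a dot product equals cosine similarity.
	query, docs = embeddings[0], embeddings[1:]
	scores = (query @ docs.T).tolist()  # one score per remaining sentence
	print(scores)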
	```
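
With more than one candidate sentence, the same dot-product scores can be used to rank results. The following is a minimal sketch in plain Python on made-up unit vectors; `rank_by_similarity` is a hypothetical helper, not part of mlx-embeddings, and it assumes the vectors are already L2-normalised.

```python
def rank_by_similarity(query_vec, doc_vecs, top_k=3):
    """Return (doc_index, score) pairs sorted by descending dot product.

    Assumes all vectors are unit length, so the dot product is the cosine similarity.
    """
    scores = [sum(q * d for q, d in zip(query_vec, doc)) for doc in doc_vecs]
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [(i, scores[i]) for i in order[:top_k]]


# Toy unit vectors (assumption): real embeddings from this model have 2560 dimensions.
query = [1.0, 0.0]
docs = [[0.6, 0.8], [1.0, 0.0], [0.0, 1.0]]

ranking = rank_by_similarity(query, docs, top_k=2)
print(ranking)
```

To use it with the quickstart output, convert the arrays first, for example `rank_by_similarity(embeddings[0].tolist(), embeddings[1:].tolist())`.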

	## Model Specifications

	| Property | Value |
	|---|---|
	| Base Model | [Qwen/Qwen3-Embedding-4B](https://huggingface.co/Qwen/Qwen3-Embedding-4B) |
	| Architecture | Decoder-only (Qwen3ForCausalLM) with last-token pooling |
	| Parameters | 4.0B (pre-quantization) |
	| Context Length | 32K tokens |
	| Embedding Dim | 2560 |
	| BF16 Size | ~7.5 GB |
	| License | Apache 2.0 |
	| Languages | 100+ (multilingual) |
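
The embedding dimension determines how much space a vector index needs. The estimate below is a back-of-the-envelope sketch that counts raw vector storage only and ignores any index overhead; `index_size_bytes` is a hypothetical helper and the corpus size of one million is an arbitrary example.

```python
def index_size_bytes(n_vectors: int, dim: int = 2560, bytes_per_value: int = 4) -> int:
    """Raw storage for n_vectors embeddings (default: float32 values, 2560 dims)."""
    return n_vectors * dim * bytes_per_value


per_vector = index_size_bytes(1)                                # one float32 vector
million_f32 = index_size_bytes(1_000_000)                       # float32 corpus
million_f16 = index_size_bytes(1_000_000, bytes_per_value=2)    # float16 corpus

print(f"one vector:          {per_vector} bytes")
print(f"1M vectors, float32: {million_f32 / 1e9:.2f} GB")
print(f"1M vectors, float16: {million_f16 / 1e9:.2f} GB")
```

Storing the vectors in float16 halves the footprint; whether that affects retrieval quality for a given corpus is something to measure rather than assume.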

	## License

	Apache 2.0, inherited from the upstream Qwen3-Embedding model. Research and commercial use are both permitted.

	## See also

	- Base: [Qwen/Qwen3-Embedding-4B](https://huggingface.co/Qwen/Qwen3-Embedding-4B)
	- Official GGUF: [Qwen/Qwen3-Embedding-4B-GGUF](https://huggingface.co/Qwen/Qwen3-Embedding-4B-GGUF) (if published by Qwen)
	- mlx-embeddings package: [Blaizzy/mlx-embeddings](https://github.com/Blaizzy/mlx-embeddings)
	- Garden hub: [majentik/garden](https://huggingface.co/majentik/garden)
	- MTEB leaderboard: [mteb/leaderboard](https://huggingface.co/spaces/mteb/leaderboard)