mxbai-embed-large-v1 — MLX int6 quantization

6-bit group-quantized port of mixedbread-ai/mxbai-embed-large-v1 for the MLX framework on Apple Silicon.

What was quantized

Linear layers in all 24 BERT encoder blocks (attention Q/K/V/output, FFN intermediate/output) and the pooler dense layer are quantized to 6-bit affine, group_size=64.
Embedding tables (word, position, token type) are kept in fp16 — quantizing them tends to hurt retrieval quality more than the saved memory is worth.
LayerNorm weights remain in fp16.
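
To make "6-bit affine, group_size=64" concrete, here is a simplified sketch of per-group affine quantization in plain Python. It is an illustration under stated assumptions, not MLX's kernel: MLX additionally packs the integer codes into 32-bit words and may choose the scale and bias slightly differently. The helper names (quantize_group, quantize_row, effective_bits_per_weight) are hypothetical.

```python
import random

def quantize_group(weights, bits=6):
    """Affine-quantize one group of floats to integer codes in 0..2**bits - 1.

    Returns (codes, scale, bias) such that w is approximately scale * code + bias.
    """
    levels = (1 << bits) - 1            # 63 steps for 6-bit codes
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / levels or 1.0   # fall back to 1.0 for a constant group
    codes = [round((w - lo) / scale) for w in weights]
    return codes, scale, lo

def dequantize_group(codes, scale, bias):
    return [scale * c + bias for c in codes]

def quantize_row(row, group_size=64, bits=6):
    """Quantize one weight row in consecutive groups of `group_size` values."""
    assert len(row) % group_size == 0
    return [quantize_group(row[i:i + group_size], bits)
            for i in range(0, len(row), group_size)]

def effective_bits_per_weight(bits=6, group_size=64, param_bits=16):
    # Assumption: each group stores one fp16 scale and one fp16 bias
    # in addition to its integer codes.
    return bits + 2 * param_bits / group_size

random.seed(0)
row = [random.gauss(0.0, 0.02) for _ in range(1024)]  # one row of a 1024-wide layer
groups = quantize_row(row)
restored = [w for g in groups for w in dequantize_group(*g)]
max_err = max(abs(a - b) for a, b in zip(row, restored))
print(len(groups), effective_bits_per_weight(), max_err)
```

Under these assumptions a quantized weight costs 6.5 bits rather than 6, because every group of 64 codes carries its own scale and bias. The rounding error within a group is at most half that group's scale.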

Why int6

Internal benchmark across fp16 / int4 / int5 / int6 / int8 on a 200-query monolingual English retrieval set (50 fact groups × 4 paraphrases vs 100 distractor facts):

Variant	Disk size	Peak GPU memory (embed)	Mean embed latency	Top-1 stability vs fp16	Top-1 accuracy vs ground truth	Top-5 Jaccard vs fp16	MRR drift vs fp16
fp16	639 MB	1411 MB	27.6 ms	—	93.6%	—	—
int8	368 MB	538 MB	25.4 ms	99.5%	93.1%	0.99	+0.0033
int6	296 MB	466 MB	16.1 ms	99.0%	93.6%	0.97	+0.0000
int5	260 MB	430 MB	17.3 ms	99.0%	93.6%	0.94	+0.0008
int4	224 MB	394 MB	13.0 ms	97.5%	95.1%	0.87	-0.0082

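int6 matched the fp16 baseline exactly on top-1 accuracy (93.6%) and MRR (drift +0.0000), and it has the highest top-5 Jaccard among the sub-8-bit variants (0.97; only int8 is higher, at 0.99). In this benchmark it also embeds about 1.6× faster than int8 (16.1 ms vs 25.4 ms) and about 1.7× faster than fp16 (27.6 ms), likely because of smaller intermediate matmul tensors.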
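
The card does not spell out how the table's metric columns are computed. The sketch below shows one plausible set of definitions, and every one of them is an assumption: top-1 stability as the share of queries whose top hit matches fp16, top-5 Jaccard as the mean overlap of the two top-5 sets, and MRR drift as the quantized MRR minus the fp16 MRR. The toy rankings are made up for illustration and are not benchmark data.

```python
def top1_stability(ref_rankings, test_rankings):
    """Fraction of queries whose top-ranked document is the same in both runs."""
    same = sum(r[0] == t[0] for r, t in zip(ref_rankings, test_rankings))
    return same / len(ref_rankings)

def topk_jaccard(ref_rankings, test_rankings, k=5):
    """Mean Jaccard overlap |A & B| / |A | B| of the two top-k sets per query."""
    total = 0.0
    for r, t in zip(ref_rankings, test_rankings):
        a, b = set(r[:k]), set(t[:k])
        total += len(a & b) / len(a | b)
    return total / len(ref_rankings)

def mrr(rankings, gold):
    """Mean reciprocal rank of the gold document (contributes 0 if it is absent)."""
    total = 0.0
    for ranking, target in zip(rankings, gold):
        if target in ranking:
            total += 1.0 / (ranking.index(target) + 1)
    return total / len(rankings)

# Toy example: two queries, six candidate documents each (illustrative only).
fp16_rank = [["a", "b", "c", "d", "e", "f"], ["x", "y", "z", "w", "v", "u"]]
quant_rank = [["a", "c", "b", "d", "f", "e"], ["y", "x", "z", "w", "v", "u"]]
gold = ["a", "x"]

stability = top1_stability(fp16_rank, quant_rank)
jaccard = topk_jaccard(fp16_rank, quant_rank, k=5)
mrr_drift = mrr(quant_rank, gold) - mrr(fp16_rank, gold)
print(stability, round(jaccard, 4), mrr_drift)
```

The same functions can be run on rankings from your own corpus to check a quantized variant against fp16 before switching.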

Usage with MLXEmbedders (Swift)

import MLXEmbedders
import MLXLMCommon

let config = ModelConfiguration(
    id: .id("lorelaiassistant/mxbai-embed-large-v1-mlx-int6")
)

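// hubDownloader and huggingFaceTokenizerLoader are not defined in this snippet;
// the caller is expected to supply them.
let container = try await EmbedderModelFactory.shared.loadContainer(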
    from: hubDownloader,
    using: huggingFaceTokenizerLoader,
    configuration: config,
    progressHandler: { _ in }
)

The MLXEmbedders loader auto-detects the quantization block in config.json and applies mlx.nn.quantize to the matching Linear layers at load time.

Usage with mlx.core (Python)

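mlx.core.load("model.safetensors") returns the quantized weights as a dict of arrays. To use them, either build a BERT module whose quantized layers are mlx.nn.QuantizedLinear, or build a fresh fp16 model, call mlx.nn.quantize(model, group_size=64, bits=6, class_predicate=lambda path, m: isinstance(m, mlx.nn.Linear)), and then load the weights. The class_predicate matters because mlx.nn.quantize by default also quantizes embedding layers, which this checkpoint keeps in fp16.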
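
Before wiring up a model, it can help to check which tensors in the checkpoint are actually quantized. The sketch below reads only the safetensors header (an 8-byte little-endian length followed by a JSON table of tensor names, dtypes, and shapes) using the standard library, so it does not need MLX. The function names are hypothetical, and packed_columns assumes that codes are packed into 32-bit words at in_features * bits / 32 words per row.

```python
import json
import os
import struct

def read_safetensors_header(path):
    """Return {tensor_name: {"dtype", "shape", "data_offsets"}} for a .safetensors file."""
    with open(path, "rb") as f:
        (header_len,) = struct.unpack("<Q", f.read(8))  # little-endian u64 length prefix
        header = json.loads(f.read(header_len).decode("utf-8"))
    header.pop("__metadata__", None)                    # optional free-form metadata entry
    return header

def summarize(header):
    """Group tensor names by dtype, e.g. {"U32": [...], "F16": [...]}."""
    by_dtype = {}
    for name, info in sorted(header.items()):
        by_dtype.setdefault(info["dtype"], []).append(name)
    return by_dtype

def packed_columns(in_features, bits=6):
    # Assumption: quantized codes are packed into uint32 words,
    # so a row of in_features codes occupies in_features * bits / 32 words.
    return in_features * bits // 32

# Only runs if a checkpoint is present in the working directory.
if os.path.exists("model.safetensors"):
    for dtype, names in summarize(read_safetensors_header("model.safetensors")).items():
        print(dtype, len(names))
```

If the description in this card holds, packed Linear weights should show up as U32, while their scales and biases, the embedding tables, and the LayerNorm parameters should show up as F16.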

Caveats

The vector space is not interchangeable with that of the fp16 base model. If you have an existing index built with fp16 mxbai embeddings, you must re-embed it before switching.
Tested on a synthetic 200-query English retrieval set; before high-stakes production use, validate on your domain.

Attribution

Base model © Mixedbread AI, released under Apache 2.0. This quantization is released under the same license. See the original repository for the model card, citation, and training details.
