mxbai-embed-large-v1: MLX int6 quantization
6-bit group-quantized port of mixedbread-ai/mxbai-embed-large-v1 for the MLX framework on Apple Silicon.
What was quantized
- Linear layers in all 24 BERT encoder blocks (attention Q/K/V/output, FFN intermediate/output) and the pooler dense layer are quantized to 6-bit affine, group_size=64.
- Embedding tables (word, position, token type) are kept in fp16; quantizing them tends to hurt retrieval quality more than the saved memory is worth.
- LayerNorm weights remain in fp16.
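To make "6-bit affine, group_size=64" concrete, here is a minimal pure-Python sketch of affine group quantization. It is an illustration of the general technique, not MLX's actual kernel: each group of 64 consecutive weights gets its own scale and bias, and every weight is stored as a 6-bit code in the range 0..63. The function names and the random test row are hypothetical.

```python
import random


def quantize_group(weights, bits=6):
    """Affine-quantize one group of floats to unsigned `bits`-bit codes."""
    lo, hi = min(weights), max(weights)
    levels = (1 << bits) - 1                     # 63 steps for 6-bit codes
    span = hi - lo
    scale = span / levels if span > 0 else 1.0   # constant groups: any scale works
    codes = [round((w - lo) / scale) for w in weights]
    return codes, scale, lo                      # `lo` plays the role of the bias


def dequantize_group(codes, scale, bias):
    """Reconstruct approximate weights: w ~ code * scale + bias."""
    return [c * scale + bias for c in codes]


def quantize_row(row, group_size=64, bits=6):
    """Split one weight row into groups and quantize each independently."""
    return [
        quantize_group(row[start:start + group_size], bits)
        for start in range(0, len(row), group_size)
    ]


# Hypothetical weight row: 1024 values, i.e. 16 groups of 64.
random.seed(0)
row = [random.gauss(0.0, 0.02) for _ in range(1024)]
groups = quantize_row(row)
restored = [w for group in groups for w in dequantize_group(*group)]
max_err = max(abs(a - b) for a, b in zip(row, restored))
print(f"groups: {len(groups)}, max reconstruction error: {max_err:.6f}")
```

The reconstruction error of each weight is bounded by half of its group's scale, which is why small groups (64 here) keep the error low even when a row contains a few outliers.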
Why int6
Internal benchmark across fp16 / int4 / int5 / int6 / int8 on a 200-query monolingual English retrieval set (50 fact groups × 4 paraphrases vs 100 distractor facts):
| Variant | Disk | GPU peak (embed) | Mean embed time | Top-1 stability vs fp16 | Top-1 vs ground truth | Top-5 Jaccard | MRR drift |
|---|---|---|---|---|---|---|---|
| fp16 | 639 MB | 1411 MB | 27.6 ms | n/a | 93.6% | n/a | n/a |
| int8 | 368 MB | 538 MB | 25.4 ms | 99.5% | 93.1% | 0.99 | +0.0033 |
| int6 | 296 MB | 466 MB | 16.1 ms | 99.0% | 93.6% | 0.97 | +0.0000 |
| int5 | 260 MB | 430 MB | 17.3 ms | 99.0% | 93.6% | 0.94 | +0.0008 |
| int4 | 224 MB | 394 MB | 13.0 ms | 97.5% | 95.1% | 0.87 | -0.0082 |
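int6 matched the fp16 baseline exactly on top-1 accuracy (93.6%) and MRR (drift +0.0000), with the highest top-5 Jaccard (0.97) among the sub-8-bit variants; only int8 scores higher (0.99). In this benchmark it also embeds about 1.6× faster than int8 (16.1 ms vs 25.4 ms) and about 1.7× faster than fp16 (27.6 ms), likely because of smaller intermediate matmul tensors.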
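The Disk column can be sanity-checked with a back-of-the-envelope estimate. The sketch below assumes the standard BERT-large shapes (hidden size 1024, FFN size 4096, 24 layers, vocabulary 30522, 512 positions, 2 token types), one fp16 scale plus one fp16 bias per group of 64 weights, and that "MB" in the table means MiB. None of these assumptions is stated in the card, and the safetensors header is ignored.

```python
MIB = 1024 ** 2

# Assumed BERT-large shapes (not stated in the card).
HIDDEN, FFN, LAYERS = 1024, 4096, 24
VOCAB, POSITIONS, TOKEN_TYPES = 30522, 512, 2

# Quantized tensors: Q/K/V/output + two FFN matrices per block, plus the pooler.
linear_weights = LAYERS * (4 * HIDDEN * HIDDEN + 2 * HIDDEN * FFN) + HIDDEN * HIDDEN
# Everything below stays in fp16 in every variant.
linear_biases = LAYERS * (4 * HIDDEN + FFN + HIDDEN) + HIDDEN
embeddings = (VOCAB + POSITIONS + TOKEN_TYPES) * HIDDEN
layer_norms = (1 + 2 * LAYERS) * 2 * HIDDEN   # weight + bias per LayerNorm


def estimate_mib(bits=None, group_size=64):
    """Estimated checkpoint size in MiB; bits=None means plain fp16."""
    fp16_rest = 2 * (linear_biases + embeddings + layer_norms)
    if bits is None:
        return (2 * linear_weights + fp16_rest) / MIB
    packed = linear_weights * bits / 8                 # packed integer codes
    overhead = (linear_weights // group_size) * 2 * 2  # fp16 scale + fp16 bias per group
    return (packed + overhead + fp16_rest) / MIB


for label, bits in [("fp16", None), ("int8", 8), ("int6", 6), ("int5", 5), ("int4", 4)]:
    print(f"{label}: {estimate_mib(bits):.0f} MiB")
```

Under these assumptions the rounded estimates agree with the Disk column for all five variants. The per-group scale and bias add about 0.5 bits per weight, so int6 costs roughly 6.5 bits per quantized weight, and the fp16 embedding tables account for most of the remaining size.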
Usage with MLXEmbedders (Swift)
import MLXEmbedders
import MLXLMCommon
let config = ModelConfiguration(
    id: .id("lorelaiassistant/mxbai-embed-large-v1-mlx-int6")
)
let container = try await EmbedderModelFactory.shared.loadContainer(
    from: hubDownloader,
    using: huggingFaceTokenizerLoader,
    configuration: config,
    progressHandler: { _ in }
)
The MLXEmbedders loader auto-detects the quantization block in config.json and applies mlx.nn.quantize to the matching Linear layers at load time.
Usage with mlx.core (Python)
The standard mlx.core.load("model.safetensors") returns the quantized weights; build a BERT module that uses mlx.nn.QuantizedLinear (or call mlx.nn.quantize(model, group_size=64, bits=6) on a fresh fp16 model and load the weights afterward).
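The second route (quantize a fresh fp16 model, then load the weights) might look like the sketch below. Several things here are assumptions rather than facts from this card: the `quantization` key name in config.json, the helper names `read_quantization` and `load_quantized`, and a caller-supplied `model` that is an fp16 BERT `mlx.nn.Module` whose parameter names match the checkpoint. The BERT module itself is not shown.

```python
import json
from pathlib import Path


def read_quantization(config_path):
    """Return (group_size, bits) from the assumed `quantization` block, or None."""
    config = json.loads(Path(config_path).read_text())
    block = config.get("quantization")
    if block is None:
        return None
    return block["group_size"], block["bits"]


def load_quantized(model, model_dir):
    """Quantize the Linear layers of a caller-supplied fp16 model, then load int6 weights."""
    # Imported lazily so the config helper above also works without MLX installed.
    import mlx.core as mx
    import mlx.nn as nn

    model_dir = Path(model_dir)
    group_size, bits = read_quantization(model_dir / "config.json") or (64, 6)

    # Restrict quantization to nn.Linear so the embedding tables stay in fp16,
    # matching the layout described in "What was quantized".
    nn.quantize(
        model,
        group_size=group_size,
        bits=bits,
        class_predicate=lambda path, module: isinstance(module, nn.Linear),
    )

    weights = mx.load(str(model_dir / "model.safetensors"))
    model.load_weights(list(weights.items()))
    return model
```

The `class_predicate` matters because `mlx.nn.quantize` by default converts every layer that supports quantization, which would include the embedding tables. Quantizing before loading is what gives the module the packed weight, scale, and bias parameters that the checkpoint keys are expected to fill.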
Caveats
- Vector space is incompatible with the fp16 base model. If you have an existing index built with fp16 mxbai, you must re-embed it before switching.
- Tested on a synthetic 200-query English retrieval set; before high-stakes production use, validate on your domain.
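For validating on your own domain, the comparison metrics can be sketched as below. The card does not define them, so these definitions are assumptions: top-1 stability is the fraction of queries whose top hit is unchanged between two variants, top-k Jaccard is the mean set overlap of the top-k results, and MRR drift is the quantized MRR minus the baseline MRR. Each ranking is a list of document ids ordered by similarity, and the toy ids are hypothetical.

```python
def top1_stability(base_rankings, quant_rankings):
    """Fraction of queries where both variants return the same top-1 document."""
    same = sum(b[0] == q[0] for b, q in zip(base_rankings, quant_rankings))
    return same / len(base_rankings)


def topk_jaccard(base_rankings, quant_rankings, k=5):
    """Mean Jaccard overlap between the top-k sets of the two variants."""
    total = 0.0
    for b, q in zip(base_rankings, quant_rankings):
        base_set, quant_set = set(b[:k]), set(q[:k])
        total += len(base_set & quant_set) / len(base_set | quant_set)
    return total / len(base_rankings)


def mrr(rankings, relevant):
    """Mean reciprocal rank of the single relevant document per query."""
    total = 0.0
    for ranking, target in zip(rankings, relevant):
        if target in ranking:
            total += 1.0 / (ranking.index(target) + 1)
    return total / len(rankings)


# Toy example with two queries and hypothetical document ids.
base = [["d1", "d2", "d3"], ["d4", "d5", "d6"]]
quant = [["d1", "d3", "d2"], ["d5", "d4", "d6"]]
relevant = ["d1", "d4"]

stability = top1_stability(base, quant)
jaccard_at_2 = topk_jaccard(base, quant, k=2)
mrr_drift = mrr(quant, relevant) - mrr(base, relevant)
print(stability, jaccard_at_2, mrr_drift)
```

Running the same queries through the fp16 and int6 models and feeding both sets of rankings to these functions gives numbers comparable in spirit to the benchmark table, on data that reflects your actual workload.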
Attribution
Base model © Mixedbread AI, released under Apache 2.0. This quantization preserves the same license. See the original repository for model card, citation, and training details.