LFM2.5-ColBERT-350M — MLX (4-bit)

MLX build of LiquidAI/LFM2.5-ColBERT-350M, a multilingual late-interaction retriever (128-dim vector per token, scored with MaxSim), for local inference on Apple Silicon with MLX.
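
As a rough illustration of late-interaction scoring, the sketch below computes MaxSim in plain Python: each query token vector is matched to its best-scoring document token vector, and those maxima are summed. The function names (`l2_normalize`, `maxsim`, `rank`) are hypothetical and not part of this repository, and the L2 normalization of token vectors is an assumption based on common ColBERT practice rather than something this card states.

```python
import math

def l2_normalize(vec):
    # Scale a vector to unit length; all-zero vectors are returned unchanged.
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def maxsim(query_tokens, doc_tokens):
    # Assumption: token vectors are L2-normalized, so the dot product is cosine similarity.
    q = [l2_normalize(v) for v in query_tokens]
    d = [l2_normalize(v) for v in doc_tokens]
    # For every query token keep its best match in the document, then sum the maxima.
    return sum(max(dot(qv, dv) for dv in d) for qv in q)

def rank(query_tokens, pool):
    # Brute-force ranking: score every document in the pool, best first.
    scores = {doc_id: maxsim(query_tokens, toks) for doc_id, toks in pool.items()}
    return sorted(scores, key=scores.get, reverse=True)
```

In the real model each token vector has 128 dimensions; the two-dimensional vectors one would use to try this out only keep the arithmetic easy to follow.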

All weights, architecture, and behavior are LiquidAI's. This repository changes the file format (PyTorch/safetensors → MLX) and applies post-training 4-bit quantization (affine, group size 64) to the bf16 MLX conversion. Every Linear and embedding layer (including the 1024→128 Dense projection head) is quantized; the remaining layers (conv, norms) stay in bf16. See the original model card for training details and intended use.
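
To make "affine, group size 64" concrete, here is a minimal pure-Python sketch of group-wise affine quantization: each run of 64 weights gets its own scale and offset, and every weight is stored as a 4-bit code. It illustrates the scheme only. MLX's actual kernel may pick the scale and offset slightly differently and packs the codes into 32-bit words, and the helper names below are hypothetical.

```python
def quantize_group(weights, bits=4):
    # Map one group of floats to integer codes in [0, 2**bits - 1].
    levels = 2 ** bits - 1
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / levels if hi > lo else 1.0  # avoid dividing by zero
    codes = [round((w - lo) / scale) for w in weights]
    return codes, scale, lo  # lo acts as the per-group bias

def dequantize_group(codes, scale, bias):
    # Reconstruct approximate weights: w is roughly scale * code + bias.
    return [scale * q + bias for q in codes]

def quantize_row(row, bits=4, group_size=64):
    # Split one weight row into consecutive groups and quantize each independently.
    if len(row) % group_size:
        raise ValueError("row length must be a multiple of group_size")
    return [quantize_group(row[i:i + group_size], bits)
            for i in range(0, len(row), group_size)]
```

Under this scheme, a row of the 1024→128 projection head (1024 input weights) would split into 16 groups, and the reconstruction error of any weight is bounded by half of its group's scale.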

Quantization details

Quantized with mlx.nn.quantize(mode='affine', bits=4, group_size=64) — the exact configuration benchmarked below.
Verified bit-exact: the reloaded checkpoint's encodings are identical (max abs diff 0) to those of the in-memory-quantized model used for the benchmark (verify_export.py), so the shipped artifact is the model measured below. Downstream ColBERT NDCG can vary by up to 0.002 across processes because of GPU MaxSim reduction order; this is a scoring artifact, not a change in the weights.
Reload applies the quant from config.json["quantization"] before loading weights (see retrieval.load_model).
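
The order matters because a quantized checkpoint stores packed weights plus per-group scales and biases, which only fit a model whose layers have already been swapped for their quantized counterparts. The sketch below shows that order under stated assumptions: the key names inside `config.json["quantization"]` (`group_size`, `bits`, `mode`) are assumed, `quantization_kwargs` and `load_quantized` are hypothetical helpers, and the repository's own `retrieval.load_model` remains the reference implementation.

```python
import json

def quantization_kwargs(config_text):
    # Extract quantization settings from the text of config.json.
    # Assumption: the block uses the keys "group_size", "bits" and "mode".
    quant = json.loads(config_text).get("quantization")
    if quant is None:
        return None  # unquantized checkpoint
    return {
        "group_size": int(quant.get("group_size", 64)),
        "bits": int(quant.get("bits", 4)),
        "mode": quant.get("mode", "affine"),
    }

def load_quantized(model, config_text, weights_path):
    # Quantize the freshly built model first, then load the stored tensors.
    import mlx.nn as nn  # imported lazily so the parsing helper runs without MLX

    kwargs = quantization_kwargs(config_text)
    if kwargs is not None:
        nn.quantize(model, **kwargs)  # swaps supported layers for quantized versions
    model.load_weights(weights_path)
    return model
```

Here `model` stands for an already constructed `mlx.nn.Module` with the LFM2.5-ColBERT architecture; building it is outside the scope of this sketch.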

Evaluation

Retrieval quality of this checkpoint (and its sibling precisions), measured as NDCG@10 / Recall@10 on judged pools. Retention is the quantized metric divided by the bf16 metric, computed per dataset and then averaged over datasets.
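
For readers unfamiliar with the two metrics, the following sketch computes them for a single query. It assumes the linear-gain form of DCG and a Recall@k that divides by the total number of judged relevant passages; evaluation toolkits differ in such details, and this card does not say which variant was used. The function names are illustrative.

```python
import math

def dcg(gains):
    # Discounted cumulative gain with linear gain and a log2 position discount.
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains))

def ndcg_at_k(ranking, qrels, k=10):
    # ranking: doc ids, best first; qrels: {doc_id: relevance grade} for one query.
    gains = [qrels.get(doc_id, 0) for doc_id in ranking[:k]]
    ideal = sorted(qrels.values(), reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0

def recall_at_k(ranking, qrels, k=10):
    # Fraction of the judged relevant passages that appear in the top k.
    relevant = {doc_id for doc_id, rel in qrels.items() if rel > 0}
    if not relevant:
        return 0.0
    return len(relevant & set(ranking[:k])) / len(relevant)
```

A dataset-level score is then the mean of the per-query values.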

Setup. English = the four NanoBEIR sets (full small corpora, ~2–5k passages, 50 queries each). Multilingual = MIRACL dev (the real queries and relevance judgments) for Spanish, German, Japanese, Arabic, each scored over a reduced pool of ~6k passages (judged positives + hard-mined negatives + sampled distractors, from mteb/MIRACLRetrievalHardNegatives), 100 queries each. Reduced pools make the task easier, so absolute scores are higher than on full-corpus MIRACL and are not leaderboard-comparable. Every precision searches the identical pool, however, so the retention numbers (the point of this table) are sound. ColBERT uses brute-force MaxSim with no query augmentation, so its absolute scores sit slightly below those of a full PLAID setup.

Summary (mean over 8 datasets)

precision	NDCG@10	NDCG retention	Recall@10	Recall retention	size
bf16	0.740	100.0%	0.780	100.0%	707 MB
8-bit	0.741	100.0%	0.779	99.4%	376 MB
4-bit ◄	0.731	98.7%	0.780	99.7%	199 MB
mxfp4	0.730	98.5%	0.773	98.8%	—
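
The size column above can be sanity-checked with a back-of-the-envelope estimate. The sketch assumes that each group of 64 weights carries one 16-bit scale and one 16-bit bias on top of its codes, and it treats every weight as quantized, ignoring the conv and norm layers that stay in bf16. Both are simplifying assumptions, and the function names are illustrative.

```python
def bits_per_weight(bits, group_size=64, scale_bits=16, bias_bits=16):
    # Code bits plus the per-group scale and bias, amortized over the group.
    return bits + (scale_bits + bias_bits) / group_size

def estimated_size_mb(bf16_size_mb, bits, group_size=64):
    # Scale the bf16 size (16 bits per weight) by the effective bit width.
    return bf16_size_mb * bits_per_weight(bits, group_size) / 16
```

With the 707 MB bf16 size from the table, this gives about 198.8 MB at 4 bits (4.5 effective bits per weight) and about 375.6 MB at 8 bits (8.5 effective bits), in line with the 199 MB and 376 MB reported.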

NDCG@10 by dataset

dataset	bf16	8-bit	4-bit ◄	mxfp4
NanoNQ · en	0.757	0.751	0.716	0.742
NanoFiQA2018 · en	0.528	0.512	0.524	0.520
NanoSciFact · en	0.693	0.712	0.702	0.682
NanoNFCorpus · en	0.345	0.342	0.335	0.334
MIRACL · es	0.900	0.901	0.899	0.900
MIRACL · de	0.823	0.837	0.826	0.811
MIRACL · ja	0.934	0.933	0.923	0.926
MIRACL · ar	0.938	0.941	0.924	0.926
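
The NDCG retention column of the summary can be recomputed from the per-dataset table above. The sketch copies the eight rows verbatim and averages the per-dataset ratios against bf16; because it starts from the rounded three-decimal scores, it can only be expected to agree with the summary to within rounding. The names `NDCG_AT_10` and `mean_retention` are illustrative.

```python
# NDCG@10 per dataset, copied from the table above.
NDCG_AT_10 = {
    "NanoNQ":       {"bf16": 0.757, "8-bit": 0.751, "4-bit": 0.716, "mxfp4": 0.742},
    "NanoFiQA2018": {"bf16": 0.528, "8-bit": 0.512, "4-bit": 0.524, "mxfp4": 0.520},
    "NanoSciFact":  {"bf16": 0.693, "8-bit": 0.712, "4-bit": 0.702, "mxfp4": 0.682},
    "NanoNFCorpus": {"bf16": 0.345, "8-bit": 0.342, "4-bit": 0.335, "mxfp4": 0.334},
    "MIRACL-es":    {"bf16": 0.900, "8-bit": 0.901, "4-bit": 0.899, "mxfp4": 0.900},
    "MIRACL-de":    {"bf16": 0.823, "8-bit": 0.837, "4-bit": 0.826, "mxfp4": 0.811},
    "MIRACL-ja":    {"bf16": 0.934, "8-bit": 0.933, "4-bit": 0.923, "mxfp4": 0.926},
    "MIRACL-ar":    {"bf16": 0.938, "8-bit": 0.941, "4-bit": 0.924, "mxfp4": 0.926},
}

def mean_retention(table, precision, baseline="bf16"):
    # Mean of per-dataset ratios (not the ratio of the two column means).
    ratios = [row[precision] / row[baseline] for row in table.values()]
    return sum(ratios) / len(ratios)

for precision in ("8-bit", "4-bit", "mxfp4"):
    print(precision, f"{100 * mean_retention(NDCG_AT_10, precision):.1f}%")
```

Run on these rows, the mean of ratios reproduces the summary's 100.0%, 98.7%, and 98.5%. Dividing the column means instead would give roughly 98.8% for 4-bit, which is why the order of averaging is worth stating.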

License & attribution

Redistributed under the LFM Open License v1.0 (LICENSE) — the same license as the original model. Per Section 4, this notice records that the files were modified (format conversion to MLX + 4-bit quantization). The original work is by Liquid AI; this repository is an independent conversion, not affiliated with or endorsed by Liquid AI. The license includes a commercial-use threshold (Section 5) — review it for your use case.

Base model: LiquidAI/LFM2.5-ColBERT-350M
