
LFM2.5-Embedding-350M

LFM2.5-Embedding-350M is a dense bi-encoder for fast multilingual retrieval. It produces a single vector per document, which keeps the index small and fast to search, and it supports cross-lingual search across 11 languages.

  • Best-in-class multilingual accuracy for a dense embedder of its size.
  • Inference speed is on par with much smaller models, thanks to the efficient LFM2 backbone.
  • You can use it as a drop-in replacement in your current RAG pipelines.

Find more information about LFM2.5-Embedding-350M in our blog post.

🏃 How to run

Example usage with llama.cpp:

Start llama-server with the embeddings endpoint enabled:

llama-server -hf LiquidAI/LFM2.5-Embedding-350M-GGUF --embeddings
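
With the server running, requests go to its OpenAI-compatible /v1/embeddings endpoint. The sketch below shows one way to build a batched request body and to parse a response, assuming the server accepts a list of strings as "input" and returns OpenAI-style items carrying "index" and "embedding" fields. The helper names build_request and parse_response are hypothetical, and the sample response is handwritten for illustration, not real model output.

```python
import numpy as np

def build_request(texts, prefix):
    """Build an OpenAI-style embeddings request body for a batch of texts."""
    return {"input": [prefix + t for t in texts]}

def parse_response(body):
    """Return a (n_texts, dim) array, rows ordered by each item's "index"."""
    items = sorted(body["data"], key=lambda item: item["index"])
    return np.array([item["embedding"] for item in items], dtype=float)

request_body = build_request(["hi", "it is a bear"], "document: ")

# Handwritten sample response (not real model output), deliberately out of order.
sample_response = {
    "data": [
        {"index": 1, "embedding": [0.0, 1.0]},
        {"index": 0, "embedding": [1.0, 0.0]},
    ]
}
vectors = parse_response(sample_response)
print(request_body)
print(vectors.shape)
```

Sorting by "index" keeps each row aligned with the order of the input texts even if the items come back in a different order.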

Make requests to embed queries and documents, then rank the documents by cosine similarity. Note the asymmetric prompt prefixes: queries are prefixed with "query: " and documents with "document: ".

❯ uv run dense-retrieve.py

Score: -0.1783 | Q: What is panda? | D: hi
Score:  0.0511 | Q: What is panda? | D: it is a bear
Score:  0.5657 | Q: What is panda? | D: The giant panda (Ailuropoda melanoleuca), sometimes called a panda bear or simply panda, is a bear species endemic to China.
# /// script
# requires-python = ">=3.10"
# dependencies = ["numpy", "requests"]
# ///

# dense-retrieve.py
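import numpy as np
import requests

# Asymmetric prompt prefixes: one for queries, one for documents.
QUERY_PREFIX, DOC_PREFIX = "query: ", "document: "

def embed(text: str) -> np.ndarray:
    """Embed one string via llama-server and return a unit-length vector."""
    r = requests.post(
        "http://localhost:8080/v1/embeddings",
        json={"input": text},
        timeout=60,
    )
    r.raise_for_status()  # fail loudly if the server returned an error
    v = np.array(r.json()["data"][0]["embedding"])
    return v / np.linalg.norm(v)  # unit length, so a dot product is cosine similarity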

docs = [
    "hi",
    "it is a bear",
    "The giant panda (Ailuropoda melanoleuca), sometimes called a panda bear or simply panda, is a bear species endemic to China.",
]
query = "What is panda?"

q = embed(QUERY_PREFIX + query)
for doc in docs:
    d = embed(DOC_PREFIX + doc)
    print(f"Score: {float(q @ d):.4f} | Q: {query} | D: {doc}")
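
The script above scores one document at a time. For larger collections, the same cosine ranking can be computed in a single matrix product. The following is a minimal sketch under that assumption: rank_by_cosine is a hypothetical helper, and the toy 3-dimensional vectors stand in for real embeddings so the example runs without a server.

```python
import numpy as np

def rank_by_cosine(query_vec, doc_vecs, top_k=None):
    """Return (doc_index, score) pairs sorted by descending cosine similarity."""
    q = np.asarray(query_vec, dtype=float)
    d = np.asarray(doc_vecs, dtype=float)
    q = q / np.linalg.norm(q)                         # normalize the query
    d = d / np.linalg.norm(d, axis=1, keepdims=True)  # normalize each document row
    scores = d @ q                                    # one cosine score per document
    order = np.argsort(-scores)                       # highest score first
    if top_k is not None:
        order = order[:top_k]
    return [(int(i), float(scores[i])) for i in order]

# Toy vectors standing in for real embeddings.
toy_query = [1.0, 0.0, 0.0]
toy_docs = [[0.0, 1.0, 0.0], [2.0, 0.0, 0.0], [1.0, 1.0, 0.0]]
ranking = rank_by_cosine(toy_query, toy_docs)
print([i for i, _ in ranking])  # [1, 2, 0]
```

Document vectors only need to be computed and normalized once, so in practice they can be stored and reused across queries.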

Find more details in the original model card: https://huggingface.co/LiquidAI/LFM2.5-Embedding-350M

Model size: 0.4B params
Architecture: lfm2-bidir
Available quantizations (GGUF): 4-bit, 5-bit, 6-bit, 8-bit, 16-bit
