metadata
language: ar
license: apache-2.0
library_name: sentence-transformers
tags:
- sentence-transformers
- sparse-encoder
- splade
- arabic
- retrieval
datasets:
- oddadmix/arabic-triplets-large
base_model: distilbert-base-multilingual-cased
metrics:
- ndcg@10
- mrr@10
Arabic SPLADE — Phase 3
Efficient symmetric SPLADE using DistilBERT multilingual for faster inference.
Architecture
Symmetric: queries and documents share one encoder, an MLMTransformer followed by SpladePooling, applied sequentially.
Base model: distilbert-base-multilingual-cased
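To make the pooling step concrete, here is a minimal pure-Python sketch of SPLADE-style max pooling over masked-language-model logits. It assumes the common formulation (ReLU, then log(1 + x), then a max over input tokens); the card does not state which pooling strategy was used, and the function name and toy logits are illustrative only.

```python
import math

def splade_max_pool(token_logits):
    """Collapse per-token MLM logits into one sparse vocabulary vector.

    token_logits: list of rows, one per input token; each row holds
    one logit per vocabulary entry. (Hypothetical helper, not library code.)
    """
    vocab_size = len(token_logits[0])
    pooled = [0.0] * vocab_size
    for row in token_logits:
        for j, logit in enumerate(row):
            # ReLU zeroes negative logits; log1p dampens large ones.
            weight = math.log1p(max(logit, 0.0))
            # Keep the strongest activation seen for this vocabulary entry.
            pooled[j] = max(pooled[j], weight)
    return pooled

# Toy input: 2 tokens, 4-entry vocabulary (values are made up).
logits = [
    [2.0, -1.0, 0.0, 0.5],
    [-3.0, -0.5, 1.0, 0.0],
]
vector = splade_max_pool(logits)
active = [j for j, w in enumerate(vector) if w > 0]
print(active)
```

Entries whose logits are never positive stay at exactly zero, which is where the sparsity of the final vector comes from.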
Training
- Dataset: oddadmix/arabic-triplets-large (104K triplets, 92K unique passages)
- Loss: SpladeLoss(SparseMultipleNegativesRankingLoss, q_reg=5e-5, d_reg=3e-5)
- Batch: 16 per GPU, grad accum 4
- Learning rate: 2e-5
- Epochs: 1
- AMP: fp16
- Sampler: NO_DUPLICATES
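The loss entry above pairs a ranking loss with two sparsity weights. As a rough sketch of how such a combination can work, the snippet below assumes the FLOPS regularizer from the SPLADE literature (squared mean activation per vocabulary entry, summed over the vocabulary) and reads q_reg and d_reg as the query and document weights. The helper names and toy vectors are hypothetical, and the actual training code may differ.

```python
def flops_penalty(batch_vectors):
    """FLOPS regularizer: sum over vocab of (mean activation in batch)^2."""
    n = len(batch_vectors)
    vocab_size = len(batch_vectors[0])
    total = 0.0
    for j in range(vocab_size):
        mean_j = sum(vec[j] for vec in batch_vectors) / n
        total += mean_j ** 2
    return total

def splade_total_loss(ranking_loss, query_vecs, doc_vecs,
                      q_reg=5e-5, d_reg=3e-5):
    """Ranking loss plus weighted sparsity penalties (illustrative only)."""
    return (ranking_loss
            + q_reg * flops_penalty(query_vecs)
            + d_reg * flops_penalty(doc_vecs))

# Toy batch: 2 queries and 2 documents over a 3-entry vocabulary.
queries = [[1.0, 0.0, 0.0], [1.0, 0.0, 2.0]]
docs = [[0.0, 2.0, 0.0], [0.0, 2.0, 0.0]]
total = splade_total_loss(0.5, queries, docs)
print(flops_penalty(queries), flops_penalty(docs), total)
```

Because the penalty squares the batch-mean activation, vocabulary entries that fire for many inputs are punished more than entries that fire rarely, which pushes the model toward sparse, selective expansions.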
Evaluation on Arabic NanoBEIR (13 datasets)
| Metric | Score |
|---|---|
| NDCG@10 | 0.2528 |
| MRR@10 | 0.3052 |
For reference: BM25 scores 0.3824 NDCG@10, 0.4483 MRR@10 on the same benchmark.
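For readers unfamiliar with the two metrics, here is a small self-contained sketch of how NDCG@10 and MRR@10 can be computed for a single query. It uses linear gain for NDCG (some tools use 2^rel - 1 instead), and the document IDs and relevance labels are invented; benchmark scores like those above are presumably averages over all queries and datasets.

```python
import math

def mrr_at_k(ranked_ids, relevant_ids, k=10):
    """Reciprocal rank of the first relevant hit within the top k."""
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(ranked_ids, relevance, k=10):
    """relevance maps doc_id -> graded relevance; missing IDs count as 0."""
    dcg = sum(relevance.get(doc_id, 0) / math.log2(i + 1)
              for i, doc_id in enumerate(ranked_ids[:k], start=1))
    ideal = sorted(relevance.values(), reverse=True)[:k]
    idcg = sum(rel / math.log2(i + 1)
               for i, rel in enumerate(ideal, start=1))
    return dcg / idcg if idcg > 0 else 0.0

# Toy query: two relevant documents, only one of them retrieved (at rank 2).
ranked = ["d3", "d1", "d7"]
relevance = {"d1": 1, "d9": 1}
mrr = mrr_at_k(ranked, set(relevance))
ndcg = ndcg_at_k(ranked, relevance)
print(mrr, ndcg)
```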
Training Details
The base model is DistilBERT multilingual (6 layers, 119K-token vocabulary), roughly 2x faster than AraBERT at inference.
Hardware
- 2× NVIDIA TITAN RTX (23.5 GB each)
- DDP via torchrun
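Combining this hardware with the batch settings listed under Training gives the effective batch size per optimizer step. The arithmetic below is a back-of-the-envelope sketch: the triplet count is the rounded "104K" figure from the card, so the step count is approximate.

```python
import math

per_gpu_batch = 16       # from the Training section
num_gpus = 2             # 2x TITAN RTX under DDP
grad_accum_steps = 4     # from the Training section

# Examples contributing to each optimizer update.
effective_batch = per_gpu_batch * num_gpus * grad_accum_steps

num_triplets = 104_000   # rounded "104K"; the exact count is not stated
steps_per_epoch = math.ceil(num_triplets / effective_batch)
print(effective_batch, steps_per_epoch)
```

Note that with an in-batch-negatives ranking loss, each example is typically contrasted only against the other examples in its own per-GPU batch of 16, unless embeddings are gathered across devices; gradient accumulation enlarges the update, not the pool of negatives.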
Usage
from sentence_transformers.sparse_encoder import SparseEncoder
model = SparseEncoder("Waqf-AI/arabic-splade-efficient")
embeddings = model.encode([
"ما هي عاصمة مصر؟",
"القاهرة هي عاصمة مصر وأكبر مدنها.",
])
print(embeddings.shape)
# Decode top tokens
decoded = model.decode(embeddings, top_k=10)
for d in decoded:
    print(d)
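Retrieval with sparse embeddings comes down to a dot product over the tokens that query and document share. The sketch below illustrates this on decoded output, assuming each decoded entry is a list of (token, weight) pairs. The tokens and weights are invented for illustration; real output would contain wordpieces from the multilingual vocabulary, and scoring on a top_k truncation only approximates the full dot product.

```python
def sparse_dot(query_terms, doc_terms):
    """Dot product of two sparse vectors given as (token, weight) pairs."""
    doc_weights = dict(doc_terms)
    # Only tokens present on both sides contribute to the score.
    return sum(w * doc_weights.get(tok, 0.0) for tok, w in query_terms)

# Hypothetical decoded expansions (weights are made up).
query_terms = [("عاصمة", 1.2), ("مصر", 1.5)]
doc_terms = [("القاهرة", 1.8), ("عاصمة", 1.0), ("مصر", 1.4)]

score = sparse_dot(query_terms, doc_terms)
print(score)
```

Because only overlapping tokens matter, these vectors can be served from an inverted index in the same way as BM25 term weights.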