---
language: ar
license: apache-2.0
library_name: sentence-transformers
tags:
- sentence-transformers
- sparse-encoder
- splade
- arabic
- retrieval
datasets:
- oddadmix/arabic-triplets-large
base_model: distilbert-base-multilingual-cased
metrics:
- ndcg@10
- mrr@10
---

# Arabic SPLADE: Phase 3

An efficient symmetric SPLADE model built on multilingual DistilBERT for faster inference.

## Architecture

Symmetric, shared encoder: queries and passages pass through the same sequential `MLMTransformer` + `SpladePooling` stack.

**Base model:** distilbert-base-multilingual-cased

## Training

- **Dataset:** `oddadmix/arabic-triplets-large` (104K triplets, 92K unique passages)
- **Loss:** `SpladeLoss(SparseMultipleNegativesRankingLoss, q_reg=5e-5, d_reg=3e-5)`
- **Batch:** 16 per GPU, gradient accumulation 4
- **Learning rate:** 2e-5
- **Epochs:** 1
- **AMP:** fp16
- **Sampler:** NO_DUPLICATES

## Evaluation on Arabic NanoBEIR (13 datasets)

| Metric | Score |
|--------|-------|
| NDCG@10 | 0.2528 |
| MRR@10 | 0.3052 |

For reference, BM25 scores 0.3824 NDCG@10 and 0.4483 MRR@10 on the same benchmark, so this model currently trails the lexical baseline on both metrics.

## Training Details

The base model is multilingual DistilBERT (6 layers, 119K-token vocabulary), roughly 2x faster than AraBERT.

### Hardware

- 2× NVIDIA TITAN RTX (23.5 GB each)
- DDP via `torchrun`

## Usage

```python
from sentence_transformers.sparse_encoder import SparseEncoder

model = SparseEncoder("Abdelkareem/arabic-splade-efficient")

# Encode a query and a passage into sparse vocabulary-sized vectors
embeddings = model.encode([
    "ما هي عاصمة مصر؟",  # "What is the capital of Egypt?"
    "القاهرة هي عاصمة مصر وأكبر مدنها.",  # "Cairo is the capital of Egypt and its largest city."
])
print(embeddings.shape)

# Decode the top-weighted tokens of each embedding
decoded = model.decode(embeddings, top_k=10)
for d in decoded:
    print(d)
```
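
To make the architecture concrete, here is a minimal pure-Python sketch of the usual SPLADE pooling step and of how a query is scored against a passage. It is not this model's code. It assumes the common `log(1 + ReLU(logit))` activation with max pooling over tokens, which this card does not state explicitly, and the helper names `splade_pool` and `sparse_dot` and the toy logits are hypothetical.

```python
import math


def splade_pool(token_logits, attention_mask):
    """Collapse per-token MLM logits into one vocabulary-sized sparse vector.

    token_logits: list of per-token lists, shape [seq_len][vocab_size]
    attention_mask: list of 0/1 flags, length seq_len
    Assumption: max pooling over tokens of log(1 + ReLU(logit)).
    """
    vocab_size = len(token_logits[0])
    pooled = [0.0] * vocab_size
    for logits, keep in zip(token_logits, attention_mask):
        if not keep:
            continue  # padding tokens contribute nothing
        for j, z in enumerate(logits):
            # Negative logits become 0; large positive logits are dampened
            weight = math.log1p(max(z, 0.0))
            if weight > pooled[j]:
                pooled[j] = weight
    return pooled


def sparse_dot(a, b):
    """Relevance score: dot product of two vocabulary-sized vectors."""
    return sum(x * y for x, y in zip(a, b))


# Toy logits over a 4-term vocabulary (made-up numbers, not model output)
query_logits = [[2.0, -1.0, 0.0, 0.5], [0.0, -3.0, 0.0, 1.5]]
passage_logits = [
    [1.0, 0.0, -2.0, 0.0],
    [3.0, -1.0, 0.0, 2.0],
    [9.0, 9.0, 9.0, 9.0],  # padding position, masked out below
]

# Symmetric setup: the same pooling function serves queries and passages
q = splade_pool(query_logits, [1, 1])
p = splade_pool(passage_logits, [1, 1, 0])
score = sparse_dot(q, p)
print(q, p, score)
```

Only vocabulary terms that receive a positive logit in both texts contribute to the score. This is why the `q_reg` and `d_reg` sparsity penalties in the training loss matter: they control how many non-zero terms each vector keeps.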