---
language: ar
license: apache-2.0
library_name: sentence-transformers
tags:
- sentence-transformers
- sparse-encoder
- splade
- arabic
- retrieval
datasets:
- oddadmix/arabic-triplets-large
base_model: distilbert-base-multilingual-cased
metrics:
- ndcg@10
- mrr@10
---

# Arabic SPLADE — Phase 3

An efficiency-focused symmetric SPLADE sparse encoder for Arabic retrieval, built on multilingual DistilBERT for faster inference.

## Architecture

Symmetric, shared-weight encoder: queries and passages both pass through the same `MLMTransformer` module followed by `SpladePooling`, applied sequentially.

**Base model:** distilbert-base-multilingual-cased
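
To make the pooling step concrete, here is a minimal pure-Python sketch of SPLADE-style pooling. It assumes the common max-pooling variant, where each vocabulary weight is the maximum of `log(1 + ReLU(logit))` over the non-padding token positions. The function name `splade_pool` and the toy logits are hypothetical, and this is not the library's implementation.

```python
import math

def splade_pool(token_logits, attention_mask):
    """Max-pool log(1 + ReLU(logit)) over non-padding tokens, per vocabulary entry."""
    vocab_size = len(token_logits[0])
    pooled = [0.0] * vocab_size
    for logits, keep in zip(token_logits, attention_mask):
        if not keep:
            continue  # skip padding positions
        for v, logit in enumerate(logits):
            weight = math.log1p(max(logit, 0.0))  # negative logits contribute 0
            if weight > pooled[v]:
                pooled[v] = weight
    return pooled

# Toy input (hypothetical): 3 token positions, the last one is padding,
# and a vocabulary of only 4 entries instead of ~119K.
toy_logits = [
    [2.0, -1.0, 0.0, 0.5],
    [0.5, -3.0, 0.0, 1.5],
    [9.0, 9.0, 9.0, 9.0],
]
toy_mask = [1, 1, 0]
sparse_vec = splade_pool(toy_logits, toy_mask)
print(sparse_vec)
```

Entries whose logits are never positive stay at exactly zero, which is where the sparsity of the output vector comes from.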

## Training

- **Dataset:** `oddadmix/arabic-triplets-large` (104K triplets, 92K unique passages)
- **Loss:** `SpladeLoss(SparseMultipleNegativesRankingLoss, q_reg=5e-5, d_reg=3e-5)`
- **Batch:** 16 per GPU, gradient accumulation 4 (effective batch size 128 across 2 GPUs)
- **Learning rate:** 2e-5
- **Epochs:** 1
- **AMP:** fp16
- **Sampler:** NO_DUPLICATES
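
The two regularization weights in the loss line can be illustrated with a small sketch. It assumes the sparsity penalty is the FLOPS-style regularizer used in SPLADE training (the squared mean activation per vocabulary entry, summed over the vocabulary) and that it is added to the ranking loss with the `q_reg` and `d_reg` weights listed above. The helper names and toy vectors are hypothetical.

```python
Q_REG = 5e-5  # query regularization weight from the training config
D_REG = 3e-5  # document regularization weight from the training config

def flops_penalty(batch_vectors):
    """Sum over vocabulary entries of the squared batch-mean activation."""
    n = len(batch_vectors)
    vocab_size = len(batch_vectors[0])
    total = 0.0
    for j in range(vocab_size):
        mean_j = sum(vec[j] for vec in batch_vectors) / n
        total += mean_j ** 2
    return total

def regularized_loss(ranking_loss, query_vectors, doc_vectors):
    """Ranking loss plus weighted sparsity penalties for queries and documents."""
    return (
        ranking_loss
        + Q_REG * flops_penalty(query_vectors)
        + D_REG * flops_penalty(doc_vectors)
    )

# Toy batch (hypothetical): 2 queries and 2 documents over a 2-entry vocabulary.
queries = [[1.0, 0.0], [3.0, 0.0]]
docs = [[0.0, 2.0], [0.0, 2.0]]
total_loss = regularized_loss(1.0, queries, docs)
print(total_loss)
```

A larger weight on the query side, as configured here, pushes query vectors to be sparser than document vectors.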

## Evaluation on Arabic NanoBEIR (13 datasets)

| Metric | Score |
|--------|-------|
| NDCG@10 | 0.2528 |
| MRR@10 | 0.3052 |

For reference: BM25 scores 0.3824 NDCG@10, 0.4483 MRR@10 on the same benchmark.
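
For readers unfamiliar with the two metrics, the following sketch shows how NDCG@10 and MRR@10 are computed for a single query. It assumes binary relevance labels; the document IDs are hypothetical, and the benchmark's own evaluator may differ in details such as graded relevance or averaging across datasets.

```python
import math

def mrr_at_k(ranked_ids, relevant_ids, k=10):
    """Reciprocal rank of the first relevant document in the top k, else 0."""
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(ranked_ids, relevant_ids, k=10):
    """Binary-relevance NDCG: discounted gain divided by the ideal discounted gain."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc_id in enumerate(ranked_ids[:k], start=1)
        if doc_id in relevant_ids
    )
    ideal_hits = min(len(relevant_ids), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0

# Hypothetical ranking: the single relevant passage is retrieved at rank 2.
ranked = ["d3", "d1", "d7"]
relevant = {"d1"}
print(mrr_at_k(ranked, relevant), ndcg_at_k(ranked, relevant))
```

The reported scores are these per-query values averaged over the benchmark's queries.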

## Training Details

The base model is multilingual DistilBERT (6 layers, 119K-token vocabulary), which is roughly 2x faster than AraBERT.

### Hardware
- 2× NVIDIA TITAN RTX (23.5 GB each)
- DDP via `torchrun`

## Usage

```python
from sentence_transformers.sparse_encoder import SparseEncoder

model = SparseEncoder("Abdelkareem/arabic-splade-efficient")
embeddings = model.encode([
    "ما هي عاصمة مصر؟",
    "القاهرة هي عاصمة مصر وأكبر مدنها.",
])
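print(embeddings.shape)  # one row per input text, one column per vocabulary entry

# Decode the 10 highest-weighted vocabulary tokens for each text
decoded = model.decode(embeddings, top_k=10)
for d in decoded:
    print(d)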
```
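
Relevance between a query and a passage is typically scored as the dot product of their sparse vectors. The sketch below illustrates this on token-to-weight dictionaries similar in spirit to the decoded output above. The function name `sparse_dot` and all weights are hypothetical and were not produced by the model.

```python
def sparse_dot(query_weights, doc_weights):
    """Dot product of two sparse vectors stored as token -> weight dicts."""
    # Iterate over the smaller dict so the cost depends on the sparser vector.
    if len(query_weights) > len(doc_weights):
        query_weights, doc_weights = doc_weights, query_weights
    return sum(
        weight * doc_weights[token]
        for token, weight in query_weights.items()
        if token in doc_weights
    )

# Hypothetical token weights for illustration only.
query = {"عاصمة": 1.2, "مصر": 0.8}
passage = {"القاهرة": 1.5, "عاصمة": 1.0, "مصر": 0.5}
score = sparse_dot(query, passage)
print(score)
```

Only tokens active in both vectors contribute to the score, which is what makes sparse representations compatible with inverted-index retrieval.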