---
pipeline_tag: sentence-similarity
language: fr
license: mit
datasets:
  - unicamp-dl/mmarco
metrics:
  - recall
tags:
  - passage-retrieval
library_name: transformers
base_model: almanach/camembert-base
model-index:
  - name: spladev2-camembert-base-mmarcoFR
    results:
      - task:
          type: sentence-similarity
          name: Passage Retrieval
        dataset:
          type: unicamp-dl/mmarco
          name: mMARCO-fr
          config: french
          split: validation
        metrics:
          - type: recall_at_1000
            name: Recall@1000
            value: 89.86
          - type: recall_at_500
            name: Recall@500
            value: 85.96
          - type: recall_at_100
            name: Recall@100
            value: 73.94
          - type: recall_at_10
            name: Recall@10
            value: 46.33
          - type: map_at_10
            name: MAP@10
            value: 24.15
          - type: ndcg_at_10
            name: nDCG@10
            value: 29.58
          - type: mrr_at_10
            name: MRR@10
            value: 24.68
---
# spladev2-camembert-base-mmarcoFR

This is a [SPLADE-max](https://doi.org/10.48550/arXiv.2109.10086) model for **French** that can be used for semantic search. The model maps queries and passages to 32k-dimensional sparse vectors, which are used to compute relevance through cosine similarity.

## Usage

Start by installing the [library](https://huggingface.co/docs/transformers): `pip install -U transformers`. Then, you can use the model like this:
```python
import torch
from torch.nn.functional import relu, normalize
from transformers import AutoTokenizer, AutoModelForMaskedLM

queries = ["Ceci est un exemple de requête.", "Voici un second exemple."]
passages = ["Ceci est un exemple de passage.", "Et voilà un deuxième exemple."]

# The masked-LM head is required: SPLADE pools its vocabulary logits.
tokenizer = AutoTokenizer.from_pretrained('antoinelouis/spladev2-camembert-base-mmarcoFR')
model = AutoModelForMaskedLM.from_pretrained('antoinelouis/spladev2-camembert-base-mmarcoFR')

q_input = tokenizer(queries, padding=True, truncation=True, return_tensors='pt')
p_input = tokenizer(passages, padding=True, truncation=True, return_tensors='pt')

with torch.no_grad():
    q_output = model(**q_input)
    p_output = model(**p_input)

# SPLADE-max pooling: zero out padding positions, apply log(1 + ReLU(.)),
# then take the maximum over the token dimension.
q_activations = torch.amax(torch.log1p(relu(q_output.logits * q_input['attention_mask'].unsqueeze(-1))), dim=1)
p_activations = torch.amax(torch.log1p(relu(p_output.logits * p_input['attention_mask'].unsqueeze(-1))), dim=1)

# L2-normalize so that the dot product below equals the cosine similarity.
q_embeddings = normalize(q_activations, p=2, dim=1)
p_embeddings = normalize(p_activations, p=2, dim=1)

similarity = q_embeddings @ p_embeddings.T
print(similarity)
```
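To make the pooling step in the snippet above easier to follow, here is a minimal pure-Python sketch of SPLADE-max pooling on a toy example. The function name `splade_max`, the 4-term vocabulary, and the logit values are illustrative assumptions, not part of the model or the library.

```python
import math

def splade_max(logits, attention_mask):
    """Pool per-token vocabulary logits into a single sparse vector.

    logits: one row per token, each row of length vocab_size.
    attention_mask: one 0/1 flag per token (0 marks padding).
    """
    vocab_size = len(logits[0])
    pooled = [0.0] * vocab_size
    for row, keep in zip(logits, attention_mask):
        if not keep:
            continue  # padding tokens contribute nothing
        for j, value in enumerate(row):
            weight = math.log1p(max(value, 0.0))  # log(1 + ReLU(logit))
            pooled[j] = max(pooled[j], weight)    # max over tokens
    return pooled

# Hypothetical logits for 3 token positions over a 4-term vocabulary.
toy_logits = [
    [2.0, -1.0, 0.0, 0.5],
    [0.5, -3.0, 1.0, -0.2],
    [9.0, 9.0, 9.0, 9.0],  # padding position, ignored via the mask
]
toy_mask = [1, 1, 0]

vector = splade_max(toy_logits, toy_mask)
print(vector)
```

Vocabulary terms whose logits are never positive end up with a weight of exactly zero, which is what makes the resulting vectors sparse.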
## Evaluation

The model is evaluated on the smaller development set of [mMARCO-fr](https://ir-datasets.com/mmarco.html#mmarco/v2/fr/), which consists of 6,980 queries for a corpus of 8.8M candidate passages. We report the mean reciprocal rank (MRR), normalized discounted cumulative gain (NDCG), mean average precision (MAP), and recall at various cut-offs (R@k). To see how it compares to other neural retrievers in French, check out the [*DécouvrIR*](https://huggingface.co/spaces/antoinelouis/decouvrir) leaderboard.
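As a rough illustration of two of the reported metrics, the sketch below computes MRR@k and Recall@k from ranked passage lists. The helper names, the query and passage IDs, and the relevance judgments are all hypothetical; this is not the evaluation script used for the results above.

```python
def reciprocal_rank_at_k(ranked_ids, relevant_ids, k=10):
    """Return 1/rank of the first relevant passage in the top k, else 0."""
    for rank, pid in enumerate(ranked_ids[:k], start=1):
        if pid in relevant_ids:
            return 1.0 / rank
    return 0.0

def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the relevant passages that appear in the top k."""
    if not relevant_ids:
        return 0.0
    hits = len(set(ranked_ids[:k]) & set(relevant_ids))
    return hits / len(relevant_ids)

# Hypothetical retrieval runs and relevance judgments.
runs = {"q1": ["p3", "p7", "p1"], "q2": ["p9", "p2", "p4"]}
qrels = {"q1": {"p7"}, "q2": {"p4", "p8"}}

# Both metrics are averaged over queries.
mrr_at_10 = sum(reciprocal_rank_at_k(runs[q], qrels[q], k=10) for q in runs) / len(runs)
recall_at_3 = sum(recall_at_k(runs[q], qrels[q], k=3) for q in runs) / len(runs)
print(mrr_at_10, recall_at_3)
```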
## Training

#### Data

The model is trained on the French training samples of the [mMARCO](https://huggingface.co/datasets/unicamp-dl/mmarco) dataset, a multilingual machine-translated version of MS MARCO that contains 8.8M passages and 539K training queries. We sample 12.8M (q, p+, p-) triples from the official ~39.8M [training triples](https://microsoft.github.io/msmarco/Datasets.html#passage-ranking-dataset) with BM25 negatives.
#### Implementation

The model is initialized from the [almanach/camembert-base](https://huggingface.co/almanach/camembert-base) checkpoint and optimized via a combination of the InfoNCE ranking loss with a temperature of 0.05 and the FLOPS regularization loss. The regularization weight lambda increases quadratically until step 33k, after which it remains constant at lambda_q=3e-4 for queries and lambda_d=1e-4 for passages. The model is fine-tuned on one 80GB NVIDIA H100 GPU for 100k steps using the AdamW optimizer with a batch size of 128 and a peak learning rate of 2e-5, with warm-up over the first 4,000 steps and linear scheduling. The maximum sequence lengths for queries and passages are fixed to 32 and 128 tokens, respectively. Relevance scores are computed with cosine similarity.
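The training objective described above can be sketched in plain Python as follows. This is a simplified illustration under stated assumptions, not the actual training code: the function names are hypothetical, the quadratic ramp is assumed to take the form `lambda_max * (step / 33000) ** 2`, and the FLOPS term follows its usual definition as the sum over vocabulary terms of the squared mean activation in the batch.

```python
import math

def flops_lambda(step, lambda_max, ramp_steps=33_000):
    """Regularization weight: quadratic ramp-up, then constant (assumed form)."""
    progress = min(step / ramp_steps, 1.0)
    return lambda_max * progress ** 2

def flops_loss(batch_vectors):
    """Sum over vocabulary terms of the squared mean activation in the batch."""
    n = len(batch_vectors)
    vocab_size = len(batch_vectors[0])
    return sum(
        (sum(vec[j] for vec in batch_vectors) / n) ** 2
        for j in range(vocab_size)
    )

def info_nce(scores, positive_index=0, temperature=0.05):
    """Cross-entropy of the positive passage against all candidate scores."""
    scaled = [s / temperature for s in scores]
    m = max(scaled)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(s - m) for s in scaled))
    return log_z - scaled[positive_index]

def total_loss(scores, q_vectors, p_vectors, step):
    """Ranking loss plus weighted FLOPS terms for queries and passages."""
    return (
        info_nce(scores)
        + flops_lambda(step, 3e-4) * flops_loss(q_vectors)
        + flops_lambda(step, 1e-4) * flops_loss(p_vectors)
    )

# Toy values: cosine scores for one query against [positive, negative].
example = total_loss(
    scores=[0.9, 0.1],
    q_vectors=[[1.0, 0.0], [3.0, 0.0]],
    p_vectors=[[0.5, 0.5], [0.5, 0.5]],
    step=16_500,
)
print(example)
```

With a temperature of 0.05, a cosine gap of 0.8 between the positive and the negative passage becomes a logit gap of 16, so the ranking term is already close to zero in this toy case and the regularization terms dominate.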
## Citation

```bibtex
@online{louis2024decouvrir,
    author    = {Antoine Louis},
    title     = {DécouvrIR: A Benchmark for Evaluating the Robustness of Information Retrieval Models in French},
    publisher = {Hugging Face},
    month     = {mar},
    year      = {2024},
    url       = {https://huggingface.co/spaces/antoinelouis/decouvrir},
}
```