---
license: apache-2.0
datasets:
- cnmoro/GPT4-500k-Augmented-PTBR-Clean
- cnmoro/WizardVicuna-PTBR-Instruct-Clean
- sentence-transformers/natural-questions
language:
- en
- pt
base_model:
- Luyu/co-condenser-marco
pipeline_tag: sentence-similarity
library_name: sentence-transformers
tags:
- splade
- sparse
- embeddings
- bert
- portuguese
- pt
- ptbr
---

# SPLADE (co-condenser-marco) finetuned on merged NQ + PT-BR instruction datasets

This model is a sparse retriever trained with Sentence Transformers' SPLADE stack, using a merged multilingual corpus (English + Portuguese).

## Usage

```python
from sentence_transformers import SparseEncoder

sparse_model = SparseEncoder("cnmoro/inference-free-splade-co-condenser-en-ptbr-v2")
sparse_embeddings = sparse_model.encode(["Hello", "World"], show_progress_bar=True)
```

## Base model

- `Luyu/co-condenser-marco`

## Training date

- Training completed on **June 4, 2026**.

## Dataset composition

The training corpus was built by row-wise concatenation of:

1. `sentence-transformers/natural-questions`
2. `cnmoro/GPT4-500k-Augmented-PTBR-Clean`
3. `cnmoro/WizardVicuna-PTBR-Instruct-Clean`

Final merged size:

- Total rows: **869,365**
- Train rows: **868,365**
- Eval rows: **1,000**
- Split seed: `12`

## Training objective

- Loss: `SpladeLoss(SparseMultipleNegativesRankingLoss)`
- Document regularizer weight: `0.03`
- Query regularizer weight: `0`

## Core hyperparameters

- Epochs: **3**
- Per-device batch size: **32**
- Max sequence length: **128**
- SPLADE pooling chunk size: **64**
- Learning rate: `2e-5`
- Warmup ratio: `0.1`
- Mixed precision: `fp16=True`
- Batch sampler: `NO_DUPLICATES`
- Router mapping: `query -> query`, `answer -> document`

### Training Hyperparameters

#### Non-Default Hyperparameters

- `learning_rate`: 2e-05
- `lr_scheduler_type`: cosine
- `warmup_steps`: 0.1
- `weight_decay`: 0.01
- `gradient_accumulation_steps`: 4
- `max_grad_norm`: 5.0
- `fp16`: True
- `disable_tqdm`: True
- `dataloader_num_workers`: 4
- `batch_sampler`: no_duplicates
- `router_mapping`: {'query': 'query', 'answer': 'document'}
- `learning_rate_mapping`: {'\\.query\\.0\\.': 0.001}

### Training Time

- **Training**: 11.7 hours

### Framework Versions

- Python: 3.11.10
- Sentence Transformers: 5.5.0
- Transformers: 5.8.1
- PyTorch: 2.6.0+cu124
- Accelerate: 1.13.0
- Datasets: 4.5.0
- Tokenizers: 0.22.2

## Additional Resources

- [Training and Finetuning Sparse Embedding Models with Sentence Transformers](https://huggingface.co/blog/train-sparse-encoder): the end-to-end guide for training or finetuning SPLADE and other sparse encoder models.
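
## Estimating optimizer steps

The hyperparameters listed above imply a rough training length, which the sketch below works out. It assumes a single training device (`num_devices = 1`), which this card does not state. It also assumes every training row ends up in a batch, which the `NO_DUPLICATES` sampler does not guarantee. Treat the result as a back-of-envelope estimate, not the exact count the trainer reported.

```python
import math

# Values copied from this model card.
train_rows = 868_365
per_device_batch_size = 32
gradient_accumulation_steps = 4
epochs = 3
warmup_ratio = 0.1

# Assumption (not stated in the card): training ran on one device.
num_devices = 1

# Rows consumed per optimizer update.
effective_batch_size = per_device_batch_size * gradient_accumulation_steps * num_devices

# Forward/backward batches per epoch, then optimizer updates per epoch.
batches_per_epoch = math.ceil(train_rows / (per_device_batch_size * num_devices))
steps_per_epoch = math.ceil(batches_per_epoch / gradient_accumulation_steps)

total_steps = steps_per_epoch * epochs
warmup_steps = math.ceil(total_steps * warmup_ratio)

print(f"effective batch size: {effective_batch_size}")
print(f"optimizer steps per epoch: {steps_per_epoch}")
print(f"total optimizer steps: {total_steps}")
print(f"warmup steps: {warmup_steps}")
```

Under these assumptions, each update sees 128 rows and the run takes roughly 20,000 optimizer steps, with about the first tenth spent in learning-rate warmup. With more devices, the effective batch size grows and the step count shrinks proportionally.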
## Citation

### BibTeX

#### Sentence Transformers

```bibtex
@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}
```

#### SpladeLoss

```bibtex
@misc{formal2022distillationhardnegativesampling,
    title={From Distillation to Hard Negative Sampling: Making Sparse Neural IR Models More Effective},
    author={Thibault Formal and Carlos Lassance and Benjamin Piwowarski and Stéphane Clinchant},
    year={2022},
    eprint={2205.04733},
    archivePrefix={arXiv},
    primaryClass={cs.IR},
    url={https://arxiv.org/abs/2205.04733},
}
```

#### SparseMultipleNegativesRankingLoss

```bibtex
@misc{oord2019representationlearningcontrastivepredictive,
    title={Representation Learning with Contrastive Predictive Coding},
    author={Aaron van den Oord and Yazhe Li and Oriol Vinyals},
    year={2019},
    eprint={1807.03748},
    archivePrefix={arXiv},
    primaryClass={cs.LG},
    url={https://arxiv.org/abs/1807.03748},
}
```

#### FlopsLoss

```bibtex
@article{paria2020minimizing,
    title={Minimizing flops to learn efficient sparse representations},
    author={Paria, Biswajit and Yeh, Chih-Kuan and Yen, Ian EH and Xu, Ning and Ravikumar, Pradeep and P{\'o}czos, Barnab{\'a}s},
    journal={arXiv preprint arXiv:2004.05665},
    year={2020}
}
```
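
## Sketch of the regularized objective

The training objective on this card combines a ranking loss with sparsity regularizers weighted `0.03` for documents and `0` for queries. The toy sketch below shows how such a combination could be computed, assuming the regularizer is the FLOPS penalty from the Paria et al. reference: the sum, over vocabulary dimensions, of the squared mean activation across the batch. The function names `flops_regularizer` and `regularized_loss`, the toy vectors, and the placeholder ranking loss value are all hypothetical. This is plain Python for illustration, not the library's implementation.

```python
def flops_regularizer(embeddings):
    """Sum over vocabulary dimensions of the squared mean absolute activation."""
    batch_size = len(embeddings)
    vocab_size = len(embeddings[0])
    total = 0.0
    for j in range(vocab_size):
        mean_activation = sum(abs(row[j]) for row in embeddings) / batch_size
        total += mean_activation ** 2
    return total


def regularized_loss(ranking_loss, query_embeddings, document_embeddings,
                     document_weight=0.03, query_weight=0.0):
    """Ranking loss plus weighted sparsity penalties (weights taken from this card)."""
    return (
        ranking_loss
        + document_weight * flops_regularizer(document_embeddings)
        + query_weight * flops_regularizer(query_embeddings)
    )


# Toy batch: 2 items over a 4-term vocabulary (made-up numbers).
document_embeddings = [[1.0, 0.0, 2.0, 0.0],
                       [3.0, 0.0, 0.0, 0.0]]
query_embeddings = [[0.5, 0.0, 0.0, 0.0],
                    [0.0, 0.0, 1.0, 0.0]]

placeholder_ranking_loss = 1.0  # stands in for the contrastive ranking term
doc_penalty = flops_regularizer(document_embeddings)
loss = regularized_loss(placeholder_ranking_loss, query_embeddings, document_embeddings)
print(doc_penalty, loss)
```

Because the penalty squares the per-term batch mean, terms that fire across many documents cost more than terms that fire rarely, which pushes document vectors toward sparsity. With a query weight of `0`, the query side adds no penalty in this sketch.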