SPLADE (co-condenser-marco) finetuned on merged NQ + PT-BR instruction datasets

This model is a sparse retriever finetuned from Luyu/co-condenser-marco with the Sentence Transformers SPLADE training stack on a merged bilingual corpus (English and Brazilian Portuguese).

Usage

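from sentence_transformers import SparseEncoder

# Load the finetuned sparse encoder from the Hugging Face Hub
sparse_model = SparseEncoder("cnmoro/inference-free-splade-co-condenser-en-ptbr-v2")

# Encode a batch of texts; each embedding has one dimension per vocabulary term
sparse_embeddings = sparse_model.encode(["Hello", "World"], show_progress_bar=True)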
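
Sparse embeddings of this kind are typically compared with a dot product, so only vocabulary terms active in both the query and the document contribute to the score. The snippet below is a minimal, library-free sketch of that scoring step. The token names and weights are made up for illustration and are not outputs of this model.

```python
def sparse_dot(query_weights, doc_weights):
    """Dot product of two sparse vectors stored as {term: weight} dicts."""
    # Iterate over the smaller dict so the cost scales with the sparser side.
    small, large = sorted((query_weights, doc_weights), key=len)
    return sum(weight * large[term] for term, weight in small.items() if term in large)


# Hypothetical term weights, for illustration only.
query = {"capital": 1.2, "brasil": 0.8}
documents = {
    "doc_a": {"brasil": 1.5, "capital": 0.5, "brasilia": 2.0},
    "doc_b": {"weather": 1.0},
}

# Rank documents by descending score.
scores = {name: sparse_dot(query, weights) for name, weights in documents.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
```

Here `doc_a` outranks `doc_b` because it shares two active terms with the query, while `doc_b` shares none.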

Base model

Luyu/co-condenser-marco

Training date

Training completed on June 4, 2026.

Dataset composition

The training corpus was built by row-wise concatenation of:

sentence-transformers/natural-questions
cnmoro/GPT4-500k-Augmented-PTBR-Clean
cnmoro/WizardVicuna-PTBR-Instruct-Clean

Final merged size:

Total rows: 869,365
Train rows: 868,365
Eval rows: 1,000
Split seed: 12
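
The merge-and-split procedure above can be sketched in plain Python as follows. This is an illustration of the idea only, not the script used to build the corpus: the helper name `merge_and_split` and the toy rows are hypothetical, and the actual shuffling algorithm (and therefore which rows land in the eval set for seed 12) may differ.

```python
import random


def merge_and_split(sources, eval_size, seed):
    """Concatenate row lists, then hold out `eval_size` rows chosen with a seeded shuffle."""
    merged = [row for source in sources for row in source]
    indices = list(range(len(merged)))
    random.Random(seed).shuffle(indices)  # deterministic for a given seed
    held_out = set(indices[:eval_size])
    train_rows = [row for i, row in enumerate(merged) if i not in held_out]
    eval_rows = [merged[i] for i in indices[:eval_size]]
    return train_rows, eval_rows


# Toy (query, answer) rows standing in for the three source datasets.
toy_sources = [
    [(f"nq-query-{i}", f"nq-answer-{i}") for i in range(6)],
    [(f"gpt4-query-{i}", f"gpt4-answer-{i}") for i in range(3)],
    [(f"wizard-query-{i}", f"wizard-answer-{i}") for i in range(3)],
]
train_rows, eval_rows = merge_and_split(toy_sources, eval_size=2, seed=12)

# The reported sizes follow the same arithmetic: total rows minus eval rows.
reported_train_rows = 869_365 - 1_000
```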

Training objective

Loss: SpladeLoss(SparseMultipleNegativesRankingLoss)
Document regularizer weight: 0.03
Query regularizer weight: 0
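
To make the objective concrete, here is a small pure-Python sketch of how a ranking loss with in-batch negatives can be combined with a FLOPS sparsity penalty using the weights listed above. It is a simplified illustration, not the Sentence Transformers implementation: it assumes dot-product scores with no scaling, dense lists in place of sparse tensors, and hypothetical function names.

```python
import math


def in_batch_ranking_loss(queries, docs):
    """Mean cross-entropy where docs[i] is the positive for queries[i]
    and every other document in the batch acts as a negative."""
    losses = []
    for i, query in enumerate(queries):
        scores = [sum(q * d for q, d in zip(query, doc)) for doc in docs]
        top = max(scores)  # subtract the max for numerical stability
        log_norm = top + math.log(sum(math.exp(s - top) for s in scores))
        losses.append(log_norm - scores[i])
    return sum(losses) / len(losses)


def flops_penalty(embeddings):
    """Sum over dimensions of the squared mean absolute activation in the batch."""
    batch = len(embeddings)
    dims = len(embeddings[0])
    return sum((sum(abs(e[j]) for e in embeddings) / batch) ** 2 for j in range(dims))


def splade_style_loss(queries, docs, doc_weight=0.03, query_weight=0.0):
    """Ranking loss plus weighted sparsity penalties (weights taken from the list above)."""
    return (
        in_batch_ranking_loss(queries, docs)
        + doc_weight * flops_penalty(docs)
        + query_weight * flops_penalty(queries)
    )


# Toy batch of two query/document pairs over a two-term vocabulary.
toy_queries = [[1.0, 0.0], [0.0, 1.0]]
toy_docs = [[2.0, 0.0], [0.0, 2.0]]
toy_loss = splade_style_loss(toy_queries, toy_docs)
```

With a query weight of 0, only the document side is pushed toward sparsity in this sketch.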

Core hyperparameters

Epochs: 3
Per-device batch size: 32
Max sequence length: 128
SPLADE pooling chunk size: 64
Learning rate: 2e-5
Warmup ratio: 0.1
Mixed precision: fp16=True
Batch sampler: NO_DUPLICATES
Router mapping: query -> query, answer -> document
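
As background for the pooling chunk size, SPLADE pooling is commonly described as taking, for each vocabulary term, the maximum of log(1 + ReLU(logit)) over all tokens. The sketch below assumes that chunking splits the sequence along the token dimension, which is a memory optimization and leaves the result unchanged because the maximum can be accumulated chunk by chunk. The function name is hypothetical and attention masking is ignored for brevity.

```python
import math


def splade_max_pool(token_logits, chunk_size=None):
    """Pool per-token vocabulary logits into one non-negative weight per term.

    token_logits: list of tokens, each a list of vocabulary logits.
    chunk_size:   number of tokens processed per chunk (None = all at once).
    """
    vocab_size = len(token_logits[0])
    pooled = [0.0] * vocab_size  # activations are >= 0, so 0 is a safe start
    step = chunk_size or len(token_logits)
    for start in range(0, len(token_logits), step):
        for logits in token_logits[start:start + step]:
            for j, logit in enumerate(logits):
                activation = math.log1p(max(logit, 0.0))
                if activation > pooled[j]:
                    pooled[j] = activation  # running max across chunks
    return pooled


# Three tokens over a three-term vocabulary (toy values).
toy_logits = [
    [1.0, -2.0, 0.0],
    [0.5, 3.0, -1.0],
    [2.0, 0.0, 0.0],
]
full = splade_max_pool(toy_logits)
chunked = splade_max_pool(toy_logits, chunk_size=2)
```

In this toy case `full` and `chunked` are identical, which is the property that makes chunked pooling safe to use.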

Training Hyperparameters

Non-Default Hyperparameters

learning_rate: 2e-05
lr_scheduler_type: cosine
warmup_steps: 0.1
weight_decay: 0.01
gradient_accumulation_steps: 4
max_grad_norm: 5.0
fp16: True
disable_tqdm: True
dataloader_num_workers: 4
batch_sampler: no_duplicates
router_mapping: {'query': 'query', 'answer': 'document'}
learning_rate_mapping: {'\.query\.0\.': 0.001}
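
For intuition about how these settings interact, the sketch below computes a cosine learning-rate schedule with linear warmup, using the peak rate of 2e-5 and a 10% warmup fraction. It is an approximation for illustration: the total step count is a hypothetical round number, the single-device effective batch size is an assumption (the number of devices is not stated here), and the separate 0.001 rate from `learning_rate_mapping` is not modeled.

```python
import math


def lr_at(step, total_steps, peak_lr=2e-5, warmup_ratio=0.1):
    """Learning rate at `step`: linear warmup to peak, then cosine decay to zero."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


# Per-device batch size times gradient accumulation steps (single device assumed).
EFFECTIVE_BATCH = 32 * 4

# Hypothetical total number of optimizer steps, chosen only to keep the example readable.
TOTAL_STEPS = 1000

schedule = [lr_at(step, TOTAL_STEPS) for step in range(TOTAL_STEPS + 1)]
```

The rate rises linearly for the first 100 steps, peaks at 2e-5, and decays along a half cosine to zero at the final step.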

Training Time

Training: 11.7 hours

Framework Versions

Python: 3.11.10
Sentence Transformers: 5.5.0
Transformers: 5.8.1
PyTorch: 2.6.0+cu124
Accelerate: 1.13.0
Datasets: 4.5.0
Tokenizers: 0.22.2

Additional Resources

Training and Finetuning Sparse Embedding Models with Sentence Transformers: the end-to-end guide for training or finetuning SPLADE and other sparse encoder models.

Citation

BibTeX

Sentence Transformers

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}

SpladeLoss

@misc{formal2022distillationhardnegativesampling,
      title={From Distillation to Hard Negative Sampling: Making Sparse Neural IR Models More Effective},
      author={Thibault Formal and Carlos Lassance and Benjamin Piwowarski and Stéphane Clinchant},
      year={2022},
      eprint={2205.04733},
      archivePrefix={arXiv},
      primaryClass={cs.IR},
      url={https://arxiv.org/abs/2205.04733},
}

SparseMultipleNegativesRankingLoss

@misc{oord2019representationlearningcontrastivepredictive,
      title={Representation Learning with Contrastive Predictive Coding},
      author={Aaron van den Oord and Yazhe Li and Oriol Vinyals},
      year={2019},
      eprint={1807.03748},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/1807.03748},
}

FlopsLoss

@article{paria2020minimizing,
    title={Minimizing flops to learn efficient sparse representations},
    author={Paria, Biswajit and Yeh, Chih-Kuan and Yen, Ian EH and Xu, Ning and Ravikumar, Pradeep and P{\'o}czos, Barnab{\'a}s},
    journal={arXiv preprint arXiv:2004.05665},
    year={2020}
}
