SPLADE (co-condenser-marco) fine-tuned on merged NQ + PT-BR instruction datasets

This model is a sparse retriever trained with the Sentence Transformers SPLADE stack on a merged bilingual corpus (English + Portuguese).

Usage

from sentence_transformers import SparseEncoder

# Load the sparse encoder from the Hugging Face Hub
sparse_model = SparseEncoder("cnmoro/inference-free-splade-co-condenser-en-ptbr-v2")

# Encode a batch of texts into sparse embeddings
sparse_embeddings = sparse_model.encode(["Hello", "World"], show_progress_bar=True)
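
SPLADE-style encoders produce vocabulary-sized vectors in which most entries are zero, and retrieval typically ranks documents by the dot product between query and document vectors. The sketch below illustrates that scoring step in plain Python, without the library. The token weights are made-up values for illustration, not outputs of this model, and `sparse_dot` is a hypothetical helper.

```python
def sparse_dot(query, document):
    """Dot product of two sparse vectors stored as {token: weight} dicts."""
    # Iterate over the smaller dict so the cost scales with the sparser side.
    if len(query) > len(document):
        query, document = document, query
    return sum(w * document[t] for t, w in query.items() if t in document)


# Hypothetical token weights (assumption: real weights come from the encoder).
query_vec = {"capital": 1.2, "brasil": 1.5}
doc_vec = {"brasilia": 0.9, "capital": 1.0, "brasil": 2.0}

# Only tokens active in both vectors contribute to the score.
score = sparse_dot(query_vec, doc_vec)
print(score)
```

Because only overlapping active tokens contribute, such vectors can be served from an inverted index rather than a dense vector store.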

Base model

  • Luyu/co-condenser-marco

Training date

  • Training completed on June 4, 2026.

Dataset composition

The training corpus was built by row-wise concatenation of:

  1. sentence-transformers/natural-questions
  2. cnmoro/GPT4-500k-Augmented-PTBR-Clean
  3. cnmoro/WizardVicuna-PTBR-Instruct-Clean

Final merged size:

  • Total rows: 869,365
  • Train rows: 868,365
  • Eval rows: 1,000
  • Split seed: 12
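
A seeded split with these sizes could be sketched as follows. This is a plain-Python illustration that assumes a shuffle-then-slice strategy; the card does not state which tool produced the split, so the exact row assignment will not match the author's. `split_rows` is a hypothetical helper.

```python
import random


def split_rows(total_rows, eval_rows, seed):
    """Shuffle row indices with a fixed seed and hold out `eval_rows` of them."""
    indices = list(range(total_rows))
    random.Random(seed).shuffle(indices)  # deterministic for a given seed
    return indices[eval_rows:], indices[:eval_rows]


# Sizes and seed taken from the figures above.
train_idx, eval_idx = split_rows(total_rows=869_365, eval_rows=1_000, seed=12)
print(len(train_idx), len(eval_idx))
```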

Training objective

  • Loss: SpladeLoss(SparseMultipleNegativesRankingLoss)
  • Document regularizer weight: 0.03
  • Query regularizer weight: 0
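
To make the role of the two weights concrete, here is a minimal numeric sketch of a SPLADE-style objective: a ranking loss plus weighted FLOPS regularizers (Paria et al., cited below). It is an assumption-laden illustration, not the library's implementation: activations are plain lists, the ranking loss is passed in as a number, and both function names are hypothetical.

```python
def flops_regularizer(batch):
    """FLOPS penalty: sum over dimensions of the squared mean absolute activation."""
    n = len(batch)
    dims = len(batch[0])
    return sum((sum(abs(row[j]) for row in batch) / n) ** 2 for j in range(dims))


def splade_objective(ranking_loss, doc_batch, query_batch,
                     doc_weight=0.03, query_weight=0.0):
    """Ranking loss plus weighted sparsity penalties on documents and queries."""
    return (ranking_loss
            + doc_weight * flops_regularizer(doc_batch)
            + query_weight * flops_regularizer(query_batch))


# Toy activations over a 2-token vocabulary (hypothetical values).
docs = [[1.0, 0.0], [3.0, 0.0]]
queries = [[5.0, 5.0]]
total = splade_objective(ranking_loss=1.0, doc_batch=docs, query_batch=queries)
print(total)
```

With a query weight of 0, the query activations do not affect the total, so only document vectors are pushed toward sparsity.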

Core hyperparameters

  • Epochs: 3
  • Per-device batch size: 32
  • Max sequence length: 128
  • SPLADE pooling chunk size: 64
  • Learning rate: 2e-5
  • Warmup ratio: 0.1
  • Mixed precision: fp16=True
  • Batch sampler: NO_DUPLICATES
  • Router mapping: query -> query, answer -> document
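
The learning rate and warmup ratio above, combined with the cosine scheduler listed under the non-default hyperparameters, imply a schedule roughly like the one sketched here. This assumes linear warmup followed by a half-cosine decay to zero; `lr_at` is a hypothetical helper, and the total step count of 1,000 is a placeholder rather than the real number of optimizer steps.

```python
import math


def lr_at(step, total_steps, base_lr=2e-5, warmup_ratio=0.1):
    """Learning rate at `step`: linear warmup, then cosine decay to zero."""
    warmup_steps = round(total_steps * warmup_ratio)
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


# Placeholder horizon of 1,000 steps (assumption, not from the training run).
for step in (0, 50, 100, 550, 1000):
    print(step, lr_at(step, total_steps=1000))
```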

Training Hyperparameters

Non-Default Hyperparameters

  • learning_rate: 2e-05
  • lr_scheduler_type: cosine
  • warmup_steps: 0.1
  • weight_decay: 0.01
  • gradient_accumulation_steps: 4
  • max_grad_norm: 5.0
  • fp16: True
  • disable_tqdm: True
  • dataloader_num_workers: 4
  • batch_sampler: no_duplicates
  • router_mapping: {'query': 'query', 'answer': 'document'}
  • learning_rate_mapping: {'\.query\.0\.': 0.001}
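
The `learning_rate_mapping` entry pairs a regular expression over parameter names with a learning rate, so matching parameters train at 0.001 while the rest use the base rate of 2e-05. The sketch below shows that selection logic in isolation. It assumes search-style regex matching, and the parameter names are hypothetical examples rather than names read from this checkpoint.

```python
import re


def pick_learning_rate(param_name, mapping, default_lr):
    """Return the rate of the first pattern found in `param_name`, else the default."""
    for pattern, lr in mapping.items():
        if re.search(pattern, param_name):
            return lr
    return default_lr


mapping = {r"\.query\.0\.": 0.001}

# Hypothetical parameter names for the query and document routes.
names = [
    "0.sub_modules.query.0.weight",
    "0.sub_modules.document.0.auto_model.embeddings.word_embeddings.weight",
]
rates = {name: pick_learning_rate(name, mapping, default_lr=2e-05) for name in names}
print(rates)
```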

Training Time

  • Training: 11.7 hours

Framework Versions

  • Python: 3.11.10
  • Sentence Transformers: 5.5.0
  • Transformers: 5.8.1
  • PyTorch: 2.6.0+cu124
  • Accelerate: 1.13.0
  • Datasets: 4.5.0
  • Tokenizers: 0.22.2

Citation

BibTeX

Sentence Transformers

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}

SpladeLoss

@misc{formal2022distillationhardnegativesampling,
      title={From Distillation to Hard Negative Sampling: Making Sparse Neural IR Models More Effective},
      author={Thibault Formal and Carlos Lassance and Benjamin Piwowarski and Stéphane Clinchant},
      year={2022},
      eprint={2205.04733},
      archivePrefix={arXiv},
      primaryClass={cs.IR},
      url={https://arxiv.org/abs/2205.04733},
}

SparseMultipleNegativesRankingLoss

@misc{oord2019representationlearningcontrastivepredictive,
      title={Representation Learning with Contrastive Predictive Coding},
      author={Aaron van den Oord and Yazhe Li and Oriol Vinyals},
      year={2019},
      eprint={1807.03748},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/1807.03748},
}

FlopsLoss

@article{paria2020minimizing,
    title={Minimizing flops to learn efficient sparse representations},
    author={Paria, Biswajit and Yeh, Chih-Kuan and Yen, Ian EH and Xu, Ning and Ravikumar, Pradeep and P{\'o}czos, Barnab{\'a}s},
    journal={arXiv preprint arXiv:2004.05665},
    year={2020}
}