---
license: apache-2.0
datasets:
- cnmoro/GPT4-500k-Augmented-PTBR-Clean
- cnmoro/WizardVicuna-PTBR-Instruct-Clean
- sentence-transformers/natural-questions
language:
- en
- pt
base_model:
- Luyu/co-condenser-marco
pipeline_tag: sentence-similarity
library_name: sentence-transformers
tags:
- splade
- sparse
- embeddings
- bert
- portuguese
- pt
- ptbr
---
# SPLADE (co-condenser-marco) fine-tuned on merged NQ + PT-BR instruction datasets

This model is a sparse retriever fine-tuned from `Luyu/co-condenser-marco` with the Sentence Transformers SPLADE training stack on a merged bilingual corpus (English and Brazilian Portuguese).

## Usage

```python
from sentence_transformers import SparseEncoder

sparse_model = SparseEncoder("cnmoro/inference-free-splade-co-condenser-en-ptbr-v2")

sparse_embeddings = sparse_model.encode(["Hello", "World"], show_progress_bar=True)
```
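
Each row returned by `encode` is a vocabulary-sized vector in which most entries are zero. As a rough illustration of how such vectors are typically compared, the sketch below scores two sparse vectors with a dot product over their shared non-zero dimensions. It assumes dot-product scoring, which is the usual choice for SPLADE-style models. The token names and weights are made up for the example, and `sparse_dot` is a hypothetical helper, not part of Sentence Transformers.

```python
def sparse_dot(query_weights, doc_weights):
    """Dot product of two sparse vectors stored as {token: weight} dicts."""
    # Iterate over the smaller dict; only tokens present in both contribute.
    if len(query_weights) > len(doc_weights):
        query_weights, doc_weights = doc_weights, query_weights
    return sum(
        weight * doc_weights[token]
        for token, weight in query_weights.items()
        if token in doc_weights
    )


# Toy vectors (hypothetical values, not real model output).
query = {"capital": 1.2, "brazil": 2.0}
doc = {"brazil": 1.5, "brasilia": 2.3, "capital": 0.5}

score = sparse_dot(query, doc)  # 1.2 * 0.5 + 2.0 * 1.5
print(score)
```

Because only overlapping tokens contribute, documents can be stored in an inverted index and scored without touching dimensions the query does not activate.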

## Base model

- `Luyu/co-condenser-marco`

## Training date

- Training completed on **June 4, 2026**.

## Dataset composition

The training corpus was built by row-wise concatenation of:

1. `sentence-transformers/natural-questions`
2. `cnmoro/GPT4-500k-Augmented-PTBR-Clean`
3. `cnmoro/WizardVicuna-PTBR-Instruct-Clean`

Final merged size:

- Total rows: **869,365**
- Train rows: **868,365**
- Eval rows: **1,000**
- Split seed: `12`
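
The merge-and-split procedure above can be sketched in plain Python. This is only an illustration under stated assumptions: the actual pipeline most likely used the `datasets` library, whose shuffling differs from `random.Random`, so this sketch will not reproduce the exact same eval rows. The helper `concat_and_split` and the toy rows are hypothetical; the `query` and `answer` column names follow the router mapping listed in this card.

```python
import random


def concat_and_split(sources, eval_size, seed):
    """Concatenate row lists, then hold out `eval_size` rows with a seeded shuffle."""
    merged = [row for source in sources for row in source]  # row-wise concatenation
    indices = list(range(len(merged)))
    random.Random(seed).shuffle(indices)  # deterministic for a fixed seed
    eval_rows = [merged[i] for i in indices[:eval_size]]
    train_rows = [merged[i] for i in indices[eval_size:]]
    return train_rows, eval_rows


# Toy stand-ins for the three source datasets (hypothetical rows).
nq = [{"query": f"nq-q{i}", "answer": f"nq-a{i}"} for i in range(5)]
gpt4_ptbr = [{"query": f"g-q{i}", "answer": f"g-a{i}"} for i in range(4)]
wizard_ptbr = [{"query": f"w-q{i}", "answer": f"w-a{i}"} for i in range(3)]

train_rows, eval_rows = concat_and_split([nq, gpt4_ptbr, wizard_ptbr], eval_size=2, seed=12)
print(len(train_rows), len(eval_rows))

# Sanity check on the sizes reported in this card.
assert 868_365 + 1_000 == 869_365
```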

## Training objective

- Loss: `SpladeLoss(SparseMultipleNegativesRankingLoss)`
- Document regularizer weight: `0.03`
- Query regularizer weight: `0`
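
To make the role of the two regularizer weights concrete, here is a minimal sketch of how the terms combine, assuming the total loss is the ranking loss plus a weighted FLOPS penalty per side, with FLOPS defined as in the Paria et al. reference cited below (sum over dimensions of the squared mean activation across the batch). The function names are hypothetical and the sketch ignores any weight scheduling the library may apply; it is not the Sentence Transformers implementation.

```python
def flops_regularizer(batch):
    """Sum over dimensions of the squared batch-mean absolute activation."""
    n_rows = len(batch)
    n_dims = len(batch[0])
    mean_per_dim = [sum(abs(row[j]) for row in batch) / n_rows for j in range(n_dims)]
    return sum(mean * mean for mean in mean_per_dim)


def total_loss(ranking_loss, doc_batch, query_batch, doc_weight=0.03, query_weight=0.0):
    """Ranking loss plus weighted sparsity penalties (weights taken from this card)."""
    return (
        ranking_loss
        + doc_weight * flops_regularizer(doc_batch)
        + query_weight * flops_regularizer(query_batch)
    )


# Toy activations (hypothetical): 2 texts, 2 vocabulary dimensions each.
docs = [[1.0, 0.0], [3.0, 0.0]]
queries = [[5.0, 5.0], [5.0, 5.0]]
print(total_loss(1.0, docs, queries))
```

With a query weight of `0`, the query side contributes no sparsity penalty, so only document vectors are pushed toward fewer active dimensions.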

## Core hyperparameters

- Epochs: **3**
- Per-device batch size: **32**
- Max sequence length: **128**
- SPLADE pooling chunk size: **64**
- Learning rate: `2e-5`
- Warmup ratio: `0.1`
- Mixed precision: `fp16=True`
- Batch sampler: `NO_DUPLICATES`
- Router mapping: `query -> query`, `answer -> document`
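
For a sense of scale, the figures above can be turned into an approximate step count. The sketch below combines them with the gradient accumulation value of 4 listed among the non-default hyperparameters. It assumes a single training device, which this card does not state, and it ignores any effect the `NO_DUPLICATES` sampler may have on the number of batches, so treat the result as an estimate. `estimate_steps` is a hypothetical helper.

```python
import math


def estimate_steps(train_rows, per_device_batch, grad_accum, epochs, warmup_ratio, num_devices=1):
    """Rough optimizer-step arithmetic; num_devices=1 is an assumption."""
    samples_per_step = per_device_batch * grad_accum * num_devices
    batches_per_epoch = math.ceil(train_rows / (per_device_batch * num_devices))
    steps_per_epoch = math.ceil(batches_per_epoch / grad_accum)
    total_steps = steps_per_epoch * epochs
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    return samples_per_step, total_steps, warmup_steps


samples_per_step, total_steps, warmup_steps = estimate_steps(
    train_rows=868_365,
    per_device_batch=32,
    grad_accum=4,
    epochs=3,
    warmup_ratio=0.1,
)
print(samples_per_step, total_steps, warmup_steps)
```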

### Training Hyperparameters
#### Non-Default Hyperparameters

- `learning_rate`: 2e-05
- `lr_scheduler_type`: cosine
- `warmup_steps`: 0.1
- `weight_decay`: 0.01
- `gradient_accumulation_steps`: 4
- `max_grad_norm`: 5.0
- `fp16`: True
- `disable_tqdm`: True
- `dataloader_num_workers`: 4
- `batch_sampler`: no_duplicates
- `router_mapping`: {'query': 'query', 'answer': 'document'}
- `learning_rate_mapping`: {'\\.query\\.0\\.': 0.001}
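
The `learning_rate_mapping` entry assigns a higher learning rate (`0.001`) to parameters whose names match the pattern `\.query\.0\.`, while everything else keeps the base rate of `2e-05`. The sketch below illustrates that idea, assuming patterns are applied as a regex search against parameter names. The parameter names and the `resolve_learning_rate` helper are hypothetical, chosen only to show the matching behavior.

```python
import re


def resolve_learning_rate(param_name, mapping, default_lr):
    """Return the rate of the first pattern that matches, else the default."""
    for pattern, lr in mapping.items():
        if re.search(pattern, param_name):
            return lr
    return default_lr


mapping = {r"\.query\.0\.": 1e-3}  # same pattern as in the list above
base_lr = 2e-5

# Hypothetical parameter names for the query and document routes.
query_param = "0.sub_modules.query.0.weight"
doc_param = "0.sub_modules.document.0.auto_model.embeddings.word_embeddings.weight"

print(resolve_learning_rate(query_param, mapping, base_lr))
print(resolve_learning_rate(doc_param, mapping, base_lr))
```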

### Training Time
- **Training**: 11.7 hours

### Framework Versions
- Python: 3.11.10
- Sentence Transformers: 5.5.0
- Transformers: 5.8.1
- PyTorch: 2.6.0+cu124
- Accelerate: 1.13.0
- Datasets: 4.5.0
- Tokenizers: 0.22.2

## Additional Resources

- [Training and Finetuning Sparse Embedding Models with Sentence Transformers](https://huggingface.co/blog/train-sparse-encoder): the end-to-end guide for training or finetuning SPLADE and other sparse encoder models.

## Citation

### BibTeX

#### Sentence Transformers
```bibtex
@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}
```

#### SpladeLoss
```bibtex
@misc{formal2022distillationhardnegativesampling,
      title={From Distillation to Hard Negative Sampling: Making Sparse Neural IR Models More Effective},
      author={Thibault Formal and Carlos Lassance and Benjamin Piwowarski and Stéphane Clinchant},
      year={2022},
      eprint={2205.04733},
      archivePrefix={arXiv},
      primaryClass={cs.IR},
      url={https://arxiv.org/abs/2205.04733},
}
```

#### SparseMultipleNegativesRankingLoss
```bibtex
@misc{oord2019representationlearningcontrastivepredictive,
      title={Representation Learning with Contrastive Predictive Coding},
      author={Aaron van den Oord and Yazhe Li and Oriol Vinyals},
      year={2019},
      eprint={1807.03748},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/1807.03748},
}
```

#### FlopsLoss
```bibtex
@article{paria2020minimizing,
    title={Minimizing flops to learn efficient sparse representations},
    author={Paria, Biswajit and Yeh, Chih-Kuan and Yen, Ian EH and Xu, Ning and Ravikumar, Pradeep and P{\'o}czos, Barnab{\'a}s},
    journal={arXiv preprint arXiv:2004.05665},
    year={2020}
}
```