llama3_1-8b-ft-msmarco

This model is based on meta-llama/Llama-3.1-8B and was fine-tuned for 1 epoch on the English MSMARCO training data using FlagEmbedding.

This model is released as part of the ReasonEmbed resources. ReasonEmbed studies enhanced text embeddings for reasoning-intensive document retrieval; for more details, please refer to our paper.

Training Data

The model was fine-tuned on the MSMARCO data from hanhainebula/bge-multilingual-gemma2-data.

Training Procedure

Base model: meta-llama/Llama-3.1-8B
Training framework: FlagEmbedding
Training data: MSMARCO
Number of epochs: 1

Tokenizer Note

Following the tokenizer modification discussed in Qwen3-Embedding-0.6B discussion #2, we modified the tokenizer so that it automatically appends the EOS token to every tokenized input.
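To illustrate what this means at the token-ID level, here is a minimal sketch. The helper names `append_eos` and `ends_with_eos` and the toy EOS ID are hypothetical and are not part of the released tokenizer or of FlagEmbedding. With the tokenizer shipped here, appending by hand should be unnecessary. The check may still be useful as a sanity test on encoded batches.

```python
from typing import List

# Toy EOS ID for illustration only (an assumption). In practice, read the
# real value from the loaded tokenizer's `eos_token_id` attribute.
TOY_EOS_ID = 99


def append_eos(input_ids: List[int], eos_token_id: int) -> List[int]:
    """Return a copy of input_ids that ends with the EOS token.

    If the sequence already ends with EOS, it is returned unchanged,
    so applying this twice does not duplicate the token.
    """
    if input_ids and input_ids[-1] == eos_token_id:
        return list(input_ids)
    return list(input_ids) + [eos_token_id]


def ends_with_eos(batch: List[List[int]], eos_token_id: int) -> bool:
    """Check that every (unpadded) sequence in the batch ends with EOS."""
    return all(len(ids) > 0 and ids[-1] == eos_token_id for ids in batch)


if __name__ == "__main__":
    raw = [[5, 6, 7], [8, 9, TOY_EOS_ID]]
    fixed = [append_eos(ids, TOY_EOS_ID) for ids in raw]
    print(fixed)
    print(ends_with_eos(fixed, TOY_EOS_ID))
```

The check assumes unpadded sequences. With right padding, the last position would hold a pad token, so the EOS position would have to be located through the attention mask instead.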

License

This model is released under the CC BY-NC 4.0 license.

Citation

If you find this repository useful, please consider giving it a star ⭐ and citing our paper:

@article{chen2025reasonembed,
  title={ReasonEmbed: Enhanced Text Embeddings for Reasoning-Intensive Document Retrieval},
  author={Chen, Jianlyu and Lan, Junwei and Li, Chaofan and Lian, Defu and Liu, Zheng},
  journal={arXiv preprint arXiv:2510.08252},
  year={2025}
}