llama3_1-8b-ft-msmarco

This model is based on meta-llama/Llama-3.1-8B and was fine-tuned for 1 epoch on the English MSMARCO training data using FlagEmbedding.

This model is released as part of the ReasonEmbed resources. ReasonEmbed studies enhanced text embeddings for reasoning-intensive document retrieval; for more details, please refer to our paper.

Training Data

The model was fine-tuned on the MSMARCO data from hanhainebula/bge-multilingual-gemma2-data.

Training Procedure

Tokenizer Note

Following the tokenizer modification discussed in Qwen3-Embedding-0.6B discussion #2, we modified the tokenizer so that it automatically appends the EOS token during tokenization.
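
As a rough illustration of the effect of this change (not the actual tokenizer code), the sketch below appends an EOS id to a tokenized sequence if it is not already the final token, and then locates the last non-padded position. The function names `ensure_eos` and `last_token_index`, the token ids, and the assumption of right padding are all hypothetical and chosen only for this example. This card does not state which pooling strategy the model uses, so the second helper is shown only to indicate why a trailing EOS can matter when an embedding is read from the final position.

```python
from typing import List

# Hypothetical EOS id used only for this illustration; the real value
# should be read from the released tokenizer (tokenizer.eos_token_id).
EOS_ID = 2


def ensure_eos(token_ids: List[int], eos_token_id: int) -> List[int]:
    """Return a copy of token_ids that ends with eos_token_id exactly once."""
    if token_ids and token_ids[-1] == eos_token_id:
        return list(token_ids)
    return list(token_ids) + [eos_token_id]


def last_token_index(attention_mask: List[int]) -> int:
    """Index of the last attended position, assuming right padding."""
    return sum(attention_mask) - 1


# A sequence without EOS gets one appended; a sequence that already
# ends with EOS is left unchanged.
without_eos = ensure_eos([5, 6, 7], EOS_ID)
with_eos = ensure_eos([5, 6, EOS_ID], EOS_ID)

# With two padding positions on the right, the last real token is at index 3.
index = last_token_index([1, 1, 1, 1, 0, 0])
```

With the released tokenizer this step should not be needed by hand; a quick check is to tokenize a short string and confirm that the last id equals the tokenizer's EOS id.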

License

This model is released under the CC BY-NC 4.0 license.

Citation

If you find this repository useful, please consider giving it a star ⭐ and citing our paper:

@article{chen2025reasonembed,
  title={ReasonEmbed: Enhanced Text Embeddings for Reasoning-Intensive Document Retrieval},
  author={Chen, Jianlyu and Lan, Junwei and Li, Chaofan and Lian, Defu and Liu, Zheng},
  journal={arXiv preprint arXiv:2510.08252},
  year={2025}
}