alphaedge-ai/multilingual-e5-large-instruct-ell-16384

@@ -1,60 +1,70 @@
----
-pipeline_tag: sentence-similarity
-language: ell
-license: mit
-tags:
-  - trimmed
-library_name: sentence-transformers
-base_model: intfloat/multilingual-e5-large-instruct
-base_model_relation: quantized
-datasets:
-  - alphaedgeai/fineweb-2-trimming
----
-# multilingual-e5-large-instruct-ell-16384
-This model is a 42.73% smaller version of [intfloat/multilingual-e5-large-instruct](https://huggingface.co/intfloat/multilingual-e5-large-instruct) optimized for French language via vocabulary size reduction using the [trimming](https://huggingface.co/blog/introduction-to-trimming) method.
-This trimmed model should perform similarly to the original model with only 16,384 tokens and a much smaller memory footprint. However, it may not perform well for other languages as tokens not commonly used in the selected languages were removed from the vocabulary.
-## Model Statistics
-| Metric | Original | Trimmed | Reduction |
-|--------|----------|---------|-----------|
-| **Vocabulary size** | 250,002 tokens | 16,384 tokens | **93.44%** |
-| **Model size** | 559,890,432 params | 320,665,600 params | **42.73%** |
-![image](https://cdn-uploads.huggingface.co/production/uploads/613b0a62a14099d5afed7830/-AeR_jPanVElTlJmF98wm.png)
-## Mining Dataset Statistics
-- **Number of texts used for mining**: 200,000 texts
-- **Dataset**: [alphaedgeai/fineweb-2-trimming](https://huggingface.co/datasets/AlphaEdgeAI/fineweb-2-trimming)
-## Usage
-```python
-from sentence_transformers import SentenceTransformer
-# Download from the 🤗 Hub
-model = SentenceTransformer("alphaedgeai/multilingual-e5-large-instruct-ell-16384")
-# Run inference with queries and documents
-query = "My query"
-documents = [
-    "Chunk 1",
-    "Chunk 2",
-    "Chunk 3",
-]
-query_embeddings = model.encode_query(query)
-document_embeddings = model.encode_document(documents)
-print(query_embeddings.shape, document_embeddings.shape)
-# Compute similarities to determine a ranking
-similarities = model.similarity(query_embeddings, document_embeddings)
-print(similarities)
-```
-## Citation
-#### Multilingual E5
-```
-@article{wang2024multilingual,
-  title={Multilingual E5 Text Embeddings: A Technical Report},
-  author={Wang, Liang and Yang, Nan and Huang, Xiaolong and Yang, Linjun and Majumder, Rangan and Wei, Furu},
-  journal={arXiv preprint arXiv:2402.05672},
-  year={2024}
-}
-```

+---
+pipeline_tag: sentence-similarity
+language: ell
+license: mit
+tags:
+  - trimmed
+library_name: sentence-transformers
+base_model: intfloat/multilingual-e5-large-instruct
+base_model_relation: quantized
+datasets:
+  - lbourdois/fineweb-2-trimming
+---
+# multilingual-e5-large-instruct-ell-16384
+This model is a **42.73% smaller** version of [intfloat/multilingual-e5-large-instruct](https://huggingface.co/intfloat/multilingual-e5-large-instruct), optimized for the **Greek** language by reducing the vocabulary size with the [trimming](https://huggingface.co/blog/lbourdois/introduction-to-trimming) method.
+With a vocabulary of only 16,384 tokens, this trimmed model should perform similarly to the original model while using a much smaller memory footprint. However, it may perform poorly on other languages, because tokens not commonly used in the selected language were removed from the vocabulary.
+## Model Statistics
+| Metric | Original | Trimmed | Reduction |
+|--------|----------|---------|-----------|
+| **Vocabulary size** | 250,037 tokens | 16,384 tokens | **93.44%** |
+| **Model size** | 559,890,432 params | 320,665,600 params | **42.73%** |
+![image](https://raw.githubusercontent.com/lbourdois/blog/refs/heads/master/assets/images/Trimming/me5-large-16384.png)
+## Mining Dataset Statistics
+- **Number of texts used for mining**: 200,000 texts
+- **Dataset**: [lbourdois/fineweb-2-trimming](https://huggingface.co/datasets/lbourdois/fineweb-2-trimming)
+## Usage
+```python
+from sentence_transformers import SentenceTransformer
+# Download from the 🤗 Hub
+model = SentenceTransformer("alphaedge-ai/multilingual-e5-large-instruct-ell-16384")
+# Run inference with queries and documents
+query = "My query in Greek"
+documents = [
+    "Chunk in Greek",
+    "Chunk in Greek",
+    "Chunk in Greek",
+]
+query_embeddings = model.encode_query(query)
+document_embeddings = model.encode_document(documents)
+print(query_embeddings.shape, document_embeddings.shape)
+# Compute similarities to determine a ranking
+similarities = model.similarity(query_embeddings, document_embeddings)
+print(similarities)
+```
+## Citations
+#### Multilingual E5
+```
+@article{wang2024multilingual,
+  title={Multilingual E5 Text Embeddings: A Technical Report},
+  author={Wang, Liang and Yang, Nan and Huang, Xiaolong and Yang, Linjun and Majumder, Rangan and Wei, Furu},
+  journal={arXiv preprint arXiv:2402.05672},
+  year={2024}
+}
+```
+#### Trimming blog post
+```
+@misc{hf_blogpost_trimming,
+      title={Introduction to Trimming},
+      author={Loïck BOURDOIS and Tom AARSEN and Bram VANROY and Christopher AKIKI and Woojun JUNG and Manuel ROMERO and Prithiv SAKTHI},
+      year={2026},
+      url={https://huggingface.co/blog/lbourdois/introduction-to-trimming},
+}
+```
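
As a quick sanity check on the Model Statistics table in the updated card, the reduction percentages can be recomputed from the raw counts. The sketch below is illustrative only and is not part of the model card: the counts are copied from the table, and the helper name `reduction_pct` is hypothetical.

```python
# Counts copied from the Model Statistics table of the updated card.
ORIGINAL_VOCAB = 250_037
TRIMMED_VOCAB = 16_384
ORIGINAL_PARAMS = 559_890_432
TRIMMED_PARAMS = 320_665_600


def reduction_pct(original: int, trimmed: int) -> float:
    """Return the relative size reduction as a percentage (hypothetical helper)."""
    return 100.0 * (original - trimmed) / original


vocab_reduction = reduction_pct(ORIGINAL_VOCAB, TRIMMED_VOCAB)
param_reduction = reduction_pct(ORIGINAL_PARAMS, TRIMMED_PARAMS)

print(f"Vocabulary reduction: {vocab_reduction:.3f}%")
print(f"Parameter reduction:  {param_reduction:.3f}%")
```

Both values should land within about 0.01 percentage points of the figures in the table; a difference in the last digit depends on whether the published value was rounded or truncated.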