Feature Extraction
Transformers
Safetensors
sentence-transformers
English
code
qwen2
text-generation
embeddings
retrieval
code-search
semantic-search
Eval Results (legacy)
text-embeddings-inference
Instructions to use faisalmumtaz/codecompass-embed with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use faisalmumtaz/codecompass-embed with Transformers:
    # Use a pipeline as a high-level helper
    from transformers import pipeline

    pipe = pipeline("feature-extraction", model="faisalmumtaz/codecompass-embed")

    # Load model directly (AutoModel returns the hidden states used for embeddings)
    from transformers import AutoTokenizer, AutoModel

    tokenizer = AutoTokenizer.from_pretrained("faisalmumtaz/codecompass-embed")
    model = AutoModel.from_pretrained("faisalmumtaz/codecompass-embed")

- sentence-transformers
How to use faisalmumtaz/codecompass-embed with sentence-transformers:
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("faisalmumtaz/codecompass-embed")

    sentences = [
        "The weather is lovely today.",
        "It's so sunny outside!",
        "He drove to the stadium.",
    ]
    embeddings = model.encode(sentences)

    similarities = model.similarity(embeddings, embeddings)
    print(similarities.shape)  # [3, 3]

- Notebooks
- Google Colab
- Kaggle
Upload CodeCompass-Embed v2 — #1 on CSN-Python (NDCG@10=0.979), 12-task CoIR eval
README.md
CHANGED
    @@ -186,11 +186,17 @@ For optimal performance, use these instruction prefixes for queries:
     
     ## Training Details
     
    -
    -
    +Training followed a two-stage approach:
    +
    +**Stage 1 — Embedding Conversion** (8.8M samples):
    +Converted Qwen2.5-Coder-0.5B from a causal language model to a bidirectional embedding model. Trained on 8.8M samples spanning CoRNStack (Python, Java, JavaScript, Go, Ruby, PHP), CoderPile, StackOverflow, and synthetic SQL data with mined hard negatives.
    +
    +**Stage 2 — Hard Negative Refinement** (100K samples):
    +Continued fine-tuning on a curated 100K-sample subset with up to 8 hard negatives per sample.
    +
    +- **Base Model**: [Qwen2.5-Coder-0.5B](https://huggingface.co/Qwen/Qwen2.5-Coder-0.5B)
     - **Architecture**: Bidirectional attention across all 24 layers, mean pooling, L2 normalization
     - **Loss**: InfoNCE with temperature τ=0.05
    -- **Hard Negatives**: Up to 8 per sample (GPT-validated)
     - **Effective Batch Size**: 1024 (via GradCache)
     - **Hardware**: NVIDIA H100 (95GB)
     
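The Training Details above name three pieces that fit together: mean pooling over token states, L2 normalization, and an InfoNCE loss with temperature τ=0.05 computed against hard negatives. The following is a minimal, dependency-free sketch of that arithmetic for a single query, not the author's training code. The function names (`mean_pool`, `l2_normalize`, `info_nce`) and the toy vectors are hypothetical, and the real pipeline presumably operates on batched tensors with in-batch negatives and GradCache, none of which is shown here.

```python
import math


def mean_pool(token_vectors, attention_mask):
    """Average the token vectors at positions where attention_mask is 1."""
    dim = len(token_vectors[0])
    total = [0.0] * dim
    count = 0
    for vec, keep in zip(token_vectors, attention_mask):
        if keep:
            count += 1
            for i, value in enumerate(vec):
                total[i] += value
    # Guard against an all-padding sequence.
    return [t / max(count, 1) for t in total]


def l2_normalize(vec, eps=1e-12):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / max(norm, eps) for v in vec]


def info_nce(query, positive, negatives, temperature=0.05):
    """InfoNCE for one query: negative log-softmax of the positive's score.

    All inputs are assumed to be unit-length embeddings.
    """
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    # The positive's logit is placed first; negatives follow.
    logits = [dot(query, positive) / temperature]
    logits += [dot(query, neg) / temperature for neg in negatives]
    # Log-sum-exp with max subtraction for numerical stability.
    peak = max(logits)
    log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
    return log_denominator - logits[0]


# Toy example (made-up numbers): the third token is padding and is ignored.
tokens = [[1.0, 0.0], [0.0, 1.0], [9.0, 9.0]]
mask = [1, 1, 0]
pooled = mean_pool(tokens, mask)
query = l2_normalize(pooled)
positive = l2_normalize([1.0, 1.0])
hard_negatives = [l2_normalize([1.0, -1.0]), l2_normalize([-1.0, 0.0])]

good_loss = info_nce(query, positive, hard_negatives)
# Treating a hard negative as the target should give a much larger loss.
bad_loss = info_nce(query, hard_negatives[0], [positive])
print(good_loss, bad_loss)
```

With a temperature as low as 0.05, cosine similarities are scaled by 20 before the softmax, so even small similarity gaps between the positive and a hard negative produce large differences in the loss.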