---
license: apache-2.0
language:
- en
- code
library_name: transformers
tags:
- code
- embeddings
- retrieval
- code-search
- semantic-search
- feature-extraction
- sentence-transformers
datasets:
- code-rag-bench/cornstack
- bigcode/stackoverflow
- code_search_net
pipeline_tag: feature-extraction
base_model: Qwen/Qwen2.5-Coder-0.5B
model-index:
- name: CodeCompass-Embed
  results:
  - task:
      type: retrieval
      name: Code Retrieval
    dataset:
      type: CoIR-Retrieval/CodeSearchNet-python
      name: CodeSearchNet Python
    metrics:
    - type: ndcg@10
      value: 0.979
      name: NDCG@10
    - type: mrr@10
      value: 0.976
      name: MRR@10
  - task:
      type: retrieval
      name: Code Translation
    dataset:
      type: CoIR-Retrieval/codetrans-dl
      name: CodeTrans-DL
    metrics:
    - type: ndcg@10
      value: 0.286
      name: NDCG@10
---

# CodeCompass-Embed

**CodeCompass-Embed** is a 494M-parameter embedding model for semantic code search and retrieval, trained on a total of 86B tokens. It produces 896-dimensional embeddings optimized for matching natural language queries to code across Python, Java, JavaScript, Go, Ruby, and PHP. On the [CoIR code retrieval benchmark](https://github.com/CoIR-team/coir), it posts the highest NDCG@10 among the compared models on CodeSearchNet-Python, CodeTrans-DL, CodeFeedback, and APPS.

## Model Highlights

- **Code search from natural language**: find relevant code snippets across Python, Java, JavaScript, Go, Ruby, and PHP
- **Competitive with both smaller and larger models**: 494M params and 896-dim embeddings, compared against models from 109M to 568M params
- **Bidirectional attention**: all 24 layers converted from causal for better embedding quality
- **Lightweight**: runs on consumer GPUs; trained at 512 tokens with RoPE extrapolation for longer inputs
- **Versatile**: supports NL→Code, Code→Code, Q&A, and Text→SQL retrieval via instruction templates

## Model Details

| Property | Value |
|----------|-------|
| Base Model | Qwen2.5-Coder-0.5B |
| Parameters | 494M |
| Embedding Dimension | 896 |
| Max Sequence Length | 512 (training) / 32K (inference) |
| Pooling | Mean |
| Normalization | L2 |
| Attention | Bidirectional (all 24 layers) |

## Benchmark Results (CoIR)

Evaluated on the [CoIR Benchmark](https://github.com/CoIR-team/coir) (ACL 2025). All scores are NDCG@10. Rows are sorted by CSN-Python score; the best score in each column is in bold.

| Model | Params | CSN-Py | CodeTrans | Text2SQL | SO-QA | CodeFeedback | Apps | Avg |
|-------|--------|--------|-----------|----------|-------|--------------|------|-----|
| CodeCompass-Embed (ours) | 494M | **0.979** | **0.286** | 0.736 | 0.834 | **0.814** | **0.349** | 0.666 |
| SFR-Embedding-Code | 400M | 0.951 | 0.268 | **0.995** | **0.911** | 0.726 | 0.221 | **0.679** |
| Jina-Code-v2 | 161M | 0.944 | 0.274 | 0.517 | 0.887 | 0.698 | 0.154 | 0.579 |
| CodeRankEmbed | 137M | 0.938 | 0.260 | 0.769 | 0.899 | 0.717 | 0.199 | 0.630 |
| Snowflake-Arctic-Embed-L | 568M | 0.915 | 0.196 | 0.540 | 0.872 | 0.650 | 0.144 | 0.553 |
| BGE-M3 | 568M | 0.898 | 0.219 | 0.573 | 0.850 | 0.644 | 0.145 | 0.555 |
| BGE-Base-en-v1.5 | 109M | 0.894 | 0.213 | 0.527 | 0.858 | 0.642 | 0.142 | 0.546 |
| CodeT5+-110M | 110M | 0.870 | 0.179 | 0.328 | 0.815 | 0.580 | 0.118 | 0.482 |

### Multi-Language Code Search (CodeSearchNet)

| Language | NDCG@10 | MRR@10 |
|----------|---------|--------|
| **Python** | **0.979** | **0.976** |
| Go | 0.797 | 0.767 |
| Java | 0.639 | 0.600 |
| PHP | 0.627 | 0.585 |
| JavaScript | 0.621 | 0.578 |
| Ruby | 0.579 | 0.535 |

### Full Results (All 12 Tasks)

| Task | NDCG@10 | MRR@10 |
|------|---------|--------|
| codesearchnet-python | 0.979 | 0.976 |
| stackoverflow-qa | 0.834 | 0.810 |
| codefeedback-st | 0.814 | 0.775 |
| codesearchnet-go | 0.797 | 0.767 |
| synthetic-text2sql | 0.736 | 0.662 |
| codesearchnet-java | 0.639 | 0.600 |
| codesearchnet-php | 0.627 | 0.585 |
| codesearchnet-javascript | 0.621 | 0.578 |
| codesearchnet-ruby | 0.579 | 0.535 |
| apps | 0.349 | 0.307 |
| codetrans-dl | 0.286 | 0.164 |
| cosqa | 0.209 | 0.165 |
| **Average (12 tasks)** | **0.623** | **0.577** |

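
For readers unfamiliar with the two metrics reported in these tables, the following is a minimal, dependency-free sketch of how NDCG@10 and MRR@10 can be computed for a single query under binary relevance. The function names `ndcg_at_k` and `mrr_at_k` and the toy document IDs are illustrative assumptions; the reported scores come from the CoIR evaluation tooling, not from this snippet.

```python
import math

def mrr_at_k(ranked_ids, relevant_ids, k=10):
    """Reciprocal rank of the first relevant document within the top k, else 0."""
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(ranked_ids, relevant_ids, k=10):
    """Binary-relevance NDCG: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc_id in enumerate(ranked_ids[:k], start=1)
        if doc_id in relevant_ids
    )
    # The ideal ranking places every relevant document at the top.
    ideal_hits = min(len(relevant_ids), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0

# Toy query (hypothetical IDs): one relevant snippet, retrieved at rank 2.
ranking = ["c7", "c2", "c5"]
relevant = {"c2"}
print(round(mrr_at_k(ranking, relevant), 3))   # 0.5
print(round(ndcg_at_k(ranking, relevant), 3))  # 0.631
```

Benchmark scores are the mean of these per-query values over all queries in a task. MRR only rewards the first relevant hit, while NDCG discounts every relevant hit by its rank, which is why the two columns differ even when each query has a single relevant document.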
## Usage

### With Transformers

```python
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

# Load model
model = AutoModel.from_pretrained("faisalmumtaz/codecompass-embed", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("faisalmumtaz/codecompass-embed")

# CRITICAL: Enable bidirectional attention for embeddings
for layer in model.model.layers:
    layer.self_attn.is_causal = False
model.eval()

def encode(texts, is_query=False):
    # Add instruction prefix for queries only; documents are encoded as-is
    if is_query:
        texts = [f"Instruct: Find the most relevant code snippet given the following query:\nQuery: {t}" for t in texts]
    inputs = tokenizer(texts, padding=True, truncation=True, max_length=512, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs, output_hidden_states=True)
    hidden = outputs.hidden_states[-1]
    # Mean pooling over non-padding tokens
    mask = inputs["attention_mask"].unsqueeze(-1).float()
    embeddings = (hidden * mask).sum(1) / mask.sum(1).clamp(min=1e-9)
    # L2 normalize so that dot products are cosine similarities
    embeddings = F.normalize(embeddings, p=2, dim=-1)
    return embeddings

# Example: Code Search
query = "How to sort a list in Python"
code_snippets = [
    "def sort_list(lst):\n    return sorted(lst)",
    "def add_numbers(a, b):\n    return a + b",
    "def reverse_string(s):\n    return s[::-1]",
]
query_emb = encode([query], is_query=True)
code_embs = encode(code_snippets, is_query=False)

# Compute similarities between the query and every snippet
similarities = (query_emb @ code_embs.T).squeeze(0)
print(f"Query: {query}")
for code, sim in zip(code_snippets, similarities):
    print(f"  [{sim:.4f}] {code[:50]}...")
```

## Instruction Templates

For optimal performance, prefix each query with the instruction that matches the task (`{query}` is the placeholder for the query text):

| Task | Instruction Template |
|------|---------------------|
| NL → Code | `Instruct: Find the most relevant code snippet given the following query:\nQuery: {query}` |
| Code → Code | `Instruct: Find an equivalent code snippet given the following code snippet:\nQuery: {query}` |
| Tech Q&A | `Instruct: Find the most relevant answer given the following question:\nQuery: {query}` |
| Text → SQL | `Instruct: Given a natural language question and schema, find the corresponding SQL query:\nQuery: {query}` |

**Note**: Document/corpus texts do NOT need instruction prefixes.

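
The templates in the table above can be applied with a small helper such as the sketch below. The dictionary keys (`nl2code`, `code2code`, `qa`, `text2sql`) and the function name `format_query` are illustrative names, not part of the model's API; only the instruction strings come from the table.

```python
# Instruction strings copied from the table above; the keys are illustrative names.
INSTRUCTIONS = {
    "nl2code": "Find the most relevant code snippet given the following query:",
    "code2code": "Find an equivalent code snippet given the following code snippet:",
    "qa": "Find the most relevant answer given the following question:",
    "text2sql": "Given a natural language question and schema, find the corresponding SQL query:",
}

def format_query(text, task="nl2code"):
    """Prefix a query with the instruction for its task.

    Only queries are formatted this way; corpus documents are embedded unchanged.
    """
    return f"Instruct: {INSTRUCTIONS[task]}\nQuery: {text}"

print(format_query("How to sort a list in Python"))
print(format_query("SELECT the ten most recent orders", task="text2sql"))
```

The formatted strings can then be passed to whatever encoding function is in use, while document texts skip this step entirely.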
## Training Details

Training followed a two-stage approach:

**Stage 1: Embedding Conversion** (8.8M samples):
Converted Qwen2.5-Coder-0.5B from a causal language model to a bidirectional embedding model. The 8.8M training samples span CoRNStack (Python, Java, JavaScript, Go, Ruby, PHP), CoderPile, StackOverflow, and synthetic data, with mined hard negatives.

**Stage 2: Hard Negative Refinement** (100K samples):
Continued fine-tuning on a curated 100K-sample subset with hard negatives.

- **Base Model**: [Qwen2.5-Coder-0.5B](https://huggingface.co/Qwen/Qwen2.5-Coder-0.5B)
- **Architecture**: Bidirectional attention across all 24 layers, mean pooling, L2 normalization
- **Loss**: InfoNCE with temperature τ=0.05
- **Effective Batch Size**: 1024 (via GradCache)
- **Hardware**: NVIDIA H100 (95GB)

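
To make the loss concrete, here is a minimal pure-Python sketch of InfoNCE with temperature τ=0.05. It assumes in-batch negatives only (each query's positive is the document at the same index, and all other documents in the batch act as negatives) and L2-normalized embeddings. The mined hard negatives and GradCache batching mentioned above are not modeled, and the function name `info_nce` is illustrative rather than taken from the training code.

```python
import math

def info_nce(query_embs, doc_embs, temperature=0.05):
    """Mean InfoNCE loss over a batch, using in-batch negatives.

    query_embs[i] and doc_embs[i] form the positive pair; every other
    document in the batch is a negative for query i. Embeddings are
    assumed to be L2-normalized, so the dot product is cosine similarity.
    """
    losses = []
    for i, q in enumerate(query_embs):
        # Temperature-scaled similarity of query i to every document.
        logits = [sum(a * b for a, b in zip(q, d)) / temperature for d in doc_embs]
        # Numerically stable log-sum-exp over all candidates.
        m = max(logits)
        log_z = m + math.log(sum(math.exp(x - m) for x in logits))
        # Cross-entropy with the positive document as the target class.
        losses.append(log_z - logits[i])
    return sum(losses) / len(losses)

# Toy 2-D embeddings (hypothetical): aligned pairs vs. swapped pairs.
queries = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[1.0, 0.0], [0.0, 1.0]]
swapped = [[0.0, 1.0], [1.0, 0.0]]
print(info_nce(queries, aligned))  # close to 0
print(info_nce(queries, swapped))  # close to 20
```

A low temperature such as 0.05 multiplies cosine similarities by 20 before the softmax, so even small similarity gaps between the positive and the negatives produce a strong training signal.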
## Limitations

- Strongest on Python; other languages show lower but competitive performance
- Weaker on competitive programming tasks (APPS), where solutions are long relative to the 512-token training context
- May not generalize to low-resource programming languages not seen in training

## Citation

```bibtex
@misc{codecompass2026,
  author = {Faisal Mumtaz},
  title = {CodeCompass-Embed: A Code Embedding Model for Semantic Code Search},
  year = {2026},
  publisher = {Hugging Face},
  url = {https://huggingface.co/faisalmumtaz/codecompass-embed}
}
```

## License

Apache 2.0