---
license: apache-2.0
language:
- en
- code
library_name: transformers
tags:
- code
- embeddings
- retrieval
- code-search
- semantic-search
- feature-extraction
- sentence-transformers
datasets:
- code-rag-bench/cornstack
- bigcode/stackoverflow
- code_search_net
pipeline_tag: feature-extraction
base_model: Qwen/Qwen2.5-Coder-0.5B
model-index:
- name: CodeCompass-Embed
  results:
  - task:
      type: retrieval
      name: Code Retrieval
    dataset:
      type: CoIR-Retrieval/CodeSearchNet-python
      name: CodeSearchNet Python
    metrics:
    - type: ndcg@10
      value: 0.979
      name: NDCG@10
    - type: mrr@10
      value: 0.976
      name: MRR@10
  - task:
      type: retrieval
      name: Code Translation
    dataset:
      type: CoIR-Retrieval/codetrans-dl
      name: CodeTrans-DL
    metrics:
    - type: ndcg@10
      value: 0.286
      name: NDCG@10
---
# CodeCompass-Embed
**CodeCompass-Embed** is a 494M-parameter embedding model for semantic code search and retrieval, trained on 86B tokens total. It produces 896-dimensional embeddings optimized for matching natural language queries to code across Python, Java, JavaScript, Go, Ruby, and PHP. On the [CoIR code retrieval benchmark](https://github.com/CoIR-team/coir) it reaches the highest CodeSearchNet-Python score (NDCG@10 = 0.979) among the models compared in the benchmark table below.
## Model Highlights
- **Code search from natural language**: find relevant code snippets across Python, Java, JavaScript, Go, Ruby, PHP
- **Competitive with models from 109M to 568M parameters**: 494M params, 896-dim embeddings
- **Bidirectional attention**: all 24 layers converted from causal for better embedding quality
- **Lightweight**: runs on consumer GPUs, trained at 512 tokens with RoPE extrapolation for longer inputs
- **Versatile**: supports NL→Code, Code→Code, Q&A, and Text→SQL retrieval via instruction templates
## Model Details
| Property | Value |
|----------|-------|
| Base Model | Qwen2.5-Coder-0.5B |
| Parameters | 494M |
| Embedding Dimension | 896 |
| Max Sequence Length | 512 (training) / 32K (inference) |
| Pooling | Mean |
| Normalization | L2 |
| Attention | Bidirectional (all 24 layers) |
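
The Pooling and Normalization rows can be illustrated on toy numbers. The sketch below uses plain Python lists rather than the model's tensors, and the helper names `mean_pool` and `l2_normalize` are illustrative only; they are not part of the model or of any library.

```python
import math

def mean_pool(token_vectors, attention_mask):
    """Average token vectors, skipping positions where the mask is 0 (padding)."""
    dim = len(token_vectors[0])
    kept = [vec for vec, m in zip(token_vectors, attention_mask) if m == 1]
    count = max(len(kept), 1)  # guard against an all-padding sequence
    return [sum(vec[i] for vec in kept) / count for i in range(dim)]

def l2_normalize(vector):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vector)) or 1.0
    return [x / norm for x in vector]

# Three token positions with 2-dim hidden states; the last position is padding.
tokens = [[1.0, 2.0], [5.0, 6.0], [9.0, 9.0]]
mask = [1, 1, 0]

pooled = mean_pool(tokens, mask)   # average of the first two rows only
embedding = l2_normalize(pooled)   # unit-length vector
print(pooled, embedding)
```

Because the embeddings are unit length, the dot product of two embeddings equals their cosine similarity.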
## Benchmark Results (CoIR)
Evaluated on the [CoIR Benchmark](https://github.com/CoIR-team/coir) (ACL 2025). All scores are NDCG@10. Rows are sorted by CSN-Python; Avg is the mean of the six task columns shown.
| Model | Params | CSN-Py | CodeTrans | Text2SQL | SO-QA | CodeFeedback | Apps | Avg |
|-------|--------|--------|-----------|----------|-------|--------------|------|-----|
| CodeCompass-Embed (ours) | 494M | **0.979** | **0.286** | 0.736 | 0.834 | **0.814** | **0.349** | 0.666 |
| SFR-Embedding-Code | 400M | 0.951 | 0.268 | **0.995** | **0.911** | 0.726 | 0.221 | **0.679** |
| Jina-Code-v2 | 161M | 0.944 | 0.274 | 0.517 | 0.887 | 0.698 | 0.154 | 0.579 |
| CodeRankEmbed | 137M | 0.938 | 0.260 | 0.769 | 0.899 | 0.717 | 0.199 | 0.630 |
| Snowflake-Arctic-Embed-L | 568M | 0.915 | 0.196 | 0.540 | 0.872 | 0.650 | 0.144 | 0.553 |
| BGE-M3 | 568M | 0.898 | 0.219 | 0.573 | 0.850 | 0.644 | 0.145 | 0.555 |
| BGE-Base-en-v1.5 | 109M | 0.894 | 0.213 | 0.527 | 0.858 | 0.642 | 0.142 | 0.546 |
| CodeT5+-110M | 110M | 0.870 | 0.179 | 0.328 | 0.815 | 0.580 | 0.118 | 0.482 |
### Multi-Language Code Search (CodeSearchNet)
| Language | NDCG@10 | MRR@10 |
|----------|---------|--------|
| **Python** | **0.979** | **0.976** |
| Go | 0.797 | 0.767 |
| Java | 0.639 | 0.600 |
| PHP | 0.627 | 0.585 |
| JavaScript | 0.621 | 0.578 |
| Ruby | 0.579 | 0.535 |
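
For readers unfamiliar with the two metrics, here is a minimal per-query sketch. It assumes linear gain, that the ranked list contains every judged item for the query, and the function names are illustrative. The reported numbers come from the CoIR evaluation tooling, not from this code.

```python
import math

def ndcg_at_k(ranked_relevances, k=10):
    """NDCG@k for one query; ranked_relevances[i] is the relevance at rank i + 1."""
    def dcg(rels):
        # Rank r (1-based) is discounted by log2(r + 1).
        return sum(rel / math.log2(idx + 2) for idx, rel in enumerate(rels[:k]))
    ideal = dcg(sorted(ranked_relevances, reverse=True))
    return dcg(ranked_relevances) / ideal if ideal > 0 else 0.0

def mrr_at_k(ranked_relevances, k=10):
    """Reciprocal rank of the first relevant item within the top k, else 0."""
    for rank, rel in enumerate(ranked_relevances[:k], start=1):
        if rel > 0:
            return 1.0 / rank
    return 0.0

# One relevant snippet, retrieved at rank 2 out of 5 candidates.
ranking = [0, 1, 0, 0, 0]
print(ndcg_at_k(ranking), mrr_at_k(ranking))
```

Benchmark scores are the mean of these per-query values over all queries in a task. With a single relevant document per query, NDCG@10 is at least as large as MRR@10, which is the pattern the table shows.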
### Full Results (All 12 Tasks)
| Task | NDCG@10 | MRR@10 |
|------|---------|--------|
| codesearchnet-python | 0.979 | 0.976 |
| stackoverflow-qa | 0.834 | 0.810 |
| codefeedback-st | 0.814 | 0.775 |
| codesearchnet-go | 0.797 | 0.767 |
| synthetic-text2sql | 0.736 | 0.662 |
| codesearchnet-java | 0.639 | 0.600 |
| codesearchnet-php | 0.627 | 0.585 |
| codesearchnet-javascript | 0.621 | 0.578 |
| codesearchnet-ruby | 0.579 | 0.535 |
| apps | 0.349 | 0.307 |
| codetrans-dl | 0.286 | 0.164 |
| cosqa | 0.209 | 0.165 |
| **Average (12 tasks)** | **0.623** | **0.577** |
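
The 12-task average is an unweighted (macro) mean over tasks, which can be checked from the rounded per-task values in the table. The dictionary below simply copies the NDCG@10 column.

```python
# NDCG@10 per task, copied from the table above (already rounded to 3 decimals).
ndcg_at_10 = {
    "codesearchnet-python": 0.979,
    "stackoverflow-qa": 0.834,
    "codefeedback-st": 0.814,
    "codesearchnet-go": 0.797,
    "synthetic-text2sql": 0.736,
    "codesearchnet-java": 0.639,
    "codesearchnet-php": 0.627,
    "codesearchnet-javascript": 0.621,
    "codesearchnet-ruby": 0.579,
    "apps": 0.349,
    "codetrans-dl": 0.286,
    "cosqa": 0.209,
}

# Unweighted mean: every task counts equally regardless of its query count.
macro_avg = sum(ndcg_at_10.values()) / len(ndcg_at_10)
print(f"macro NDCG@10 over {len(ndcg_at_10)} tasks: {macro_avg:.4f}")
```

The result agrees with the reported 0.623 to within the rounding of the per-task inputs. It is lower than the 0.666 in the comparison table because that figure averages only six of the twelve tasks.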
## Usage
### With Transformers
```python
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

# Load model
model = AutoModel.from_pretrained("faisalmumtaz/codecompass-embed", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("faisalmumtaz/codecompass-embed")

# CRITICAL: Enable bidirectional attention for embeddings
for layer in model.model.layers:
    layer.self_attn.is_causal = False
model.eval()

def encode(texts, is_query=False):
    # Add instruction prefix for queries only; documents are encoded as-is
    if is_query:
        texts = [f"Instruct: Find the most relevant code snippet given the following query:\nQuery: {t}" for t in texts]
    inputs = tokenizer(texts, padding=True, truncation=True, max_length=512, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs, output_hidden_states=True)
    hidden = outputs.hidden_states[-1]
    # Mean pooling over non-padding tokens
    mask = inputs["attention_mask"].unsqueeze(-1).float()
    embeddings = (hidden * mask).sum(1) / mask.sum(1).clamp(min=1e-9)
    # L2 normalize so that dot product equals cosine similarity
    embeddings = F.normalize(embeddings, p=2, dim=-1)
    return embeddings

# Example: Code Search
query = "How to sort a list in Python"
code_snippets = [
    "def sort_list(lst):\n    return sorted(lst)",
    "def add_numbers(a, b):\n    return a + b",
    "def reverse_string(s):\n    return s[::-1]",
]
query_emb = encode([query], is_query=True)
code_embs = encode(code_snippets, is_query=False)

# Compute similarities
similarities = (query_emb @ code_embs.T).squeeze()
print(f"Query: {query}")
for code, sim in zip(code_snippets, similarities):
    print(f"  [{sim:.4f}] {code[:50]}...")
```
## Instruction Templates
For optimal performance, use these instruction prefixes for queries:
| Task | Instruction Template |
|------|---------------------|
| NL → Code | `Instruct: Find the most relevant code snippet given the following query:\nQuery: {query}` |
| Code → Code | `Instruct: Find an equivalent code snippet given the following code snippet:\nQuery: {query}` |
| Tech Q&A | `Instruct: Find the most relevant answer given the following question:\nQuery: {query}` |
| Text → SQL | `Instruct: Given a natural language question and schema, find the corresponding SQL query:\nQuery: {query}` |
**Note**: Document/corpus texts do NOT need instruction prefixes.
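
One way to apply the templates is a small lookup helper. In this sketch the instruction strings are copied from the table, while the task keys (`nl2code`, `code2code`, `qa`, `text2sql`) and the function names are illustrative assumptions, not identifiers shipped with the model.

```python
# Instruction strings from the table above; the dictionary keys are hypothetical.
INSTRUCTIONS = {
    "nl2code": "Find the most relevant code snippet given the following query:",
    "code2code": "Find an equivalent code snippet given the following code snippet:",
    "qa": "Find the most relevant answer given the following question:",
    "text2sql": "Given a natural language question and schema, find the corresponding SQL query:",
}

def format_query(text, task="nl2code"):
    """Prefix a query with the instruction for the given task."""
    return f"Instruct: {INSTRUCTIONS[task]}\nQuery: {text}"

def format_document(text):
    """Documents and corpus entries are embedded without any prefix."""
    return text

print(format_query("parse a JSON file", task="nl2code"))
```

The formatted query string is what gets tokenized. The `\n` in the table is a real newline character in the final string, not a literal backslash followed by `n`.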
## Training Details
Training followed a two-stage approach:
**Stage 1 — Embedding Conversion** (8.8M samples):
Converted Qwen2.5-Coder-0.5B from a causal language model to a bidirectional embedding model. Trained on 8.8M samples spanning CoRNStack (Python, Java, JavaScript, Go, Ruby, PHP), CoderPile, StackOverflow, and synthetic data with mined hard negatives.
**Stage 2 — Hard Negative Refinement** (100K samples):
Continued fine-tuning on a curated 100K-sample subset with hard negatives.
- **Base Model**: [Qwen2.5-Coder-0.5B](https://huggingface.co/Qwen/Qwen2.5-Coder-0.5B)
- **Architecture**: Bidirectional attention across all 24 layers, mean pooling, L2 normalization
- **Loss**: InfoNCE with temperature τ=0.05
- **Effective Batch Size**: 1024 (via GradCache)
- **Hardware**: NVIDIA H100 (95GB)
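
For reference, the InfoNCE objective with τ=0.05 can be sketched as below. This is the textbook form with in-batch negatives, written in plain Python for readability. The actual training code, including how mined hard negatives and GradCache were wired in, is not described in this card, so treat everything beyond the loss name and the temperature as an assumption.

```python
import math

def info_nce(query_embs, doc_embs, temperature=0.05):
    """Mean InfoNCE loss with in-batch negatives.

    query_embs[i] is paired with doc_embs[i]; every other row of doc_embs
    serves as a negative for query i. Inputs are assumed L2-normalized, so
    the dot product is the cosine similarity.
    """
    losses = []
    for i, q in enumerate(query_embs):
        logits = [sum(a * b for a, b in zip(q, d)) / temperature for d in doc_embs]
        max_logit = max(logits)  # subtract the max for numerical stability
        log_denom = max_logit + math.log(sum(math.exp(l - max_logit) for l in logits))
        losses.append(log_denom - logits[i])  # negative log-softmax of the positive
    return sum(losses) / len(losses)

queries = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[1.0, 0.0], [0.0, 1.0]]   # each query matches its own document
swapped = [[0.0, 1.0], [1.0, 0.0]]   # each query matches the other document

print(info_nce(queries, aligned), info_nce(queries, swapped))
```

With a low temperature such as 0.05, similarities are scaled by 20 before the softmax, so even small cosine gaps between the positive and a negative produce a strong gradient. Hard negatives would appear as extra documents scored against each query.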
## Limitations
- Strongest on Python (NDCG@10 0.979); the other five CodeSearchNet languages score lower, from 0.579 (Ruby) to 0.797 (Go)
- Weaker on competitive programming tasks (APPS, NDCG@10 0.349), likely because solutions are often longer than the 512-token training context
- May not generalize to low-resource programming languages not seen in training
## Citation
```bibtex
@misc{codecompass2026,
  author = {Faisal Mumtaz},
  title = {CodeCompass-Embed: A Code Embedding Model for Semantic Code Search},
  year = {2026},
  publisher = {Hugging Face},
  url = {https://huggingface.co/faisalmumtaz/codecompass-embed}
}
```
## License
Apache 2.0