
	---
	license: apache-2.0
	language:
	- en
	- code
	library_name: transformers
	tags:
	- code
	- embeddings
	- retrieval
	- code-search
	- semantic-search
	- feature-extraction
	- sentence-transformers
	datasets:
	- code-rag-bench/cornstack
	- bigcode/stackoverflow
	- code_search_net
	pipeline_tag: feature-extraction
	base_model: Qwen/Qwen2.5-Coder-0.5B
	model-index:
	- name: CodeCompass-Embed
	  results:
	  - task:
	      type: retrieval
	      name: Code Retrieval
	    dataset:
	      type: CoIR-Retrieval/CodeSearchNet-python
	      name: CodeSearchNet Python
	    metrics:
	    - type: ndcg@10
	      value: 0.979
	      name: NDCG@10
	    - type: mrr@10
	      value: 0.976
	      name: MRR@10
	  - task:
	      type: retrieval
	      name: Code Translation
	    dataset:
	      type: CoIR-Retrieval/codetrans-dl
	      name: CodeTrans-DL
	    metrics:
	    - type: ndcg@10
	      value: 0.286
	      name: NDCG@10
	---

	# CodeCompass-Embed

	CodeCompass-Embed is a 494M-parameter embedding model for semantic code search and retrieval, trained on a total of 86B tokens. It produces 896-dimensional embeddings optimized for matching natural language queries to code across Python, Java, JavaScript, Go, Ruby, and PHP. On the [CoIR code retrieval benchmark](https://github.com/CoIR-team/coir), it reaches NDCG@10 = 0.979 on CodeSearchNet-Python, the highest score among the models compared below.

	## Model Highlights

	- Code search from natural language — find relevant code snippets across Python, Java, JavaScript, Go, Ruby, PHP
	- Competitive across model sizes — 494M params and 896-dim embeddings, compared against baselines ranging from 109M to 568M params
	- Bidirectional attention — all 24 layers converted from causal for better embedding quality
	- Lightweight — runs on consumer GPUs, trained at 512 tokens with RoPE extrapolation for longer inputs
	- Versatile — supports NL→Code, Code→Code, Q&A, and Text→SQL retrieval via instruction templates
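
To make the bidirectional attention point concrete, the sketch below contrasts a causal attention mask with a bidirectional one for a single padded sequence. It is a conceptual illustration only: the helper names `causal_mask` and `bidirectional_mask` are hypothetical and are not part of this model's code or of Transformers.

```python
def causal_mask(attention_mask):
    """Token i may attend to token j only if j <= i and j is not padding."""
    n = len(attention_mask)
    return [[j <= i and attention_mask[j] == 1 for j in range(n)] for i in range(n)]


def bidirectional_mask(attention_mask):
    """Every token may attend to every non-padding token, in both directions."""
    n = len(attention_mask)
    return [[attention_mask[j] == 1 for j in range(n)] for _ in range(n)]


# Three real tokens followed by one padding token.
mask = [1, 1, 1, 0]
causal = causal_mask(mask)
bidirectional = bidirectional_mask(mask)

print(causal[0])         # [True, False, False, False]: the first token sees only itself
print(bidirectional[0])  # [True, True, True, False]: the first token sees all real tokens
```

With a causal mask, early tokens never see later context, so a mean over token states mixes representations built from unequal amounts of context. A bidirectional mask gives every token the full sequence.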

	## Model Details

	| Property | Value |
	|----------|-------|
	| Base Model | Qwen2.5-Coder-0.5B |
	| Parameters | 494M |
	| Embedding Dimension | 896 |
	| Max Sequence Length | 512 (training) / 32K (inference) |
	| Pooling | Mean |
	| Normalization | L2 |
	| Attention | Bidirectional (all 24 layers) |

	## Benchmark Results (CoIR)

	Evaluated on the [CoIR Benchmark](https://github.com/CoIR-team/coir) (ACL 2025). All scores are NDCG@10. Rows are sorted by CSN-Python score, and Avg is the mean of the six task columns shown.

	| Model | Params | CSN-Py | CodeTrans | Text2SQL | SO-QA | CodeFeedback | Apps | Avg |
	|-------|--------|--------|-----------|----------|-------|--------------|------|-----|
	| CodeCompass-Embed (ours) | 494M | 0.979 | 0.286 | 0.736 | 0.834 | 0.814 | 0.349 | 0.666 |
	| SFR-Embedding-Code | 400M | 0.951 | 0.268 | 0.995 | 0.911 | 0.726 | 0.221 | 0.679 |
	| Jina-Code-v2 | 161M | 0.944 | 0.274 | 0.517 | 0.887 | 0.698 | 0.154 | 0.579 |
	| CodeRankEmbed | 137M | 0.938 | 0.260 | 0.769 | 0.899 | 0.717 | 0.199 | 0.630 |
	| Snowflake-Arctic-Embed-L | 568M | 0.915 | 0.196 | 0.540 | 0.872 | 0.650 | 0.144 | 0.553 |
	| BGE-M3 | 568M | 0.898 | 0.219 | 0.573 | 0.850 | 0.644 | 0.145 | 0.555 |
	| BGE-Base-en-v1.5 | 109M | 0.894 | 0.213 | 0.527 | 0.858 | 0.642 | 0.142 | 0.546 |
	| CodeT5+-110M | 110M | 0.870 | 0.179 | 0.328 | 0.815 | 0.580 | 0.118 | 0.482 |

	### Multi-Language Code Search (CodeSearchNet)

	| Language | NDCG@10 | MRR@10 |
	|----------|---------|--------|
	| Python | 0.979 | 0.976 |
	| Go | 0.797 | 0.767 |
	| Java | 0.639 | 0.600 |
	| PHP | 0.627 | 0.585 |
	| JavaScript | 0.621 | 0.578 |
	| Ruby | 0.579 | 0.535 |

	### Full Results (All 12 Tasks)

	| Task | NDCG@10 | MRR@10 |
	|------|---------|--------|
	| codesearchnet-python | 0.979 | 0.976 |
	| stackoverflow-qa | 0.834 | 0.810 |
	| codefeedback-st | 0.814 | 0.775 |
	| codesearchnet-go | 0.797 | 0.767 |
	| synthetic-text2sql | 0.736 | 0.662 |
	| codesearchnet-java | 0.639 | 0.600 |
	| codesearchnet-php | 0.627 | 0.585 |
	| codesearchnet-javascript | 0.621 | 0.578 |
	| codesearchnet-ruby | 0.579 | 0.535 |
	| apps | 0.349 | 0.307 |
	| codetrans-dl | 0.286 | 0.164 |
	| cosqa | 0.209 | 0.165 |
	| Average (12 tasks) | 0.623 | 0.577 |
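
For readers unfamiliar with the two metrics reported above, the following is a minimal sketch of how NDCG@10 and MRR@10 can be computed for a single query under binary relevance. The function names and document IDs are illustrative assumptions; the benchmark's own evaluation code may differ in details such as graded relevance or tie handling.

```python
import math


def ndcg_at_k(ranked_ids, relevant_ids, k=10):
    """Binary-relevance NDCG@k: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc_id in enumerate(ranked_ids[:k], start=1)
        if doc_id in relevant_ids
    )
    ideal_hits = min(len(relevant_ids), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0


def mrr_at_k(ranked_ids, relevant_ids, k=10):
    """Reciprocal rank of the first relevant document within the top k, else 0."""
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0


# Hypothetical ranking where the single relevant document lands at rank 2.
ranked = ["doc_b", "doc_a", "doc_c"]
relevant = {"doc_a"}

ndcg = ndcg_at_k(ranked, relevant)
mrr = mrr_at_k(ranked, relevant)
print(f"NDCG@10 = {ndcg:.4f}, MRR@10 = {mrr:.4f}")  # NDCG@10 = 0.6309, MRR@10 = 0.5000
```

Both metrics are averaged over all queries in a task. NDCG discounts a hit logarithmically by rank, while MRR discounts it by the reciprocal of the rank, which is why MRR@10 is the lower of the two in every row above.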

	## Usage

	### With Transformers

	```python
	import torch
	import torch.nn.functional as F
	from transformers import AutoModel, AutoTokenizer

	# Load model
	model = AutoModel.from_pretrained("faisalmumtaz/codecompass-embed", trust_remote_code=True)
	tokenizer = AutoTokenizer.from_pretrained("faisalmumtaz/codecompass-embed")

	# CRITICAL: Enable bidirectional attention for embeddings
	for layer in model.model.layers:
	    layer.self_attn.is_causal = False

	model.eval()

	def encode(texts, is_query=False):
	    """Return L2-normalized, mean-pooled embeddings for a list of strings."""
	    # Add instruction prefix for queries (documents are encoded as-is)
	    if is_query:
	        texts = [f"Instruct: Find the most relevant code snippet given the following query:\nQuery: {t}" for t in texts]

	    inputs = tokenizer(texts, padding=True, truncation=True, max_length=512, return_tensors="pt")

	    with torch.no_grad():
	        outputs = model(**inputs, output_hidden_states=True)
	        hidden = outputs.hidden_states[-1]

	    # Mean pooling over non-padding tokens
	    mask = inputs["attention_mask"].unsqueeze(-1).float()
	    embeddings = (hidden * mask).sum(1) / mask.sum(1).clamp(min=1e-9)

	    # L2 normalize so that dot product equals cosine similarity
	    embeddings = F.normalize(embeddings, p=2, dim=-1)

	    return embeddings

	# Example: Code Search
	query = "How to sort a list in Python"
	code_snippets = [
	    "def sort_list(lst):\n    return sorted(lst)",
	    "def add_numbers(a, b):\n    return a + b",
	    "def reverse_string(s):\n    return s[::-1]",
	]

	query_emb = encode([query], is_query=True)
	code_embs = encode(code_snippets, is_query=False)

	# Compute similarities (one score per snippet)
	similarities = (query_emb @ code_embs.T).squeeze(0)
	print(f"Query: {query}")
	for code, sim in zip(code_snippets, similarities):
	    print(f"  [{sim:.4f}] {code[:50]}...")
	```

	## Instruction Templates

	For optimal performance, use these instruction prefixes for queries:

	| Task | Instruction Template |
	|------|---------------------|
	| NL → Code | `Instruct: Find the most relevant code snippet given the following query:\nQuery: {query}` |
	| Code → Code | `Instruct: Find an equivalent code snippet given the following code snippet:\nQuery: {query}` |
	| Tech Q&A | `Instruct: Find the most relevant answer given the following question:\nQuery: {query}` |
	| Text → SQL | `Instruct: Given a natural language question and schema, find the corresponding SQL query:\nQuery: {query}` |

	Note: Document/corpus texts do NOT need instruction prefixes.
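
One way to apply the templates above is a small lookup helper. This is a sketch, not part of the released model code: the `INSTRUCTIONS` dictionary, its task keys (`nl2code`, `code2code`, `qa`, `text2sql`), and `format_query` are hypothetical names chosen for illustration. Only the instruction strings come from the table.

```python
# Instruction strings copied from the table; the dictionary keys are hypothetical.
INSTRUCTIONS = {
    "nl2code": "Find the most relevant code snippet given the following query:",
    "code2code": "Find an equivalent code snippet given the following code snippet:",
    "qa": "Find the most relevant answer given the following question:",
    "text2sql": "Given a natural language question and schema, find the corresponding SQL query:",
}


def format_query(query, task="nl2code"):
    """Prefix a query with the instruction for the given task (queries only)."""
    return f"Instruct: {INSTRUCTIONS[task]}\nQuery: {query}"


formatted = format_query("How to sort a list in Python")
print(formatted)
# Instruct: Find the most relevant code snippet given the following query:
# Query: How to sort a list in Python
```

Documents and corpus entries are passed to the encoder unchanged, so this helper should be applied on the query side only.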

	## Training Details

	Training followed a two-stage approach:

	Stage 1 — Embedding Conversion (8.8M samples):
	Converted Qwen2.5-Coder-0.5B from a causal language model to a bidirectional embedding model. Trained on 8.8M samples spanning CoRNStack (Python, Java, JavaScript, Go, Ruby, PHP), CoderPile, StackOverflow, and synthetic data with mined hard negatives.

	Stage 2 — Hard Negative Refinement (100K samples):
	Continued fine-tuning on a curated 100K-sample subset with hard negatives.

	- Base Model: [Qwen2.5-Coder-0.5B](https://huggingface.co/Qwen/Qwen2.5-Coder-0.5B)
	- Architecture: Bidirectional attention across all 24 layers, mean pooling, L2 normalization
	- Loss: InfoNCE with temperature τ=0.05
	- Effective Batch Size: 1024 (via GradCache)
	- Hardware: NVIDIA H100 (95GB)
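
As a rough illustration of the loss named above, the sketch below computes InfoNCE with temperature 0.05 in plain Python. It assumes L2-normalized embeddings and uses only in-batch negatives, meaning every other document in the batch serves as a negative. The mined hard negatives and GradCache batching used in actual training are not modeled, and `info_nce_loss` is a hypothetical name rather than the training code.

```python
import math


def info_nce_loss(query_embs, doc_embs, temperature=0.05):
    """Mean InfoNCE loss; doc_embs[i] is the positive for query_embs[i].

    Assumption: all other documents in the batch act as negatives, and
    embeddings are L2-normalized so the dot product is cosine similarity.
    """
    losses = []
    for i, query in enumerate(query_embs):
        logits = [
            sum(a * b for a, b in zip(query, doc)) / temperature
            for doc in doc_embs
        ]
        # Numerically stable log-sum-exp over all candidates.
        max_logit = max(logits)
        log_denominator = max_logit + math.log(
            sum(math.exp(logit - max_logit) for logit in logits)
        )
        # Negative log-probability of the positive document.
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)


queries = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[1.0, 0.0], [0.0, 1.0]]  # each query matches its own document
swapped = [[0.0, 1.0], [1.0, 0.0]]  # each query matches the wrong document

loss_aligned = info_nce_loss(queries, aligned)
loss_swapped = info_nce_loss(queries, swapped)
print(f"aligned: {loss_aligned:.4f}, swapped: {loss_swapped:.4f}")  # aligned: 0.0000, swapped: 20.0000
```

The low temperature sharpens the softmax: a cosine gap of 1.0 becomes a logit gap of 20, so the loss is near zero when positives are ranked first and grows large when a negative outranks the positive.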

	## Limitations

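	- Strongest on Python (NDCG@10 of 0.979 on CodeSearchNet); the other five CodeSearchNet languages score lower, between 0.579 (Ruby) and 0.797 (Go)
	- Weaker on competitive programming tasks (APPS, NDCG@10 of 0.349), likely because solutions are often longer than the 512-token training context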
	- May not generalize to low-resource programming languages not seen in training

	## Citation

	```bibtex
	@misc{codecompass2026,
	  author = {Faisal Mumtaz},
	  title = {CodeCompass-Embed: A Code Embedding Model for Semantic Code Search},
	  year = {2026},
	  publisher = {Hugging Face},
	  url = {https://huggingface.co/faisalmumtaz/codecompass-embed}
	}
	```

	## License

	Apache 2.0