---
language:
- en
license: apache-2.0
tags:
- code
- similarity
- ballerina
- sentence-transformers
- sentence-similarity
base_model: nomic-ai/CodeRankEmbed
pipeline_tag: sentence-similarity
---

# Prompt-Code Similarity Model

This model predicts the similarity between problem prompts and Ballerina code completions.

**Use Case:** Given a coding problem prompt, evaluate how well a Ballerina solution matches the requirements.

**Training:**
- Passing code (test_reward > 0.25) → HIGH similarity (target = 1.0)
- Failing code (test_reward ≤ 0.25) → LOW similarity (target = 0.0)
- Only compiled code used (error_reward ≥ 0.25)

## Architecture

Based on [CodeRankEmbed](https://huggingface.co/nomic-ai/CodeRankEmbed) with a binary similarity head.

- **Base Model:** nomic-ai/CodeRankEmbed
- **Task:** Binary similarity prediction
- **Loss:** BCELoss (Binary Cross-Entropy)
- **Output:** Similarity score in [0, 1]

## Usage

```python
import torch
from sentence_transformers import SentenceTransformer
from transformers import AutoTokenizer

# Load the tokenizer of the base model
tokenizer = AutoTokenizer.from_pretrained("nomic-ai/CodeRankEmbed", trust_remote_code=True)

# Load your trained model into `model` before running the inference below.
# (You'll need the SimilarityModel class and its state dict; neither is
# defined in this snippet.)

# Example inference: the two texts to compare
# (a problem prompt and a Ballerina code completion)
code1 = "..."
code2 = "..."

# Tokenize both inputs
enc1 = tokenizer(code1, return_tensors='pt', padding=True, truncation=True, max_length=512)
enc2 = tokenizer(code2, return_tensors='pt', padding=True, truncation=True, max_length=512)

# Compute similarity without tracking gradients
with torch.no_grad():
    similarity = model(enc1['input_ids'], enc1['attention_mask'],
                       enc2['input_ids'], enc2['attention_mask'])

print(f"Similarity: {similarity.item():.3f}")
# > 0.7: High similarity (likely to pass)
# < 0.3: Low similarity (likely to fail)
```

## Training Details

- **Dataset:** Ballerina code completion logs
- **Filtering:** error_reward ≥ 0.25 (compiled successfully)
- **Labeling:** test_reward > 0.25 → Pass, else Fail
- **Optimizer:** AdamW
- **Learning Rate:** 2e-5
- **Batch Size:** 8
- **Epochs:** 5

## Limitations

- Trained specifically on Ballerina code
- Performance may vary on unseen problem types
- Expects code that compiles (syntax errors will affect predictions)

## Citation

```bibtex
@misc{coderankembed2024,
  title={CodeRankEmbed: Code Similarity Learning},
  author={Nomic AI},
  year={2024},
  url={https://huggingface.co/nomic-ai/CodeRankEmbed}
}
```
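
## Example: Labeling Rule and Score Interpretation

The filtering and labeling rules under Training Details, and the score thresholds noted in the Usage comments, can be written out as two small functions. This is a minimal sketch, not the training code: the function names `similarity_target` and `interpret_score`, the constant names, and the "uncertain" label for scores between 0.3 and 0.7 are assumptions made for illustration. Only the threshold values come from this card.

```python
from typing import Optional

# Threshold values quoted in this model card; the constant names are hypothetical.
COMPILE_THRESHOLD = 0.25  # error_reward >= 0.25 means the code compiled
PASS_THRESHOLD = 0.25     # test_reward > 0.25 means the code passed
HIGH_SCORE = 0.7          # score > 0.7: likely to pass
LOW_SCORE = 0.3           # score < 0.3: likely to fail


def similarity_target(error_reward: float, test_reward: float) -> Optional[float]:
    """Return the binary training target, or None if the sample is filtered out."""
    if error_reward < COMPILE_THRESHOLD:
        return None  # code did not compile, so it is excluded from training
    return 1.0 if test_reward > PASS_THRESHOLD else 0.0


def interpret_score(score: float) -> str:
    """Map a similarity score in [0, 1] to a coarse verdict."""
    if score > HIGH_SCORE:
        return "likely to pass"
    if score < LOW_SCORE:
        return "likely to fail"
    # Assumption: the card does not name the band between the two thresholds.
    return "uncertain"
```

For example, `similarity_target(1.0, 0.8)` yields `1.0`, `similarity_target(1.0, 0.25)` yields `0.0` because the pass threshold is strict, and `similarity_target(0.1, 0.9)` yields `None` because the sample fails the compile filter.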