---
tags:
- sentence-transformers
- sentence-similarity
- feature-extraction
- dense
- generated_from_trainer
- dataset_size:800640
- loss:MultipleNegativesRankingLoss
base_model: Shuu12121/Owl-ph2-base-len2048
pipeline_tag: sentence-similarity
library_name: sentence-transformers
---

# Shuu12121/Owl-ph2-len2048 🦉

## Model Details

### Model Description

- **Model Type:** Sentence Transformer
- **Base model:** [Shuu12121/Owl-ph2-base-len2048](https://huggingface.co/Shuu12121/Owl-ph2-base-len2048)
- **Maximum Sequence Length:** 1024 tokens (2048 tokens during pretraining)
- **Output Dimensionality:** 768
- **Similarity Function:** Cosine Similarity

This model is a SentenceTransformer variant of **Shuu12121/Owl-ph2-base-len2048**.
It was trained on the **Owl corpus** for **code search** and **code-text retrieval**.

The training data consists of roughly **100,000 samples per language** (**800,640 pairs** in total), and the model was trained for **1 epoch** with a **learning rate of 1e-5**.

### Model Sources

- **Base model:** [Shuu12121/Owl-ph2-base-len2048](https://huggingface.co/Shuu12121/Owl-ph2-base-len2048)
- **Sentence Transformers:** [Sentence Transformers Documentation](https://sbert.net)

### Full Model Architecture

```text
SentenceTransformer(
  (0): Transformer({'max_seq_length': 1024, 'do_lower_case': False, 'architecture': 'ModernBertModel'})
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
)
```

## Intended Uses

This model is intended for:

* code search
* code-text retrieval
* semantic similarity
* dense embedding generation for source code and natural language

## Usage

### Direct Usage (Sentence Transformers)

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Shuu12121/Owl-ph2-len2048")
```

## Training Details

### Training Dataset

This model was trained on the [**Owl corpus**](https://huggingface.co/collections/Shuu12121/codesearch-datasets), a dataset constructed for code search and code-text retrieval.

The training set contains approximately **100,000 samples per language**, resulting in **800,640 training pairs** in total.

### Training Hyperparameters

* **Learning rate:** 1e-5
* **Epochs:** 1
* **Loss:** MultipleNegativesRankingLoss

## Integrations

### Owl-CLI

This model is used as the embedding model in **[Owl-CLI](https://github.com/Shun0212/Owl-CLI)**, a command-line tool for semantic code search.

Owl-CLI indexes source code at the **function level**, generates dense embeddings using this model, and performs **vector similarity search** to retrieve relevant code for natural language queries.

Key features of Owl-CLI include:

- **Semantic code search** using dense embeddings
- **Function-level indexing** with file paths and line numbers
- **Automatic indexing** on first search
- **Differential embedding cache** to avoid re-embedding unchanged files
- **JSON output** for tool integration
- **MCP server support** for integration with AI coding agents (e.g., Claude Code)

Repository: https://github.com/Shun0212/Owl-CLI
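
The retrieval flow described above (embed a natural language query and a set of functions, then rank by cosine similarity) can be sketched roughly as follows. This is a minimal illustration and not Owl-CLI's actual implementation: the helper names `cosine`, `rank_snippets`, and `demo`, the example snippets, and the query string are all hypothetical. The only library calls assumed are `SentenceTransformer(...)` and `model.encode(...)`; the cosine similarity is written out in plain Python for clarity, although `sentence_transformers.util.cos_sim` could be used instead.

```python
import math
from typing import List, Sequence, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return float(dot / (norm_u * norm_v))


def rank_snippets(
    model, query: str, snippets: List[str], top_k: int = 3
) -> List[Tuple[float, str]]:
    """Embed the query and snippets with `model.encode`, then rank by cosine similarity."""
    query_emb = model.encode([query])[0]
    snippet_embs = model.encode(snippets)
    scored = [(cosine(query_emb, emb), s) for emb, s in zip(snippet_embs, snippets)]
    # Highest similarity first; keep only the top_k results.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


def demo() -> None:
    """Hypothetical end-to-end example; downloads the model on first use."""
    # Imported lazily so the helpers above work without the library installed.
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("Shuu12121/Owl-ph2-len2048")
    snippets = [
        "def add(a, b):\n    return a + b",
        "def read_text(path):\n    with open(path) as f:\n        return f.read()",
    ]
    for score, snippet in rank_snippets(model, "read a file into a string", snippets):
        # Print the score next to the first line (the signature) of each snippet.
        print(f"{score:.3f}  {snippet.splitlines()[0]}")
```

Calling `demo()` loads the model and prints the ranked snippets. `rank_snippets` accepts any object exposing an `encode` method that maps a list of strings to a list of vectors, so the ranking logic can be tried with a stub encoder before downloading the model.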