| tags: | |
| - sentence-transformers | |
| - sentence-similarity | |
| - feature-extraction | |
| - dense | |
| - generated_from_trainer | |
| - dataset_size:800640 | |
| - loss:MultipleNegativesRankingLoss | |
| base_model: Shuu12121/Owl-ph2-base-len2048 | |
| pipeline_tag: sentence-similarity | |
| library_name: sentence-transformers | |
| # Shuu12121/Owl-ph2-len2048 🦉 | |
| ``` | |
|  ,______, | |
| ( O v O ) | |
|  /  V  \ | |
| /(     )\ | |
|  ^^   ^^ | |
| ``` | |
| ## Model Details | |
| ### Model Description | |
| - **Model Type:** Sentence Transformer | |
| - **Base model:** [Shuu12121/Owl-ph2-base-len2048](https://huggingface.co/Shuu12121/Owl-ph2-base-len2048) | |
| - **Maximum Sequence Length:** 1024 tokens (2048 tokens during pretraining) | |
| - **Output Dimensionality:** 768 | |
| - **Similarity Function:** Cosine Similarity | |
| This model is a SentenceTransformer variant of **Shuu12121/Owl-ph2-base-len2048**. | |
| It was trained on the **Owl corpus** for **code search** and **code-text retrieval**. | |
| The training data consists of roughly **100,000 samples per language** (**800,640 pairs** in total), and the model was trained for **1 epoch** with a **learning rate of 1e-5**. | |
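Since the card lists cosine similarity as the similarity function, a minimal pure-Python sketch of that measure may help. The toy 3-dimensional vectors below stand in for the model's 768-dimensional embeddings, and the names `cosine_similarity`, `query_vec`, `code_vec`, and `other_vec` are illustrative assumptions, not part of the model or library.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# Toy vectors standing in for real embeddings (hypothetical values).
query_vec = [1.0, 0.0, 1.0]
code_vec = [2.0, 0.0, 2.0]   # same direction as the query, different magnitude
other_vec = [0.0, 1.0, 0.0]  # orthogonal to the query
```

Because the measure ignores vector magnitude, `query_vec` and `code_vec` score as a near-perfect match, while `other_vec` scores zero.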
| ### Model Sources | |
| - **Base model:** [Shuu12121/Owl-ph2-base-len2048](https://huggingface.co/Shuu12121/Owl-ph2-base-len2048) | |
| - **Sentence Transformers:** [Sentence Transformers Documentation](https://sbert.net) | |
| ### Full Model Architecture | |
| ```text | |
| SentenceTransformer( | |
| (0): Transformer({'max_seq_length': 1024, 'do_lower_case': False, 'architecture': 'ModernBertModel'}) | |
| (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True}) | |
| ) | |
| ``` | |
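The pooling configuration above enables only `pooling_mode_cls_token`, so the sentence embedding is taken from the first token's vector rather than from an average over tokens. The sketch below illustrates the difference on plain Python lists; it is a simplified illustration, not the library's implementation, and the function names are hypothetical.

```python
def cls_pooling(token_embeddings):
    """Use the first token's vector as the embedding of the whole sequence."""
    if not token_embeddings:
        raise ValueError("expected at least one token embedding")
    return list(token_embeddings[0])


def mean_pooling(token_embeddings, attention_mask):
    """Average the unmasked token vectors (shown for contrast; disabled in this model)."""
    kept = [vec for vec, keep in zip(token_embeddings, attention_mask) if keep]
    if not kept:
        raise ValueError("attention mask excludes every token")
    dim = len(kept[0])
    return [sum(vec[d] for vec in kept) / len(kept) for d in range(dim)]


# Toy sequence: three tokens with 2-dimensional vectors; the last token is padding.
toy_tokens = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
toy_mask = [1, 1, 0]
```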
| ## Intended Uses | |
| This model is intended for: | |
| * code search | |
| * code-text retrieval | |
| * semantic similarity | |
| * dense embedding generation for source code and natural language | |
| ## Usage | |
| ### Direct Usage (Sentence Transformers) | |
| ```python | |
| from sentence_transformers import SentenceTransformer | |
| model = SentenceTransformer("Shuu12121/Owl-ph2-len2048") | |
| ``` | |
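The snippet above only loads the model. As a rough sketch of a code search call, the example below embeds one natural language query and several code snippets, then ranks the snippets by similarity. It assumes a sentence-transformers release that provides `SentenceTransformer.similarity`; the helper names `rank_snippets` and `search_code` are hypothetical.

```python
def rank_snippets(scores, snippets, top_k=3):
    """Pair each snippet with its score and return the top_k best matches."""
    if len(scores) != len(snippets):
        raise ValueError("scores and snippets must have the same length")
    ranked = sorted(zip(snippets, scores), key=lambda pair: pair[1], reverse=True)
    return ranked[:top_k]


def search_code(query, snippets, top_k=3, model_name="Shuu12121/Owl-ph2-len2048"):
    """Embed a natural language query and code snippets, then rank by similarity."""
    # Imported lazily so the ranking helper above works without the library installed.
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer(model_name)
    query_embedding = model.encode([query])
    snippet_embeddings = model.encode(snippets)
    # similarity() returns a [1, len(snippets)] tensor of scores.
    scores = model.similarity(query_embedding, snippet_embeddings)[0].tolist()
    return rank_snippets(scores, snippets, top_k)
```

Calling `search_code("read a JSON file", list_of_snippets)` downloads the model on first use and returns `(snippet, score)` pairs, best match first. Inputs longer than the 1024-token maximum sequence length are truncated by the library.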
| ## Training Details | |
| ### Training Dataset | |
| This model was trained on the [**Owl corpus**](https://huggingface.co/collections/Shuu12121/codesearch-datasets), a dataset constructed for code search and code-text retrieval. | |
| The training set contains approximately **100,000 samples per language**, resulting in **800,640 training pairs** in total. | |
| ### Training Hyperparameters | |
| * **Learning rate:** 1e-5 | |
| * **Epochs:** 1 | |
| * **Loss:** MultipleNegativesRankingLoss | |
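For readers unfamiliar with MultipleNegativesRankingLoss, the following pure-Python sketch shows the idea: within a batch of (anchor, positive) pairs, each anchor should score highest against its own positive, and the other positives in the batch act as negatives. The loss is cross-entropy over scaled cosine similarities. The `scale=20.0` value mirrors the sentence-transformers default and is an assumption here, since the card does not state the setting used in training.

```python
import math


def _cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def multiple_negatives_ranking_loss(anchors, positives, scale=20.0):
    """Mean cross-entropy where positives[i] is the correct match for anchors[i]."""
    if len(anchors) != len(positives) or not anchors:
        raise ValueError("need a non-empty batch of equal-length anchor/positive lists")
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [scale * _cosine(anchor, positive) for positive in positives]
        # Numerically stable log-sum-exp over all candidates in the batch.
        peak = max(logits)
        log_sum_exp = peak + math.log(sum(math.exp(x - peak) for x in logits))
        total += log_sum_exp - logits[i]
    return total / len(anchors)


# Toy batch: each anchor matches its own positive exactly.
aligned_loss = multiple_negatives_ranking_loss([[1.0, 0.0], [0.0, 1.0]],
                                               [[1.0, 0.0], [0.0, 1.0]])
# Same batch with the positives swapped, so every anchor prefers the wrong one.
swapped_loss = multiple_negatives_ranking_loss([[1.0, 0.0], [0.0, 1.0]],
                                               [[0.0, 1.0], [1.0, 0.0]])
```

The aligned batch yields a loss close to zero, while the swapped batch is penalized heavily, which is the signal that pulls matching code and text embeddings together.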
| --- | |
| ## Integrations | |
| ### 🦉 Owl-CLI: Semantic Code Search in Your Terminal | |
| > **Repository:** [https://github.com/Shun0212/Owl-CLI](https://github.com/Shun0212/Owl-CLI) | |
| **Owl-ph2-len2048** is the embedding backbone of **[Owl-CLI](https://github.com/Shun0212/Owl-CLI)**, a command-line tool for semantic code search powered by dense retrieval. | |
| Owl-CLI indexes your codebase at the **function level**, encodes each function using this model, and performs **vector similarity search** to find relevant code for natural language queries, directly from your terminal. | |
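To illustrate what function-level indexing can look like, the sketch below extracts each function from a Python source string together with its file path and line range, which is the kind of record that could then be embedded. This is a hypothetical, Python-only illustration built on the standard `ast` module; it is not Owl-CLI's actual implementation, and `extract_functions` and `EXAMPLE_SOURCE` are made-up names.

```python
import ast


def extract_functions(source, path):
    """Return one record per function: path, name, line range, and source text."""
    tree = ast.parse(source)
    records = []
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            records.append({
                "path": path,
                "name": node.name,
                "start_line": node.lineno,
                "end_line": node.end_lineno,
                "code": ast.get_source_segment(source, node),
            })
    # ast.walk does not guarantee source order, so sort by position.
    records.sort(key=lambda record: record["start_line"])
    return records


EXAMPLE_SOURCE = '''def load_config(path):
    with open(path) as f:
        return f.read()


def authenticate(user, password):
    return bool(user) and bool(password)
'''

example_records = extract_functions(EXAMPLE_SOURCE, "app/auth.py")
```

Each record's `code` field is what would be passed to the embedding model, while `path`, `start_line`, and `end_line` let a search result point back to the source.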
| #### Key Features | |
| | Feature | Description | | |
| |---|---| | |
| | Semantic search | Maps natural language queries to relevant functions via dense embeddings | |
| | Function-level indexing | Indexed with file paths and line numbers | | |
| | Differential cache | Only re-embeds changed files | | |
| | JSON output | Easy integration with other tools and scripts | | |
| | MCP server support | Plug into AI coding agents (e.g., Claude Code, Cursor) | | |
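The differential cache row can be pictured as a content-hash comparison: a file is re-embedded only when its hash differs from the one stored on the previous run. The sketch below is one plausible way to do this with the standard library; it is an assumption for illustration and does not describe how Owl-CLI actually stores or compares its cache. The names `content_hash` and `files_to_reembed` are hypothetical.

```python
import hashlib


def content_hash(text):
    """SHA-256 hex digest of a file's text content."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def files_to_reembed(files, cache):
    """Compare current files against cached hashes.

    files: {path: content} for the current state of the codebase.
    cache: {path: digest} saved from the previous indexing run.
    Returns (changed_paths, new_cache); deleted files drop out of new_cache.
    """
    changed = []
    new_cache = {}
    for path, text in sorted(files.items()):
        digest = content_hash(text)
        new_cache[path] = digest
        if cache.get(path) != digest:
            changed.append(path)
    return changed, new_cache


# First run: nothing is cached, so every file needs embedding.
first_files = {"a.py": "def a(): pass\n", "b.py": "def b(): pass\n"}
first_changed, first_cache = files_to_reembed(first_files, {})

# Second run: only b.py was edited, so only b.py is re-embedded.
second_files = {"a.py": "def a(): pass\n", "b.py": "def b(): return 1\n"}
second_changed, second_cache = files_to_reembed(second_files, first_cache)
```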
| #### Example: Query Routing | |
|  | |
| #### Example: Interactive Session | |
|  | |
| #### Quick Start | |
| ```bash | |
| # Clone the repository (installation steps are in its README) | |
| git clone https://github.com/Shun0212/Owl-CLI.git | |
| # Index your codebase and search | |
| owl search "function that handles authentication" | |
| # JSON output for tool integration | |
| owl search "parse config file" --json | |
| # Start MCP server for AI agent integration | |
| owl mcp | |
| ``` | |
| For full documentation and installation instructions, see the [Owl-CLI repository](https://github.com/Shun0212/Owl-CLI). |
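The `--json` flag shown in the Quick Start makes it possible to call Owl-CLI from a script. The sketch below wraps that command with the standard library and parses whatever JSON it prints. It assumes `owl` is installed and on the `PATH`; the structure of the returned JSON is not documented here, so the code makes no assumptions about its fields. The helper names are hypothetical.

```python
import json
import subprocess


def build_search_command(query):
    """Build the argument list for `owl search <query> --json`."""
    return ["owl", "search", query, "--json"]


def run_search(query):
    """Run Owl-CLI and return its JSON output as parsed Python data."""
    completed = subprocess.run(
        build_search_command(query),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if the CLI exits with an error
    )
    # The schema is not assumed; callers should inspect the parsed result.
    return json.loads(completed.stdout)
```

Passing the query as a separate list element avoids shell quoting problems with queries that contain spaces or quotes.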