---
tags:
- sentence-transformers
- sentence-similarity
- feature-extraction
- dense
- generated_from_trainer
- dataset_size:800640
- loss:MultipleNegativesRankingLoss
base_model: Shuu12121/Owl-ph2-base-len2048
pipeline_tag: sentence-similarity
library_name: sentence-transformers
---
# Shuu12121/Owl-ph2-len2048 🦉
```
 ██████╗ ██╗    ██╗██╗              ██████╗ ██╗      ██╗
██╔═══██╗██║    ██║██║             ██╔════╝ ██║      ██║    ,______,
██║   ██║██║ █╗ ██║██║     ██████╗ ██║      ██║      ██║   ( O v O )
██║   ██║██║███╗██║██║     ╚═════╝ ██║      ██║      ██║    /  V  \
╚██████╔╝╚███╔███╔╝███████╗        ╚██████╗ ███████╗ ██║   /(     )\
 ╚═════╝  ╚══╝╚══╝ ╚══════╝         ╚═════╝ ╚══════╝ ╚═╝    ^^   ^^
```
## Model Details
### Model Description
- **Model Type:** Sentence Transformer
- **Base model:** [Shuu12121/Owl-ph2-base-len2048](https://huggingface.co/Shuu12121/Owl-ph2-base-len2048)
- **Maximum Sequence Length:** 1024 tokens (2048 tokens during pretraining)
- **Output Dimensionality:** 768
- **Similarity Function:** Cosine Similarity
This model is a SentenceTransformer variant of **Shuu12121/Owl-ph2-base-len2048**.
It was trained on the **Owl corpus** for **code search** and **code-text retrieval**.
The training data consists of roughly **100,000 samples per language** (**800,640 pairs** in total), and the model was trained for **1 epoch** with a **learning rate of 1e-5**.
### Model Sources
- **Base model:** [Shuu12121/Owl-ph2-base-len2048](https://huggingface.co/Shuu12121/Owl-ph2-base-len2048)
- **Sentence Transformers:** [Sentence Transformers Documentation](https://sbert.net)
### Full Model Architecture
```text
SentenceTransformer(
(0): Transformer({'max_seq_length': 1024, 'do_lower_case': False, 'architecture': 'ModernBertModel'})
(1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
)
```
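The pooling configuration above enables only `pooling_mode_cls_token`, so the sentence embedding is the final hidden state of the first token. A minimal NumPy sketch of that step follows; the token embeddings are random placeholders rather than real model output, and `cls_pool` is an illustrative helper, not a library function.

```python
import numpy as np

def cls_pool(token_embeddings: np.ndarray) -> np.ndarray:
    """Return the first-token (CLS) vector for each sequence in the batch.

    token_embeddings has shape (batch, seq_len, hidden).
    """
    return token_embeddings[:, 0, :]

# Placeholder input: 2 sequences, 4 tokens each, hidden size 768.
rng = np.random.default_rng(0)
tokens = rng.normal(size=(2, 4, 768))
sentence_embeddings = cls_pool(tokens)
print(sentence_embeddings.shape)  # (2, 768)
```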
## Intended Uses
This model is intended for:
* code search
* code-text retrieval
* semantic similarity
* dense embedding generation for source code and natural language
## Usage
### Direct Usage (Sentence Transformers)
```python
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("Shuu12121/Owl-ph2-len2048")
```
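Once loaded, the model can embed a natural-language query and candidate code snippets, and the cosine similarity between them (the similarity function listed above) gives a ranking. The following is a minimal sketch: the query, the snippets, and the `rank_by_cosine` helper are illustrative, and only `SentenceTransformer(...)` and `encode` come from the Sentence Transformers library.

```python
import numpy as np

def rank_by_cosine(query_vec, doc_vecs):
    """Return (indices sorted by descending cosine similarity, scores)."""
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    scores = d @ q
    return np.argsort(-scores), scores

def main():
    # Imported here so the helper above can be reused without loading the model.
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("Shuu12121/Owl-ph2-len2048")
    query = "read a JSON file and return its contents"
    snippets = [
        "def load_json(path):\n    with open(path) as f:\n        return json.load(f)",
        "def add(a, b):\n    return a + b",
    ]
    query_vec = model.encode(query)        # 1-D array, one embedding
    snippet_vecs = model.encode(snippets)  # 2-D array, one row per snippet
    order, scores = rank_by_cosine(query_vec, snippet_vecs)
    for i in order:
        print(f"{scores[i]:.3f}  {snippets[i].splitlines()[0]}")
```

Calling `main()` downloads the model on first use and prints the snippets from most to least similar to the query.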
## Training Details
### Training Dataset
This model was trained on the [**Owl corpus**](https://huggingface.co/collections/Shuu12121/codesearch-datasets), a dataset constructed for code search and code-text retrieval.
The training set contains approximately **100,000 samples per language**, resulting in **800,640 training pairs** in total.
### Training Hyperparameters
* **Learning rate:** 1e-5
* **Epochs:** 1
* **Loss:** MultipleNegativesRankingLoss
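
`MultipleNegativesRankingLoss` treats each pair in a batch as a positive and every other item in the batch as a negative. As a rough illustration of the idea (not the library's implementation), the NumPy sketch below computes cross-entropy over scaled cosine similarities. The `scale` of 20 is assumed to match the library default, since this card does not state the value used in training.

```python
import numpy as np

def multiple_negatives_ranking_loss(anchors, positives, scale=20.0):
    """In-batch-negatives loss: row i of `anchors` should match row i of `positives`."""
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = scale * (a @ p.T)                         # (batch, batch) scaled cosine similarities
    logits = logits - logits.max(axis=1, keepdims=True)  # stabilise the softmax
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))                # correct pairs sit on the diagonal

# Random placeholder embeddings, not real model output.
rng = np.random.default_rng(0)
anchors = rng.normal(size=(4, 768))
aligned = multiple_negatives_ranking_loss(anchors, anchors)         # every pair matches
shuffled = multiple_negatives_ranking_loss(anchors, anchors[::-1])  # every pair mismatched
print(aligned, shuffled)
```

The loss is near zero when each anchor is closest to its own positive and grows when the pairs are mismatched.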
---
## Integrations
### 🦉 Owl-CLI: Semantic Code Search in Your Terminal
> **Repository:** [https://github.com/Shun0212/Owl-CLI](https://github.com/Shun0212/Owl-CLI)
**Owl-ph2-len2048** is the embedding backbone of **[Owl-CLI](https://github.com/Shun0212/Owl-CLI)**, a command-line tool for semantic code search powered by dense retrieval.
Owl-CLI indexes your codebase at the **function level**, encodes each function using this model, and performs **vector similarity search** to find relevant code for natural language queries, directly from your terminal.
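
To illustrate what function-level indexing can look like, here is a minimal sketch that uses Python's standard `ast` module to pull each function out of a source string along with its line range. It is an illustration under stated assumptions, not Owl-CLI's actual implementation: the `FunctionChunk` type and `extract_functions` helper are hypothetical, and the sketch handles Python source only.

```python
import ast
from dataclasses import dataclass

@dataclass
class FunctionChunk:
    path: str
    name: str
    start_line: int
    end_line: int
    source: str

def extract_functions(path: str, source: str) -> list[FunctionChunk]:
    """Collect every function definition in `source` with its location."""
    chunks = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            chunks.append(FunctionChunk(
                path=path,
                name=node.name,
                start_line=node.lineno,
                end_line=node.end_lineno,
                source=ast.get_source_segment(source, node),
            ))
    return chunks

# Illustrative input; a real indexer would read files from disk.
example = "def login(user):\n    return check(user)\n\ndef logout():\n    pass\n"
for chunk in extract_functions("auth.py", example):
    print(f"{chunk.path}:{chunk.start_line}-{chunk.end_line} {chunk.name}")
```

Each chunk's `source` could then be embedded with the model and stored next to its path and line numbers for similarity search.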
#### Key Features
| Feature | Description |
|---|---|
| Semantic search | Natural language → relevant functions via dense embeddings |
| Function-level indexing | Indexed with file paths and line numbers |
| Differential cache | Only re-embeds changed files |
| JSON output | Easy integration with other tools and scripts |
| MCP server support | Plug into AI coding agents (e.g., Claude Code, Cursor) |
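
The differential cache row describes re-embedding only changed files. One common way to detect changes is to compare content hashes against those stored from the previous run. The sketch below assumes that approach (Owl-CLI may use a different mechanism), and `content_hash` and `files_to_reembed` are hypothetical helpers.

```python
import hashlib

def content_hash(text: str) -> str:
    """Stable fingerprint of a file's contents."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def files_to_reembed(current_files: dict[str, str], cached_hashes: dict[str, str]) -> list[str]:
    """Return paths whose content is new or differs from the cached hash."""
    return [
        path for path, text in current_files.items()
        if cached_hashes.get(path) != content_hash(text)
    ]

# Hashes saved by a previous indexing run (illustrative).
cache = {
    "a.py": content_hash("def a(): pass\n"),
    "b.py": content_hash("def b(): pass\n"),
}
# Current state: a.py unchanged, b.py edited, c.py newly added.
files = {
    "a.py": "def a(): pass\n",
    "b.py": "def b(): return 1\n",
    "c.py": "def c(): pass\n",
}
print(files_to_reembed(files, cache))  # ['b.py', 'c.py']
```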
#### Example: Query Routing
![example-routing](https://raw.githubusercontent.com/Shun0212/Owl-CLI/main/docs/images/example-routing.png)
#### Example: Interactive Session
![example-session](https://raw.githubusercontent.com/Shun0212/Owl-CLI/main/docs/images/example-session.png)
#### Quick Start
```bash
# Get the source (see the repository for installation instructions)
git clone https://github.com/Shun0212/Owl-CLI.git
# Index your codebase and search
owl search "function that handles authentication"
# JSON output for tool integration
owl search "parse config file" --json
# Start MCP server for AI agent integration
owl mcp
```
For full documentation and installation instructions, see the [Owl-CLI repository](https://github.com/Shun0212/Owl-CLI).