---
tags:
- sentence-transformers
- sentence-similarity
- feature-extraction
- dense
- generated_from_trainer
- dataset_size:800640
- loss:MultipleNegativesRankingLoss
base_model: Shuu12121/Owl-ph2-base-len2048
pipeline_tag: sentence-similarity
library_name: sentence-transformers
---

# Shuu12121/Owl-ph2-len2048 🦉

## Model Details

### Model Description
- **Model Type:** Sentence Transformer
- **Base model:** [Shuu12121/Owl-ph2-base-len2048](https://huggingface.co/Shuu12121/Owl-ph2-base-len2048)
- **Maximum Sequence Length:** 1024 tokens (2048 tokens during pretraining)
- **Output Dimensionality:** 768
- **Similarity Function:** Cosine Similarity

This model is a Sentence Transformers model fine-tuned from **Shuu12121/Owl-ph2-base-len2048**.
It was trained on the **Owl corpus** for **code search** and **code-text retrieval**.
The training data consists of roughly **100,000 samples per language** (**800,640 pairs** in total), and the model was trained for **1 epoch** with a **learning rate of 1e-5**.

### Model Sources

- **Base model:** [Shuu12121/Owl-ph2-base-len2048](https://huggingface.co/Shuu12121/Owl-ph2-base-len2048)
- **Sentence Transformers:** [Sentence Transformers Documentation](https://sbert.net)

### Full Model Architecture

```text
SentenceTransformer(
  (0): Transformer({'max_seq_length': 1024, 'do_lower_case': False, 'architecture': 'ModernBertModel'})
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
)
```
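
The `Pooling` module above has `pooling_mode_cls_token` set to `True`, so the sentence embedding is taken from the first token's output vector rather than from an average over all tokens. The toy sketch below contrasts the two strategies. The 2-dimensional token vectors and the helper names `cls_pool` and `mean_pool` are made up for illustration and do not reflect the library's internals.

```python
def cls_pool(token_embeddings):
    """CLS pooling: use the first token's vector as the sequence embedding."""
    return list(token_embeddings[0])


def mean_pool(token_embeddings):
    """Mean pooling, shown only for contrast: average each dimension over tokens."""
    count = len(token_embeddings)
    return [sum(dim) / count for dim in zip(*token_embeddings)]


# Toy encoder output: 3 tokens x 2 dimensions (the real model emits 768 dimensions).
token_embeddings = [
    [0.5, -1.0],  # first position: the [CLS] token
    [1.5, 3.0],
    [1.0, 1.0],
]

cls_embedding = cls_pool(token_embeddings)    # what this model's Pooling layer selects
mean_embedding = mean_pool(token_embeddings)  # not used by this model
```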

## Intended Uses

This model is intended for:

* code search
* code-text retrieval
* semantic similarity
* dense embedding generation for source code and natural language

## Usage

### Direct Usage (Sentence Transformers)

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Shuu12121/Owl-ph2-len2048")
```
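
Once loaded, the model can rank code snippets against a natural language query. The sketch below is one possible way to do this, not an official recipe: `load_model` and `rank_snippets` are hypothetical helper names, and the only library call used is `SentenceTransformer.encode` with `normalize_embeddings=True`. The import is deferred so that the ranking helper can be exercised with any object that exposes a compatible `encode` method.

```python
def load_model(name="Shuu12121/Owl-ph2-len2048"):
    """Load the model from the Hugging Face Hub (downloads it on first use)."""
    from sentence_transformers import SentenceTransformer
    return SentenceTransformer(name)


def rank_snippets(model, query, snippets, top_k=3):
    """Return up to top_k (score, snippet) pairs, best match first."""
    # normalize_embeddings=True yields unit vectors,
    # so a plain dot product equals cosine similarity.
    vectors = model.encode([query] + list(snippets), normalize_embeddings=True)
    query_vec, snippet_vecs = vectors[0], vectors[1:]
    scored = [
        (sum(float(q) * float(s) for q, s in zip(query_vec, vec)), snippet)
        for vec, snippet in zip(snippet_vecs, snippets)
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


# Example usage (requires sentence-transformers and network access):
# model = load_model()
# snippets = ["def read_lines(path): return open(path).read().splitlines()",
#             "def add(a, b): return a + b"]
# for score, snippet in rank_snippets(model, "read a file line by line", snippets):
#     print(f"{score:.3f}  {snippet}")
```

Inputs longer than the 1024-token maximum sequence length are truncated by the model, so very long functions may need to be split before encoding.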

## Training Details

### Training Dataset

This model was trained on the [**Owl corpus**](https://huggingface.co/collections/Shuu12121/codesearch-datasets), a dataset constructed for code search and code-text retrieval.
The training set contains approximately **100,000 samples per language**, resulting in **800,640 training pairs** in total.

### Training Hyperparameters

* **Learning rate:** 1e-5
* **Epochs:** 1
* **Loss:** MultipleNegativesRankingLoss
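
To illustrate the idea behind an in-batch negatives ranking loss such as MultipleNegativesRankingLoss, here is a dependency-free sketch. For each query, its paired code is the positive and every other code in the batch acts as a negative; the loss is a cross-entropy over scaled cosine similarities. The `scale=20.0` value and the use of cosine similarity are assumptions (the model card does not state them), and `in_batch_negatives_loss` is an illustrative name, not the library implementation.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def in_batch_negatives_loss(anchors, positives, scale=20.0):
    """Mean cross-entropy where positives[i] is the target for anchors[i]
    and every other positive in the batch acts as a negative."""
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [scale * cosine(anchor, positive) for positive in positives]
        # Numerically stable log-sum-exp over the row of logits.
        peak = max(logits)
        log_sum_exp = peak + math.log(sum(math.exp(z - peak) for z in logits))
        total += log_sum_exp - logits[i]
    return total / len(anchors)


# Toy 2-dimensional "embeddings" for two (query, code) pairs.
queries = [[1.0, 0.0], [0.0, 1.0]]
aligned_code = [[1.0, 0.1], [0.1, 1.0]]  # each query is closest to its own code
swapped_code = [[0.1, 1.0], [1.0, 0.1]]  # each query is closest to the other pair's code

loss_aligned = in_batch_negatives_loss(queries, aligned_code)
loss_swapped = in_batch_negatives_loss(queries, swapped_code)
```

The aligned batch yields a loss near zero, while the swapped batch is penalized heavily, which is the signal that pulls matching query and code embeddings together during training.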

## Integrations

### Owl-CLI

This model is used as the embedding model in **[Owl-CLI](https://github.com/Shun0212/Owl-CLI)**, a command-line tool for semantic code search.

Owl-CLI indexes source code at the **function level**, generates dense embeddings using this model, and performs **vector similarity search** to retrieve relevant code for natural language queries.
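
As a rough illustration of function-level indexing, the sketch below extracts each function from a Python source string together with its file path and line range, using only the standard-library `ast` module. This is not Owl-CLI's actual implementation, which is not described here; `extract_functions` and the record fields are hypothetical, and the sketch handles Python sources only.

```python
import ast


def extract_functions(source, path="<memory>"):
    """Return one record per function definition found in the source text."""
    tree = ast.parse(source)
    records = []
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            records.append({
                "path": path,
                "name": node.name,
                "start_line": node.lineno,
                "end_line": node.end_lineno,
                # Source text of the function, i.e. the unit that would be embedded.
                "code": ast.get_source_segment(source, node),
            })
    records.sort(key=lambda record: record["start_line"])
    return records


sample = '''def add(a, b):
    return a + b


class Greeter:
    def greet(self, name):
        return "hello " + name
'''

functions = extract_functions(sample, path="example.py")
```

Each record's `code` field could then be passed to the embedding model, while `path`, `start_line`, and `end_line` let a search result point back to the exact location in the file.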

Key features of Owl-CLI include:

- **Semantic code search** using dense embeddings
- **Function-level indexing** with file paths and line numbers
- **Automatic indexing** on first search
- **Differential embedding cache** to avoid re-embedding unchanged files
- **JSON output** for tool integration
- **MCP server support** for integration with AI coding agents (e.g., Claude Code)
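
The differential cache idea can be sketched with content hashes: a file is re-embedded only when its hash differs from the one stored on the previous run. The code below is an assumption about how such a cache might work, not a description of Owl-CLI's internals; `content_hash`, `embed_changed`, and the cache layout are hypothetical, and `fake_embed` stands in for a real call to the embedding model.

```python
import hashlib


def content_hash(text):
    """Stable fingerprint of a file's contents."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def embed_changed(files, cache, embed):
    """Refresh the cache in place and return the paths that were re-embedded.

    files: dict of path -> file contents
    cache: dict of path -> (content hash, embedding)
    embed: callable mapping text to an embedding
    """
    refreshed = []
    for path, text in files.items():
        digest = content_hash(text)
        cached = cache.get(path)
        if cached is not None and cached[0] == digest:
            continue  # unchanged since the last run: reuse the stored embedding
        cache[path] = (digest, embed(text))
        refreshed.append(path)
    # Forget files that no longer exist.
    for path in list(cache):
        if path not in files:
            del cache[path]
    return refreshed


def fake_embed(text):
    """Stand-in embedder for the demo; a real one would call the model."""
    return [float(len(text))]


cache = {}
first_run = embed_changed({"a.py": "x = 1", "b.py": "y = 2"}, cache, fake_embed)
second_run = embed_changed({"a.py": "x = 1", "b.py": "y = 3"}, cache, fake_embed)
```

On the first run both files are embedded; on the second run only the modified `b.py` is processed again.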

Repository:  
https://github.com/Shun0212/Owl-CLI