	---
	tags:
	- sentence-transformers
	- sentence-similarity
	- feature-extraction
	- dense
	- generated_from_trainer
	- dataset_size:800640
	- loss:MultipleNegativesRankingLoss
	base_model: Shuu12121/Owl-ph2-base-len2048
	pipeline_tag: sentence-similarity
	library_name: sentence-transformers
	---

	# Shuu12121/Owl-ph2-len2048 🦉

	```
	██████╗ ██╗ ██╗██╗ ██████╗ ██╗ ██╗
	██╔═══██╗██║ ██║██║ ██╔════╝ ██║ ██║ ,______,
	██║ ██║██║ █╗ ██║██║ ██████╗ ██║ ██║ ██║ ( O v O )
	██║ ██║██║███╗██║██║ ╚═════╝ ██║ ██║ ██║ / V \
	╚██████╔╝╚███╔███╔╝███████╗ ╚██████╗ ███████╗ ██║ /( )\
	╚═════╝ ╚══╝╚══╝ ╚══════╝ ╚═════╝ ╚══════╝ ╚═╝ ^^ ^^
	```

	## Model Details

	### Model Description
	- Model Type: Sentence Transformer
	- Base model: [Shuu12121/Owl-ph2-base-len2048](https://huggingface.co/Shuu12121/Owl-ph2-base-len2048)
	- Maximum Sequence Length: 1024 tokens (2048 tokens during pretraining)
	- Output Dimensionality: 768
	- Similarity Function: Cosine Similarity

	This model is a SentenceTransformer variant of Shuu12121/Owl-ph2-base-len2048.
	It was trained on the Owl corpus for code search and code-text retrieval.
	The training data consists of roughly 100,000 samples per language (800,640 pairs in total), and the model was trained for 1 epoch with a learning rate of 1e-5.

	### Model Sources

	- Base model: [Shuu12121/Owl-ph2-base-len2048](https://huggingface.co/Shuu12121/Owl-ph2-base-len2048)
	- Sentence Transformers: [Sentence Transformers Documentation](https://sbert.net)

	### Full Model Architecture

	```text
	SentenceTransformer(
	(0): Transformer({'max_seq_length': 1024, 'do_lower_case': False, 'architecture': 'ModernBertModel'})
	(1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
	)
	```
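
The `Pooling` configuration above enables only `pooling_mode_cls_token`, so the sentence embedding is the hidden state at the first token position rather than an average over all tokens. The toy illustration below uses made-up 3-dimensional token vectors for readability (the real model outputs 768 dimensions), and `cls_pool` and `mean_pool` are hypothetical helper names, not library functions.

```python
def cls_pool(token_embeddings):
    """Return the vector at position 0 (the classification-style first token)."""
    if not token_embeddings:
        raise ValueError("need at least one token embedding")
    return list(token_embeddings[0])


def mean_pool(token_embeddings):
    """Average over all token vectors; shown only for contrast (disabled in this model)."""
    if not token_embeddings:
        raise ValueError("need at least one token embedding")
    count = len(token_embeddings)
    return [sum(column) / count for column in zip(*token_embeddings)]


# Made-up hidden states for a 3-token input; row 0 stands in for the first token.
tokens = [
    [0.5, -1.0, 2.0],
    [1.5, 0.0, 0.0],
    [1.0, 1.0, 1.0],
]

cls_vector = cls_pool(tokens)    # picks the first row unchanged
mean_vector = mean_pool(tokens)  # averages each column across tokens
```

With CLS pooling, the other token vectors influence the result only indirectly, through the attention layers that produced the first token's hidden state.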

	## Intended Uses

	This model is intended for:

	* code search
	* code-text retrieval
	* semantic similarity
	* dense embedding generation for source code and natural language

	## Usage

	### Direct Usage (Sentence Transformers)

	```python
	from sentence_transformers import SentenceTransformer

	model = SentenceTransformer("Shuu12121/Owl-ph2-len2048")
	```
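
Once loaded, the model can embed a natural-language query and candidate code snippets, which can then be ranked by cosine similarity. The sketch below shows one way to do this. `rank_snippets` and `load_owl_encoder` are hypothetical helpers, and the only library calls assumed are `SentenceTransformer(...)` and `model.encode(..., normalize_embeddings=True)`. The model is loaded lazily so the ranking logic can be tried with any encoder.

```python
def rank_snippets(encode, query, snippets):
    """Rank snippets by similarity to the query.

    `encode` is any callable that maps a list of strings to a list of
    L2-normalized vectors, so the dot product equals cosine similarity.
    """
    query_vec = encode([query])[0]
    snippet_vecs = encode(snippets)
    scored = []
    for snippet, vec in zip(snippets, snippet_vecs):
        score = sum(float(q) * float(v) for q, v in zip(query_vec, vec))
        scored.append((score, snippet))
    # Highest similarity first.
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


def load_owl_encoder(model_name="Shuu12121/Owl-ph2-len2048"):
    """Return an encoder backed by the real model (downloads weights on first use)."""
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer(model_name)

    def encode(texts):
        # normalize_embeddings=True returns unit-length vectors.
        return model.encode(texts, normalize_embeddings=True)

    return encode
```

Calling `rank_snippets(load_owl_encoder(), query, snippets)` returns `(score, snippet)` pairs ordered from most to least similar.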

	## Training Details

	### Training Dataset

	This model was trained on the [Owl corpus](https://huggingface.co/collections/Shuu12121/codesearch-datasets), a dataset constructed for code search and code-text retrieval.
	The training set contains approximately 100,000 samples per language, resulting in 800,640 training pairs in total.

	### Training Hyperparameters

	* Learning rate: 1e-5
	* Epochs: 1
	* Loss: MultipleNegativesRankingLoss
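
For intuition about the loss: `MultipleNegativesRankingLoss` treats each pair in a batch as a positive and every other item in the same batch as a negative, then applies cross-entropy over scaled similarity scores. The following is a plain-Python sketch of that idea, not the library's implementation. The `scale=20.0` value mirrors the Sentence Transformers default and is an assumption here, since this card does not state the value used in training.

```python
import math


def _cosine(a, b):
    # Assumes both vectors are non-zero.
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def in_batch_negatives_loss(anchors, positives, scale=20.0):
    """Mean cross-entropy where positives[i] is the correct match for anchors[i]."""
    if len(anchors) != len(positives) or not anchors:
        raise ValueError("need equally sized, non-empty batches")
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [scale * _cosine(anchor, candidate) for candidate in positives]
        # Numerically stable log-sum-exp over all candidates in the batch.
        peak = max(logits)
        log_sum_exp = peak + math.log(sum(math.exp(z - peak) for z in logits))
        total += log_sum_exp - logits[i]
    return total / len(anchors)
```

The loss is small when each anchor is closest to its own positive and grows when another item in the batch scores higher, which is why larger batches supply more negatives per step.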

	---

	## Integrations

	### 🦉 Owl-CLI — Semantic Code Search in Your Terminal

	> Repository: [https://github.com/Shun0212/Owl-CLI](https://github.com/Shun0212/Owl-CLI)

	Owl-ph2-len2048 is the embedding backbone of [Owl-CLI](https://github.com/Shun0212/Owl-CLI), a command-line tool for semantic code search powered by dense retrieval.

	Owl-CLI indexes your codebase at the function level, encodes each function using this model, and performs vector similarity search to find relevant code for natural language queries — directly from your terminal.
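
To make the function-level indexing step concrete, here is a minimal sketch for Python source using the standard `ast` module. It is an illustration only: how Owl-CLI actually parses code is not described in this card, and `FunctionRecord` and `extract_functions` are hypothetical names.

```python
import ast
from dataclasses import dataclass


@dataclass
class FunctionRecord:
    path: str        # file the function came from
    name: str        # function name
    start_line: int  # 1-based first line of the definition
    end_line: int    # 1-based last line of the definition
    source: str      # text that would be passed to the embedding model


def extract_functions(path, source):
    """Collect every (async) function definition in `source`, including nested ones."""
    tree = ast.parse(source)
    lines = source.splitlines()
    records = []
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            body = "\n".join(lines[node.lineno - 1 : node.end_lineno])
            records.append(
                FunctionRecord(path, node.name, node.lineno, node.end_lineno, body)
            )
    # ast.walk is breadth-first, so sort to get source order.
    return sorted(records, key=lambda r: r.start_line)
```

Each record's `source` would be embedded, while `path`, `start_line`, and `end_line` let a search result point back to the exact location in the codebase.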

	#### Key Features

	| Feature | Description |
	|---|---|
	| Semantic search | Natural language → relevant functions via dense embeddings |
	| Function-level indexing | Indexed with file paths and line numbers |
	| Differential cache | Only re-embeds changed files |
	| JSON output | Easy integration with other tools and scripts |
	| MCP server support | Plug into AI coding agents (e.g., Claude Code, Cursor) |
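
The differential cache idea can be illustrated with content hashing: store a digest per file and re-embed only files whose digest has changed. This is a sketch under that assumption, not Owl-CLI's actual cache format. `content_digest`, `files_to_reembed`, and the in-memory dictionaries are hypothetical.

```python
import hashlib


def content_digest(text):
    """Stable fingerprint of a file's contents."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def files_to_reembed(current_files, cached_digests):
    """Compare current file contents against cached digests.

    current_files:  {path: file text}
    cached_digests: {path: digest stored by the previous indexing run}
    Returns (paths needing re-embedding, paths to drop, updated digest map).
    """
    new_digests = {path: content_digest(text) for path, text in current_files.items()}
    changed = sorted(
        path for path, digest in new_digests.items()
        if cached_digests.get(path) != digest
    )
    removed = sorted(path for path in cached_digests if path not in new_digests)
    return changed, removed, new_digests
```

New files count as changed because they have no cached digest, and files that disappeared are reported separately so their vectors can be dropped from the index.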

	#### Example: Query Routing

	![example-routing](https://raw.githubusercontent.com/Shun0212/Owl-CLI/main/docs/images/example-routing.png)

	#### Example: Interactive Session

	![example-session](https://raw.githubusercontent.com/Shun0212/Owl-CLI/main/docs/images/example-session.png)

	#### Quick Start

	```bash
	# Clone the repository (see its README for installation steps)
	git clone https://github.com/Shun0212/Owl-CLI.git

	# Index your codebase and search
	owl search "function that handles authentication"

	# JSON output for tool integration
	owl search "parse config file" --json

	# Start MCP server for AI agent integration
	owl mcp
	```

	For full documentation and installation instructions, see the [Owl-CLI repository](https://github.com/Shun0212/Owl-CLI).