Instructions for using omron-sinicx/DGPO-qwen2.5-0.5b with libraries, inference servers, and local apps.

Libraries

How to use omron-sinicx/DGPO-qwen2.5-0.5b with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="omron-sinicx/DGPO-qwen2.5-0.5b")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
# Qwen2.5 is a text-only causal LM, so load it with AutoModelForCausalLM
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("omron-sinicx/DGPO-qwen2.5-0.5b")
model = AutoModelForCausalLM.from_pretrained("omron-sinicx/DGPO-qwen2.5-0.5b")
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))


vLLM

How to use omron-sinicx/DGPO-qwen2.5-0.5b with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "omron-sinicx/DGPO-qwen2.5-0.5b"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "omron-sinicx/DGPO-qwen2.5-0.5b",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'
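
The same request can also be sent from Python. The following is a minimal sketch using only the standard library; the helper names `build_chat_request` and `chat` are illustrative and not part of vLLM, and it assumes the server started above is listening on `localhost:8000`.

```python
import json
import urllib.request

MODEL_ID = "omron-sinicx/DGPO-qwen2.5-0.5b"


def build_chat_request(prompt, base_url="http://localhost:8000", model=MODEL_ID):
    """Build a POST request for the OpenAI-compatible chat completions route."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt, **kwargs):
    """Send the request and return the assistant's reply text (needs a running server)."""
    with urllib.request.urlopen(build_chat_request(prompt, **kwargs)) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

With the server running, `chat("What is the capital of France?")` should return the model's reply as a string. Passing `base_url="http://localhost:30000"` targets a server on a different port.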

Use Docker

docker model run hf.co/omron-sinicx/DGPO-qwen2.5-0.5b

SGLang

How to use omron-sinicx/DGPO-qwen2.5-0.5b with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "omron-sinicx/DGPO-qwen2.5-0.5b" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "omron-sinicx/DGPO-qwen2.5-0.5b",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "omron-sinicx/DGPO-qwen2.5-0.5b" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "omron-sinicx/DGPO-qwen2.5-0.5b",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Docker Model Runner
How to use omron-sinicx/DGPO-qwen2.5-0.5b with Docker Model Runner:
```
docker model run hf.co/omron-sinicx/DGPO-qwen2.5-0.5b
```

Can Compact Language Models Search Like Agents? Distillation-Guided Policy Optimization for Preserving Agentic RAG Capabilities

🧠 Overview

DGPO (Distillation-Guided Policy Optimization) is a reinforcement learning framework with integrated knowledge distillation, designed to enable agentic search behaviors in compact language models.

While RL works well for large models, compact models suffer from:

❌ Poor initial outputs
❌ Training collapse in RL
❌ Ineffective exploration

DGPO solves this by combining:

✅ Cold-start knowledge distillation (KD)
✅ Teacher-guided reinforcement learning

This enables stable learning and even allows compact models to match or surpass their teacher on some benchmarks.

⚙️ Key Idea

🔁 Distillation-Guided RL

DGPO introduces a simple but powerful principle:

✅ Reward if correct ❌ Mimic teacher if wrong

This creates a stable learning signal even when the model is weak.

🏗️ Framework

1. Cold-Start Initialization (KD)

- Train the student on teacher-generated outputs (TGO)
- Provides high-quality trajectories
- Prevents early collapse

2. Distillation-Guided RL

- Use PPO-based RL
- Reward correct answers
- Apply a selective KL penalty only when the answer is wrong

This enables:

- Stable training
- Efficient exploration
- Error-focused learning
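
The "reward if correct, mimic teacher if wrong" rule can be sketched as a per-rollout training signal. This is an illustration under stated assumptions, not the authors' implementation: this card does not give the exact reward shaping, KL direction, or coefficient, so the 0/1 reward, the sample-based KL estimate, and the names `kl_estimate`, `dgpo_signal`, and `beta` are all hypothetical.

```python
from typing import Sequence


def kl_estimate(student_logprobs: Sequence[float],
                teacher_logprobs: Sequence[float]) -> float:
    """Sample-based estimate of KL(student || teacher).

    Both sequences hold per-token log-probabilities of the same tokens,
    which are assumed to have been sampled from the student.
    """
    if len(student_logprobs) != len(teacher_logprobs):
        raise ValueError("log-prob sequences must have equal length")
    if not student_logprobs:
        return 0.0
    diffs = [s - t for s, t in zip(student_logprobs, teacher_logprobs)]
    return sum(diffs) / len(diffs)


def dgpo_signal(is_correct: bool,
                student_logprobs: Sequence[float],
                teacher_logprobs: Sequence[float],
                beta: float = 0.1) -> float:
    """Return the scalar signal for one rollout.

    Correct answer: plain task reward, no KL term.
    Wrong answer: zero reward minus a KL penalty that pulls the student
    toward the teacher (assumed form; beta is a hypothetical coefficient).
    """
    if is_correct:
        return 1.0
    return 0.0 - beta * kl_estimate(student_logprobs, teacher_logprobs)
```

In this sketch a correct rollout is never penalized for diverging from the teacher, which leaves room for exploration, while a wrong rollout still receives a dense signal from the teacher instead of a bare zero reward.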

🔍 Agentic RAG Behavior

DGPO trains models to perform multi-step search reasoning:

<think> reasoning </think>

<search> query </search>  
<information> retrieved docs </information>  
<answer> final answer </answer>
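
A rollout in this format can be driven by a small loop that parses the model's tags and calls a retriever. The sketch below is illustrative only: `generate` and `retrieve` are hypothetical callables supplied by the caller (a model wrapper and a search backend), and `max_turns` is an assumed cap that does not come from this card.

```python
import re
from typing import Callable, Optional, Tuple

# Matches the first <search>...</search> or <answer>...</answer> block.
ACTION_RE = re.compile(r"<(search|answer)>(.*?)</\1>", re.DOTALL)


def next_action(text: str) -> Optional[Tuple[str, str]]:
    """Return (tag, content) for the first search/answer block, or None."""
    match = ACTION_RE.search(text)
    if match is None:
        return None
    return match.group(1), match.group(2).strip()


def run_episode(question: str,
                generate: Callable[[str], str],
                retrieve: Callable[[str], str],
                max_turns: int = 4) -> Optional[str]:
    """Alternate between generation and retrieval until an answer appears."""
    context = question
    for _ in range(max_turns):
        step = generate(context)          # model emits <think> plus one action
        context += step
        action = next_action(step)
        if action is None:                # no recognizable action: stop
            break
        tag, content = action
        if tag == "answer":
            return content
        # tag == "search": append retrieved text for the next turn
        context += f"<information> {retrieve(content)} </information>"
    return None
```

The loop returns `None` when the model never produces an `<answer>` block within `max_turns`, so callers can treat that case as a failed episode.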

🚀 Performance

Overall QA Performance

📊 Qwen2.5 (3B → 0.5B)

| Method | NQ | TriviaQA | PopQA | HotpotQA | 2Wiki | MuSiQue | Bamboogle | Avg. |
|---|---|---|---|---|---|---|---|---|
| Student-0.5B | 0.004 | 0.006 | 0.007 | 0.007 | 0.015 | 0.000 | 0.000 | 0.006 |
| Teacher-3B | 0.365 | 0.569 | 0.393 | 0.340 | 0.368 | 0.135 | 0.298 | 0.353 |
| PPO | 0.306 | 0.444 | 0.379 | 0.205 | 0.218 | 0.041 | 0.073 | 0.238 |
| GKD | 0.266 | 0.408 | 0.358 | 0.216 | 0.217 | 0.055 | 0.161 | 0.240 |
| SeqKD | 0.331 | 0.416 | 0.364 | 0.283 | 0.273 | 0.089 | 0.169 | 0.275 |
| KD | 0.331 | 0.431 | 0.373 | 0.286 | 0.284 | 0.091 | 0.290 | 0.298 |
| DistiLLM | 0.333 | 0.442 | 0.373 | 0.288 | 0.270 | 0.095 | 0.209 | 0.287 |
| TAID | 0.325 | 0.427 | 0.365 | 0.290 | 0.270 | 0.079 | 0.218 | 0.282 |
| DGPO (ours) | 0.378 | 0.481 | 0.402 | 0.342 | 0.303 | 0.120 | 0.274 | 0.329 |

👉 DGPO achieves a ~55× improvement in average score over the base 0.5B model (0.006 → 0.329)

👉 On NQ, PopQA, and HotpotQA, the 0.5B student surpasses the 3B teacher
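
The two claims can be checked directly against the reported scores. The short script below copies the Student-0.5B, Teacher-3B, and DGPO rows from the results for Qwen2.5 (3B → 0.5B); the helper names are illustrative.

```python
BENCHMARKS = ["NQ", "TriviaQA", "PopQA", "HotpotQA", "2Wiki", "MuSiQue", "Bamboogle"]

# Per-benchmark scores as reported for Qwen2.5 (3B -> 0.5B).
TEACHER = [0.365, 0.569, 0.393, 0.340, 0.368, 0.135, 0.298]
DGPO = [0.378, 0.481, 0.402, 0.342, 0.303, 0.120, 0.274]

# Reported average scores (the "Avg." column).
STUDENT_AVG = 0.006
DGPO_AVG = 0.329


def improvement_ratio(base_avg: float, new_avg: float) -> float:
    """How many times larger the new average is than the base average."""
    return new_avg / base_avg


def student_wins(student_scores, teacher_scores, names):
    """Names of benchmarks where the student strictly beats the teacher."""
    return [n for n, s, t in zip(names, student_scores, teacher_scores) if s > t]


if __name__ == "__main__":
    print(f"improvement: {improvement_ratio(STUDENT_AVG, DGPO_AVG):.1f}x")
    print("student > teacher on:", student_wins(DGPO, TEACHER, BENCHMARKS))
```

The ratio uses the rounded averages from the Avg. column, so it is an approximation of the improvement rather than an exact figure.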

🎓 Citation

@article{kotoge2025dgpo,
title={Can Compact Language Models Search Like Agents? Distillation-Guided Policy Optimization},
author={Kotoge, Rikuto and Nishimura, Mai and Ma, Jiaxin},
journal={arXiv preprint arXiv:2508.20324},
year={2025}
}

@inproceedings{
kotoge2025democratizing,
title={Democratizing Agentic {RAG}: Distillation-Guided Policy Optimization for Compact Language Models},
author={Rikuto Kotoge and Mai Nishimura and Jiaxin Ma},
booktitle={NeurIPS 2025 Workshop on Bridging Language, Agent, and World Models for Reasoning and Planning},
year={2025},
url={https://openreview.net/forum?id=CP0H9NAWES}
}