How to use rzzhan/ExGRPO-Qwen2.5-7B-Instruct with libraries and local apps.

Libraries

How to use rzzhan/ExGRPO-Qwen2.5-7B-Instruct with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="rzzhan/ExGRPO-Qwen2.5-7B-Instruct")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
# Qwen2.5-7B-Instruct is a text-only causal language model, so use AutoModelForCausalLM
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("rzzhan/ExGRPO-Qwen2.5-7B-Instruct")
model = AutoModelForCausalLM.from_pretrained("rzzhan/ExGRPO-Qwen2.5-7B-Instruct")
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Local Apps

vLLM

How to use rzzhan/ExGRPO-Qwen2.5-7B-Instruct with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "rzzhan/ExGRPO-Qwen2.5-7B-Instruct"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "rzzhan/ExGRPO-Qwen2.5-7B-Instruct",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'
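
The same request can be sent from Python. The following is a minimal sketch using only the standard library; the helper names `build_chat_request` and `chat` are hypothetical, and it assumes the vLLM server started above is listening on `localhost:8000` and returns the usual OpenAI-style chat completion JSON.

```python
import json
import urllib.request

MODEL_ID = "rzzhan/ExGRPO-Qwen2.5-7B-Instruct"


def build_chat_request(prompt, base_url="http://localhost:8000", model=MODEL_ID):
    """Build (but do not send) a POST request mirroring the curl call above."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt, **kwargs):
    """Send the request and return the assistant's reply text.

    Assumes an OpenAI-compatible response with choices[0].message.content.
    """
    with urllib.request.urlopen(build_chat_request(prompt, **kwargs)) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

With the server running, `print(chat("What is the capital of France?"))` should print the model's answer. For the SGLang server described further down, passing `base_url="http://localhost:30000"` should work the same way.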

Use Docker

docker model run hf.co/rzzhan/ExGRPO-Qwen2.5-7B-Instruct

SGLang

How to use rzzhan/ExGRPO-Qwen2.5-7B-Instruct with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "rzzhan/ExGRPO-Qwen2.5-7B-Instruct" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "rzzhan/ExGRPO-Qwen2.5-7B-Instruct",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "rzzhan/ExGRPO-Qwen2.5-7B-Instruct" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "rzzhan/ExGRPO-Qwen2.5-7B-Instruct",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Docker Model Runner
How to use rzzhan/ExGRPO-Qwen2.5-7B-Instruct with Docker Model Runner:
```
docker model run hf.co/rzzhan/ExGRPO-Qwen2.5-7B-Instruct
```

ExGRPO-Qwen2.5-7B-Instruct / README.md

rzzhan

Add model card for ExGRPO: Learning to Reason from Experience (#1)

fcaf505 verified 8 months ago


	---
	license: apache-2.0
	library_name: transformers
	pipeline_tag: text-generation
	---

	# ExGRPO: Learning to Reason from Experience

	The model `ExGRPO` was presented in the paper [ExGRPO: Learning to Reason from Experience](https://huggingface.co/papers/2510.02245).

	## Abstract
	Reinforcement learning from verifiable rewards (RLVR) is an emerging paradigm
	for improving the reasoning ability of large language models. However, standard
	on-policy training discards rollout experiences after a single update, leading
	to computational inefficiency and instability. While prior work on RL has
	highlighted the benefits of reusing past experience, the role of experience
	characteristics in shaping learning dynamics of large reasoning models remains
	underexplored. In this paper, we are the first to investigate what makes a
	reasoning experience valuable and identify rollout correctness and entropy as
	effective indicators of experience value. Based on these insights, we propose
	ExGRPO (Experiential Group Relative Policy Optimization), a framework that
	organizes and prioritizes valuable experiences, and employs a mixed-policy
	objective to balance exploration with experience exploitation. Experiments on
	five backbone models (1.5B-8B parameters) show that ExGRPO consistently
	improves reasoning performance on mathematical/general benchmarks, with an
	average gain of +3.5/7.6 points over on-policy RLVR. Moreover, ExGRPO
	stabilizes training on both stronger and weaker models where on-policy methods
	fail. These results highlight principled experience management as a key
	ingredient for efficient and scalable RLVR.

	<div align="center">
	<img src="https://github.com/ElliottYan/LUFFY/raw/main/ExGRPO/figures/exgrpo_intro.png" alt="ExGRPO Overview" style="width: 88%; height: auto;">
	</div>

	## Introduction

	Existing RLVR methods for reasoning tasks predominantly rely on on-policy optimization, which discards online rollouts after a single update, wasting valuable exploration signals and constraining scalability. We conduct a systematic analysis of experience utility in RLVR and identify question difficulty and trajectory entropy as effective online proxies for assessing experience quality. Building on these insights, we propose ExGRPO, a novel framework that strategically manages and replays high-value experiences through bucketed prioritization and mixed-policy optimization, enabling more efficient and stable RLVR training.
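
To make the idea of bucketed experience replay concrete, here is a small illustrative sketch. It is not the authors' implementation: the class `BucketedReplayBuffer`, its parameters, the Gaussian weighting toward mid-range correctness, and the choice of replaying the lowest-entropy successful trajectory are all assumptions made for this example. The official repository has the actual scheme.

```python
import math
import random
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Experience:
    question_id: str
    trajectory: str
    entropy: float  # assumed: mean token-level entropy, computed elsewhere


class BucketedReplayBuffer:
    """Hypothetical buffer that groups questions by rollout correctness."""

    def __init__(self, num_buckets=4, center=0.5, sigma=0.25):
        self.num_buckets = num_buckets
        self.center = center  # assumed: favor questions of medium difficulty
        self.sigma = sigma
        self.correct_rate = {}              # question_id -> fraction of correct rollouts
        self.successes = defaultdict(list)  # question_id -> correct trajectories

    def add_group(self, question_id, rollouts):
        """rollouts: list of (trajectory, entropy, is_correct) for one question."""
        if not rollouts:
            return
        self.correct_rate[question_id] = sum(ok for _, _, ok in rollouts) / len(rollouts)
        for trajectory, entropy, ok in rollouts:
            if ok:
                self.successes[question_id].append(
                    Experience(question_id, trajectory, entropy)
                )

    def bucket_of(self, question_id):
        """Map a correctness rate in [0, 1] to a bucket index."""
        rate = self.correct_rate[question_id]
        return min(int(rate * self.num_buckets), self.num_buckets - 1)

    def _bucket_weight(self, bucket):
        # Gaussian weight on the bucket midpoint (an assumption of this sketch).
        mid = (bucket + 0.5) / self.num_buckets
        return math.exp(-((mid - self.center) ** 2) / (2 * self.sigma ** 2))

    def sample(self, k, rng=random):
        """Draw k questions by bucket weight; replay each one's lowest-entropy success."""
        candidates = [q for q, exps in self.successes.items() if exps]
        if not candidates:
            return []
        weights = [self._bucket_weight(self.bucket_of(q)) for q in candidates]
        chosen = rng.choices(candidates, weights=weights, k=k)
        return [min(self.successes[q], key=lambda e: e.entropy) for q in chosen]
```

In a training loop, `add_group` would be called after each batch of on-policy rollouts, and `sample` would supply replayed trajectories to mix with fresh ones in the next update.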

	### Key Highlights:
	- Experience Value Modeling: Introduces two online proxy metrics, rollout correctness and trajectory entropy, for quantifying the value of RLVR experience.
	- ExGRPO Framework: Built on top of GRPO, ExGRPO introduces a systematic experience management mechanism and an experience optimization objective to maximize the benefit of past explorations.
	- Generalization and Stability: Demonstrates broad applicability across different backbone models and mitigates training collapse of on-policy RLVR in challenging scenarios.
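
For readers unfamiliar with GRPO, the group-relative advantage it builds on can be sketched in a few lines. This shows only the standard GRPO normalization (each rollout's reward standardized against the other rollouts for the same question), not the ExGRPO objective itself. The function name and the use of the population standard deviation are choices made for this example, and implementations differ on such details.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Standardize rewards within one group of rollouts for the same question."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; some implementations use the sample std
    return [(r - mu) / (sigma + eps) for r in rewards]


# Mixed outcomes give a learning signal: correct rollouts get positive advantages.
print(group_relative_advantages([1, 0, 0, 1]))

# If every rollout gets the same reward, all advantages are zero,
# so the group contributes no gradient.
print(group_relative_advantages([1, 1, 1, 1]))
```

The second case illustrates why questions that are always solved (or never solved) provide little on-policy signal, which is part of the motivation for selecting experiences by rollout correctness.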

	## Where to go next

	For more details on installation, data preparation, training, and evaluation, please refer to the [official GitHub repository](https://github.com/ElliottYan/LUFFY/tree/main/ExGRPO).

	A collection of related ExGRPO models is also available on the Hugging Face Hub: [ExGRPO Collection](https://huggingface.co/collections/rzzhan/exgrpo-68d8e302efdfe325187d5c96).

	## Citation

	If you find our model, data, or evaluation code useful, please cite our paper:

	```bib
	@article{zhan2025exgrpo,
	  title={ExGRPO: Learning to Reason from Experience},
	  author={Runzhe Zhan and Yafu Li and Zhi Wang and Xiaoye Qu and Dongrui Liu and Jing Shao and Derek F. Wong and Yu Cheng},
	  year={2025},
	  journal={ArXiv preprint},
	  volume={2510.02245},
	  url={https://arxiv.org/abs/2510.02245},
	}
	```