---
license: mit
base_model: Qwen/Qwen2.5-0.5B
library_name: peft
tags:
  - process-reward-model
  - reasoning
  - reward-model
  - lora
  - test-time-compute
  - ai-efficiency
pipeline_tag: text-classification
datasets:
  - vanthienha199/thinktank-reward-pairs
  - vanthienha199/thinktank-reasoning-labels
---

# ThinkTank PRM — Process Reward Model for Reasoning Efficiency

**A reward model that scores reasoning steps as useful or wasteful.**

Trained on crowdsourced human judgments from [ThinkTank](https://thinktanklab.quest), a Game With A Purpose (GWAP) in which players identify wasteful steps in AI reasoning chains.

## Results

| Metric | Value |
|--------|-------|
| **Pairwise accuracy** | **95.7%** |
| Eval loss | 0.071 |
| Training pairs | 92 |
| Eval pairs | 23 |
| Training time | 105 seconds |
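
With only 23 eval pairs, 95.7% corresponds to 22 pairs ranked correctly and 1 ranked incorrectly, so each pair moves the metric by about 4.3 points. The sketch below shows how pairwise accuracy is commonly computed. It assumes a pair counts as correct when the chosen step scores strictly higher than the rejected step; this card does not spell out the exact definition, and the scores in the example are made up for illustration.

```python
def pairwise_accuracy(pairs):
    """Fraction of (chosen_score, rejected_score) pairs ranked correctly."""
    if not pairs:
        raise ValueError("need at least one pair")
    # A pair is correct when the chosen step outscores the rejected one.
    correct = sum(1 for chosen, rejected in pairs if chosen > rejected)
    return correct / len(pairs)


# Illustrative scores only: 22 correctly ordered pairs and 1 inverted pair.
example_pairs = [(1.0, -1.0)] * 22 + [(-0.5, 0.5)]
accuracy = pairwise_accuracy(example_pairs)
print(f"{accuracy:.1%}")  # 95.7%
```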

## Usage

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from peft import PeftModel

# Load the tokenizer, the base model with a single-logit head, and the LoRA adapter
tokenizer = AutoTokenizer.from_pretrained("vanthienha199/thinktank-prm-qwen2.5-0.5b")
base = AutoModelForSequenceClassification.from_pretrained("Qwen/Qwen2.5-0.5B", num_labels=1)
# Only needed when scoring padded batches; harmless for a single input
base.config.pad_token_id = tokenizer.pad_token_id
model = PeftModel.from_pretrained(base, "vanthienha199/thinktank-prm-qwen2.5-0.5b")
model.eval()

# Score a single reasoning step (the question and the step go in one string)
text = "Question: What is 25% of 200?\n\nReasoning step (step 3, calculation): 25% = 0.25. 0.25 * 200 = 50."
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=256)
with torch.no_grad():
    score = model(**inputs).logits.item()  # one scalar logit per input

print(f"Score: {score:.3f}")  # Positive = useful, negative = wasteful
```
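
To score a whole reasoning chain, each step needs to be wrapped in the same prompt layout as the example above. The sketch below is one way to do that. The template is inferred from the single example string in this card, and `format_step`, `score_chain`, and `stub_score` are hypothetical helper names, not part of any released package. The scorer is passed in as a function so the sketch runs without downloading weights.

```python
from typing import Callable, List, Tuple


def format_step(question: str, index: int, step_type: str, content: str) -> str:
    """Build the input string; layout inferred from the usage example."""
    return (
        f"Question: {question}\n\n"
        f"Reasoning step (step {index}, {step_type}): {content}"
    )


def score_chain(
    question: str,
    steps: List[Tuple[str, str]],
    score_fn: Callable[[str], float],
) -> List[Tuple[int, str, float]]:
    """Score every (step_type, content) pair in order, numbering from 1."""
    results = []
    for index, (step_type, content) in enumerate(steps, start=1):
        text = format_step(question, index, step_type, content)
        results.append((index, step_type, score_fn(text)))
    return results


def stub_score(text: str) -> float:
    # Stand-in scorer (assumption): replace with a call to the real model.
    return 1.0 if "=" in text else -1.0


chain = [
    ("thinking", "I need to find 25% of 200."),
    ("calculation", "25% = 0.25. 0.25 * 200 = 50."),
]
chain_scores = score_chain("What is 25% of 200?", chain, stub_score)
for index, step_type, score in chain_scores:
    print(f"step {index} ({step_type}): {score:+.2f}")
```

To use the real model, pass a function that tokenizes its argument and returns `model(**inputs).logits.item()`, as in the snippet above.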

## Example Scores

| Step Type | Content | Score | Label |
|-----------|---------|-------|-------|
| thinking | "I need to find 25% of 200..." | -0.33 | WASTEFUL |
| calculation | "25/100 = 0.25. 0.25 * 200 = 50" | +3.21 | USEFUL |
| conclusion | "The answer is 50" | +3.25 | USEFUL |
| verification | "Let me double-check: 200/4 = 50" | +1.08 | USEFUL |

## Training Details

- **Base model:** Qwen/Qwen2.5-0.5B
- **Method:** LoRA (r=16, alpha=32, dropout=0.1)
- **Target modules:** `q_proj`, `v_proj`, plus the `score` classification head
- **Epochs:** 5
- **Learning rate:** 1e-4
- **Hardware:** Apple M4 (MPS), 105 seconds total

## The Pipeline

```
ThinkTank GWAP (19 users, 206 judgments)
    → Consensus labels (165 steps)
    → Reward pairs (115 chosen/rejected)
    → This PRM (95.7% pairwise accuracy on 23 eval pairs)
    → Score any LLM reasoning chain
```
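
The two middle stages can be sketched as follows. This card does not describe the aggregation rule or how pairs are formed, so everything here is an assumption: the sketch uses a simple majority vote (dropping ties) and pairs every useful step with every wasteful step from the same chain. The names `consensus_label` and `build_pairs`, and the example votes, are hypothetical.

```python
from collections import Counter
from itertools import product


def consensus_label(votes):
    """Majority vote over player judgments; returns None on a tie or no votes."""
    if not votes:
        return None
    ranked = Counter(votes).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None  # no clear majority
    return ranked[0][0]


def build_pairs(labelled_steps):
    """Pair each useful step (chosen) with each wasteful step (rejected)."""
    useful = [step for step, label in labelled_steps if label == "useful"]
    wasteful = [step for step, label in labelled_steps if label == "wasteful"]
    return [{"chosen": u, "rejected": w} for u, w in product(useful, wasteful)]


# Made-up judgments for one reasoning chain.
judgments = {
    "I need to find 25% of 200...": ["wasteful", "wasteful", "useful"],
    "25/100 = 0.25. 0.25 * 200 = 50": ["useful", "useful"],
    "The answer is 50": ["useful", "wasteful"],  # tie, so this step is dropped
}

labelled = [(step, consensus_label(votes)) for step, votes in judgments.items()]
pairs = build_pairs(labelled)
print(pairs)
```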

## Links

- [ThinkTank Game](https://thinktanklab.quest) — Play and contribute labels
- [Step Labels Dataset](https://huggingface.co/datasets/vanthienha199/thinktank-reasoning-labels)
- [Reward Pairs Dataset](https://huggingface.co/datasets/vanthienha199/thinktank-reward-pairs)

## Citation

```bibtex
@misc{thinktank-prm-2026,
  title={ThinkTank PRM: A Process Reward Model Trained on Crowdsourced Reasoning Labels},
  author={Ha Le},
  year={2026},
  url={https://huggingface.co/vanthienha199/thinktank-prm-qwen2.5-0.5b}
}
```