Dataset: vanthienha199/thinktank-reward-pairs
How to use vanthienha199/thinktank-prm-qwen2.5-0.5b with PEFT:
from peft import PeftModel
from transformers import AutoModelForSequenceClassification
base_model = AutoModelForSequenceClassification.from_pretrained("Qwen/Qwen2.5-0.5B", num_labels=1)
model = PeftModel.from_pretrained(base_model, "vanthienha199/thinktank-prm-qwen2.5-0.5b")

A reward model that scores reasoning steps as useful or wasteful.
Trained on crowdsourced human judgments from ThinkTank, a Game With A Purpose (GWAP) in which players identify wasteful steps in AI reasoning chains.
| Metric | Value |
|---|---|
| Pairwise accuracy | 95.7% |
| Eval loss | 0.071 |
| Training pairs | 92 |
| Eval pairs | 23 |
| Training time | 105 seconds |
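For readers unfamiliar with the metric, the sketch below shows one common way to compute pairwise accuracy: the fraction of (chosen, rejected) pairs in which the chosen step receives the higher score. The function name `pairwise_accuracy` and the toy scores are illustrative assumptions, not the evaluation code behind the table. With 23 eval pairs, 95.7% corresponds to 22 correctly ranked pairs.

```python
from typing import Sequence


def pairwise_accuracy(chosen_scores: Sequence[float],
                      rejected_scores: Sequence[float]) -> float:
    """Fraction of pairs where the chosen step outscores the rejected step."""
    if len(chosen_scores) != len(rejected_scores):
        raise ValueError("score lists must have the same length")
    if not chosen_scores:
        raise ValueError("need at least one pair")
    correct = sum(c > r for c, r in zip(chosen_scores, rejected_scores))
    return correct / len(chosen_scores)


# Illustrative scores only (not the model's real eval outputs):
# 23 pairs, of which 22 are ranked correctly and 1 is ranked incorrectly.
chosen = [1.0] * 23
rejected = [0.0] * 22 + [2.0]

accuracy = pairwise_accuracy(chosen, rejected)
print(f"{accuracy:.1%}")  # 22/23 -> 95.7%
```

Because the eval set has only 23 pairs, a single pair changes the metric by roughly 4.3 percentage points.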
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from peft import PeftModel
# Load
tokenizer = AutoTokenizer.from_pretrained("vanthienha199/thinktank-prm-qwen2.5-0.5b")
base = AutoModelForSequenceClassification.from_pretrained("Qwen/Qwen2.5-0.5B", num_labels=1)
model = PeftModel.from_pretrained(base, "vanthienha199/thinktank-prm-qwen2.5-0.5b")
model.eval()
# Score a reasoning step
text = "Question: What is 25% of 200?\n\nReasoning step (step 3, calculation): 25% = 0.25. 0.25 * 200 = 50."
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=256)
with torch.no_grad():
    score = model(**inputs).logits.item()
print(f"Score: {score:.3f}") # Positive = useful, negative = wasteful
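To score a whole chain rather than a single step, the input text can be built per step in the same layout as the example string above. The helper names `format_step` and `score_chain` are hypothetical, and numbering steps from 1 is an assumption; the card only shows one example ("step 3, calculation"). A stand-in scorer is used so the sketch runs without downloading the model.

```python
from typing import Callable, List, Sequence, Tuple


def format_step(question: str, step_number: int, step_type: str, content: str) -> str:
    """Build the input text in the same layout as the single-step example."""
    return (
        f"Question: {question}\n\n"
        f"Reasoning step (step {step_number}, {step_type}): {content}"
    )


def score_chain(question: str,
                steps: Sequence[Tuple[str, str]],
                score_fn: Callable[[str], float]) -> List[float]:
    """Score every (step_type, content) pair; numbering from 1 is an assumption."""
    return [
        score_fn(format_step(question, number, step_type, content))
        for number, (step_type, content) in enumerate(steps, start=1)
    ]


# With the model and tokenizer loaded, score_fn could wrap the call shown
# in the usage example (ideally inside torch.no_grad()):
#   lambda text: model(**tokenizer(text, return_tensors="pt",
#                                  truncation=True, max_length=256)).logits.item()
# The stand-in below is NOT the reward model; it only keeps the sketch runnable.
def dummy_score(text: str) -> float:
    return float(len(text))


steps = [
    ("calculation", "25% = 0.25. 0.25 * 200 = 50."),
    ("conclusion", "The answer is 50"),
]
scores = score_chain("What is 25% of 200?", steps, dummy_score)
print(scores)
```

Passing the scorer as a function keeps the formatting logic testable on its own and lets the real model be swapped in later.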
| Step Type | Content | Score | Label |
|---|---|---|---|
| thinking | "I need to find 25% of 200..." | -0.33 | WASTEFUL |
| calculation | "25/100 = 0.25. 0.25 * 200 = 50" | +3.21 | USEFUL |
| conclusion | "The answer is 50" | +3.25 | USEFUL |
| verification | "Let me double-check: 200/4 = 50" | +1.08 | USEFUL |
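The labels in the table are consistent with a simple sign rule on the score. The sketch below applies that rule to the four scores listed above; the function name `label_for` and the treatment of a score of exactly 0 as wasteful are assumptions, since the card only states that positive means useful and negative means wasteful.

```python
def label_for(score: float, threshold: float = 0.0) -> str:
    """Map a reward score to a label; scores at or below the threshold are WASTEFUL."""
    return "USEFUL" if score > threshold else "WASTEFUL"


# Scores copied from the example table.
example_scores = [
    ("thinking", -0.33),
    ("calculation", 3.21),
    ("conclusion", 3.25),
    ("verification", 1.08),
]

labels = [(step_type, label_for(score)) for step_type, score in example_scores]
for step_type, label in labels:
    print(f"{step_type}: {label}")
```

The `threshold` parameter is left adjustable in case a stricter cutoff suits a particular application better.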
ThinkTank GWAP (19 users, 206 judgments)
→ Consensus labels (165 steps)
→ Reward pairs (115 chosen/rejected)
→ This PRM (95.7% accuracy)
→ Score any LLM reasoning chain
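The card does not describe how judgments are aggregated or how pairs are built, so the following is only a sketch of one plausible approach: majority vote per step for the consensus label, then pairing each useful step with each wasteful step from the same chain. The record layout, the function names `consensus_labels` and `reward_pairs`, the tie handling, and the toy data are all assumptions made for illustration.

```python
from collections import Counter, defaultdict
from itertools import product

# Hypothetical judgment records: (chain_id, step_id, verdict).
judgments = [
    ("c1", "s1", "wasteful"), ("c1", "s1", "wasteful"), ("c1", "s1", "useful"),
    ("c1", "s2", "useful"), ("c1", "s2", "useful"),
    ("c1", "s3", "useful"), ("c1", "s3", "wasteful"),  # tie, dropped below
]


def consensus_labels(records):
    """Majority vote per step; tied steps get no label (an assumed policy)."""
    votes = defaultdict(Counter)
    for chain_id, step_id, verdict in records:
        votes[(chain_id, step_id)][verdict] += 1
    labels = {}
    for key, counter in votes.items():
        useful, wasteful = counter["useful"], counter["wasteful"]
        if useful != wasteful:
            labels[key] = "useful" if useful > wasteful else "wasteful"
    return labels


def reward_pairs(labels):
    """Pair each useful step (chosen) with each wasteful step (rejected) per chain."""
    by_chain = defaultdict(lambda: {"useful": [], "wasteful": []})
    for (chain_id, step_id), label in labels.items():
        by_chain[chain_id][label].append(step_id)
    pairs = []
    for chain_id, groups in by_chain.items():
        for chosen, rejected in product(groups["useful"], groups["wasteful"]):
            pairs.append({"chain": chain_id, "chosen": chosen, "rejected": rejected})
    return pairs


labels = consensus_labels(judgments)
pairs = reward_pairs(labels)
print(labels)
print(pairs)
```

Pairing within a chain keeps the question fixed across chosen and rejected steps, so the preference signal reflects step quality rather than question difficulty.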
@misc{thinktank-prm-2026,
title={ThinkTank PRM: A Process Reward Model Trained on Crowdsourced Reasoning Labels},
author={Ha Le},
year={2026},
url={https://huggingface.co/vanthienha199/thinktank-prm-qwen2.5-0.5b}
}
Base model: Qwen/Qwen2.5-0.5B