---
license: mit
base_model: Qwen/Qwen2.5-0.5B
library_name: peft
tags:
- process-reward-model
- reasoning
- reward-model
- lora
- test-time-compute
- ai-efficiency
pipeline_tag: text-classification
datasets:
- vanthienha199/thinktank-reward-pairs
- vanthienha199/thinktank-reasoning-labels
---

# ThinkTank PRM: Process Reward Model for Reasoning Efficiency

**A reward model that scores reasoning steps as useful or wasteful.**

Trained on crowdsourced human judgments from [ThinkTank](https://thinktanklab.quest), a Game With A Purpose where players identify wasteful steps in AI reasoning chains.

## Results

| Metric | Value |
|--------|-------|
| **Pairwise accuracy** | **95.7%** |
| Eval loss | 0.071 |
| Training pairs | 92 |
| Eval pairs | 23 |
| Training time | 105 seconds |

## Usage

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from peft import PeftModel

# Load
tokenizer = AutoTokenizer.from_pretrained("vanthienha199/thinktank-prm-qwen2.5-0.5b")
base = AutoModelForSequenceClassification.from_pretrained("Qwen/Qwen2.5-0.5B", num_labels=1)
model = PeftModel.from_pretrained(base, "vanthienha199/thinktank-prm-qwen2.5-0.5b")
model.eval()

# Score a reasoning step
text = "Question: What is 25% of 200?\n\nReasoning step (step 3, calculation): 25% = 0.25. 0.25 * 200 = 50."
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=256)
with torch.no_grad():
    score = model(**inputs).logits.item()
print(f"Score: {score:.3f}")  # Positive = useful, negative = wasteful
```

## Example Scores

| Step Type | Content | Score | Label |
|-----------|---------|-------|-------|
| thinking | "I need to find 25% of 200..." | -0.33 | WASTEFUL |
| calculation | "25/100 = 0.25. 0.25 * 200 = 50" | +3.21 | USEFUL |
| conclusion | "The answer is 50" | +3.25 | USEFUL |
| verification | "Let me double-check: 200/4 = 50" | +1.08 | USEFUL |

## Training Details

- **Base model:** Qwen/Qwen2.5-0.5B
- **Method:** LoRA (r=16, alpha=32, dropout=0.1)
- **Target modules:** q_proj, v_proj + score head
- **Epochs:** 5
- **Learning rate:** 1e-4
- **Hardware:** Apple M4 (MPS), 105 seconds total

## The Pipeline

```
ThinkTank GWAP (19 users, 206 judgments)
  → Consensus labels (165 steps)
  → Reward pairs (115 chosen/rejected)
  → This PRM (95.7% accuracy)
  → Score any LLM reasoning chain
```

## Links

- [ThinkTank Game](https://thinktanklab.quest): Play and contribute labels
- [Step Labels Dataset](https://huggingface.co/datasets/vanthienha199/thinktank-reasoning-labels)
- [Reward Pairs Dataset](https://huggingface.co/datasets/vanthienha199/thinktank-reward-pairs)

## Citation

```bibtex
@misc{thinktank-prm-2026,
  title={ThinkTank PRM: A Process Reward Model Trained on Crowdsourced Reasoning Labels},
  author={Ha Le},
  year={2026},
  url={https://huggingface.co/vanthienha199/thinktank-prm-qwen2.5-0.5b}
}
```
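
## Sketch: Input Format, Labels, and Pairwise Accuracy

The following is a minimal sketch, not the training or evaluation code behind this card. It illustrates three things from the sections above: the input layout shown in the Usage example, the sign convention for labels (positive = useful, negative = wasteful), and one common definition of pairwise accuracy (the share of pairs where the chosen step scores higher than the rejected one). The helper names `format_step`, `label_from_score`, and `pairwise_accuracy` are hypothetical, the assumption that every step follows the Usage template is inferred from a single example, and the pair scores at the bottom are made up purely to show the arithmetic for 23 eval pairs.

```python
from typing import Iterable, Tuple


def format_step(question: str, index: int, step_type: str, content: str) -> str:
    """Build the input text in the layout used by the Usage example.

    Assumption: all steps follow this template; the card shows only one instance.
    """
    return (
        f"Question: {question}\n\n"
        f"Reasoning step (step {index}, {step_type}): {content}"
    )


def label_from_score(score: float) -> str:
    """Map a raw score to a label: positive = useful, negative = wasteful."""
    # Treating a score of exactly 0.0 as wasteful is an arbitrary choice made here.
    return "USEFUL" if score > 0 else "WASTEFUL"


def pairwise_accuracy(pairs: Iterable[Tuple[float, float]]) -> float:
    """Fraction of (chosen_score, rejected_score) pairs ranked correctly.

    Assumption: a pair counts as correct when chosen > rejected.
    """
    pairs = list(pairs)
    if not pairs:
        raise ValueError("need at least one pair")
    correct = sum(1 for chosen, rejected in pairs if chosen > rejected)
    return correct / len(pairs)


if __name__ == "__main__":
    # Reproduces the text string from the Usage example.
    print(format_step("What is 25% of 200?", 3, "calculation",
                      "25% = 0.25. 0.25 * 200 = 50."))

    # Scores copied from the Example Scores table.
    for score in (-0.33, 3.21, 3.25, 1.08):
        print(f"{score:+.2f} -> {label_from_score(score)}")

    # Made-up scores: 22 correctly ranked pairs and 1 misranked pair,
    # matching the eval set size of 23.
    toy_pairs = [(1.0, -1.0)] * 22 + [(-0.5, 0.5)]
    print(f"{pairwise_accuracy(toy_pairs):.1%}")  # 22 / 23 -> 95.7%
```

With 23 eval pairs, each pair is worth about 4.3 percentage points, so the reported 95.7% is consistent with a single misranked pair under this definition.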