vanthienha199/thinktank-prm-qwen2.5-0.5b

@@ -1,61 +1,99 @@
 ---
 base_model: Qwen/Qwen2.5-0.5B
 library_name: peft
-model_name: thinktank-prm
 tags:
-- base_model:adapter:Qwen/Qwen2.5-0.5B
-- lora
-- reward-trainer
-- transformers
-- trl
-licence: license
 ---
-# Model Card for thinktank-prm
-This model is a fine-tuned version of [Qwen/Qwen2.5-0.5B](https://huggingface.co/Qwen/Qwen2.5-0.5B).
-It has been trained using [TRL](https://github.com/huggingface/trl).
-## Quick start
-```python
-from transformers import pipeline
-text = "The capital of France is Paris."
-rewarder = pipeline(model="None", device="cuda")
-output = rewarder(text)[0]
-print(output["score"])
-```
-## Training procedure
-This model was trained with Reward.
-### Framework versions
-- PEFT 0.18.1
-- TRL: 0.29.1
-- Transformers: 5.3.0
-- Pytorch: 2.10.0
-- Datasets: 4.8.3
-- Tokenizers: 0.22.2
-## Citations
-Cite TRL as:
 ```bibtex
-@software{vonwerra2020trl,
-  title   = {{TRL: Transformers Reinforcement Learning}},
-  author  = {von Werra, Leandro and Belkada, Younes and Tunstall, Lewis and Beeching, Edward and Thrush, Tristan and Lambert, Nathan and Huang, Shengyi and Rasul, Kashif and Gallouédec, Quentin},
-  license = {Apache-2.0},
-  url     = {https://github.com/huggingface/trl},
-  year    = {2020}
 }
-```

 ---
+license: mit
 base_model: Qwen/Qwen2.5-0.5B
 library_name: peft
 tags:
+  - process-reward-model
+  - reasoning
+  - reward-model
+  - lora
+  - test-time-compute
+  - ai-efficiency
+pipeline_tag: text-classification
+datasets:
+  - vanthienha199/thinktank-reward-pairs
+  - vanthienha199/thinktank-reasoning-labels
 ---
+# ThinkTank PRM — Process Reward Model for Reasoning Efficiency
+**A reward model that scores reasoning steps as useful or wasteful.**
+Trained on crowdsourced human judgments from [ThinkTank](https://thinktanklab.quest), a Game With A Purpose (GWAP) in which players identify wasteful steps in AI reasoning chains.
+## Results
+| Metric | Value |
+|--------|-------|
+| **Pairwise accuracy** (eval set) | **95.7%** |
+| Eval loss | 0.071 |
+| Training pairs | 92 |
+| Eval pairs | 23 |
+| Training time | 105 seconds |
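For readers unfamiliar with these metrics, the following is a minimal sketch of how pairwise accuracy and a pairwise loss are commonly computed for a reward model. It assumes the standard Bradley-Terry objective, `-log(sigmoid(chosen - rejected))`; the card does not state the exact loss, and the function names and toy scores are illustrative only.

```python
import math

def pairwise_accuracy(chosen_scores, rejected_scores):
    """Fraction of pairs where the chosen step outscores the rejected one."""
    pairs = list(zip(chosen_scores, rejected_scores))
    correct = sum(1 for c, r in pairs if c > r)
    return correct / len(pairs)

def pairwise_loss(chosen_scores, rejected_scores):
    """Mean Bradley-Terry loss (ASSUMED objective, not confirmed by the card)."""
    pairs = list(zip(chosen_scores, rejected_scores))
    # -log(sigmoid(m)) equals log(1 + exp(-m)) for margin m = chosen - rejected
    return sum(math.log1p(math.exp(-(c - r))) for c, r in pairs) / len(pairs)

# Toy scores, NOT taken from the real eval set
chosen = [3.21, 3.25, 1.08, 0.40]
rejected = [-0.33, 0.10, 1.50, -2.00]
acc = pairwise_accuracy(chosen, rejected)
loss = pairwise_loss(chosen, rejected)
print(f"accuracy={acc:.2f} loss={loss:.3f}")
```

With 23 eval pairs, 95.7% is consistent with 22 correct pairs (22/23 ≈ 0.957), so a single pair moves the metric by more than 4 percentage points.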
+## Usage
+```python
+import torch
+from transformers import AutoModelForSequenceClassification, AutoTokenizer
+from peft import PeftModel
+# Load the tokenizer, the base model with a single-logit classification head, and the LoRA adapter
+tokenizer = AutoTokenizer.from_pretrained("vanthienha199/thinktank-prm-qwen2.5-0.5b")
+base = AutoModelForSequenceClassification.from_pretrained("Qwen/Qwen2.5-0.5B", num_labels=1)
+# Sequence-classification heads need a pad token id to handle padded (batched) inputs
+base.config.pad_token_id = tokenizer.pad_token_id
+model = PeftModel.from_pretrained(base, "vanthienha199/thinktank-prm-qwen2.5-0.5b")
+model.eval()
+# Score a single reasoning step; the input is the question followed by the step
+text = "Question: What is 25% of 200?\n\nReasoning step (step 3, calculation): 25% = 0.25. 0.25 * 200 = 50."
+inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=256)
+with torch.no_grad():
+    score = model(**inputs).logits.item()  # logits has shape (1, 1)
+print(f"Score: {score:.3f}")  # Positive = useful, negative = wasteful
+```
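To score a whole reasoning chain, each step can be formatted the same way as the single-step input above and scored in turn. The sketch below is one possible helper, not part of the released code: `format_step`, `score_chain`, and `dummy_score` are hypothetical names, and numbering steps from 1 is an assumption, since the card does not say how steps are indexed.

```python
def format_step(question, index, step_type, content):
    """Build the input string in the layout shown in the usage example."""
    return (
        f"Question: {question}\n\n"
        f"Reasoning step (step {index}, {step_type}): {content}"
    )

def score_chain(score_fn, question, steps):
    """Score each (step_type, content) pair; steps are numbered from 1 (assumption)."""
    results = []
    for index, (step_type, content) in enumerate(steps, start=1):
        text = format_step(question, index, step_type, content)
        results.append((index, step_type, score_fn(text)))
    return results

# Stand-in scorer so the sketch runs without downloading the model.
# In practice, pass a function that tokenizes `text` and returns
# model(**inputs).logits.item().
def dummy_score(text):
    return float(len(text))

steps = [
    ("thinking", "I need to find 25% of 200..."),
    ("calculation", "25% = 0.25. 0.25 * 200 = 50."),
    ("conclusion", "The answer is 50"),
]
chain_scores = score_chain(dummy_score, "What is 25% of 200?", steps)
for index, step_type, score in chain_scores:
    print(index, step_type, score)
```

Passing the scorer as a function keeps the formatting logic testable on its own and lets the same loop work with any model wrapper.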
+## Example Scores
+| Step Type | Content | Score | Label |
+|-----------|---------|-------|-------|
+| thinking | "I need to find 25% of 200..." | -0.33 | WASTEFUL |
+| calculation | "25/100 = 0.25. 0.25 * 200 = 50" | +3.21 | USEFUL |
+| conclusion | "The answer is 50" | +3.25 | USEFUL |
+| verification | "Let me double-check: 200/4 = 50" | +1.08 | USEFUL |
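The labels in this table follow the sign of the score. A minimal sketch of that mapping is shown below; `label_for` is a hypothetical helper, and treating a score of exactly 0 as wasteful is an assumption, since the card only defines positive and negative scores.

```python
def label_for(score, threshold=0.0):
    """Map a raw PRM score to a label using the sign convention (positive = useful)."""
    # A score equal to the threshold is treated as wasteful (assumption)
    return "USEFUL" if score > threshold else "WASTEFUL"

# Scores copied from the example table
examples = {
    "thinking": -0.33,
    "calculation": 3.21,
    "conclusion": 3.25,
    "verification": 1.08,
}
labels = {step: label_for(score) for step, score in examples.items()}
print(labels)
```

Raising `threshold` above 0 would make the filter stricter, for example to also drop low-confidence steps such as the verification step.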
+## Training Details
+- **Base model:** Qwen/Qwen2.5-0.5B
+- **Method:** LoRA (r=16, alpha=32, dropout=0.1)
+- **Target modules:** `q_proj`, `v_proj`, plus the score head
+- **Epochs:** 5
+- **Learning rate:** 1e-4
+- **Hardware:** Apple M4 (MPS), 105 seconds total
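To give a sense of how small this adapter is, the sketch below estimates the number of LoRA parameters implied by r=16 on `q_proj` and `v_proj`. The architecture values are assumptions about Qwen/Qwen2.5-0.5B and should be checked against its `config.json`; the score head is assumed to be a bias-free linear layer to one logit.

```python
# ASSUMED architecture values for Qwen/Qwen2.5-0.5B; verify against its config.json
HIDDEN = 896
LAYERS = 24
NUM_HEADS = 14
NUM_KV_HEADS = 2
HEAD_DIM = HIDDEN // NUM_HEADS

R = 16  # LoRA rank from the training details

def lora_params(in_features, out_features, rank):
    """LoRA adds A (rank x in) and B (out x rank) next to a frozen weight."""
    return rank * (in_features + out_features)

q_proj = lora_params(HIDDEN, NUM_HEADS * HEAD_DIM, R)
v_proj = lora_params(HIDDEN, NUM_KV_HEADS * HEAD_DIM, R)
adapter_total = LAYERS * (q_proj + v_proj)
score_head = HIDDEN * 1  # one output logit, assuming no bias term

print(f"LoRA adapter parameters: {adapter_total:,}")
print(f"Score head parameters:   {score_head:,}")
```

With alpha=32 and r=16, the standard LoRA scaling factor alpha/r is 2, so the low-rank update is doubled before it is added to the frozen weights.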
+## The Pipeline
+```
+ThinkTank GWAP (19 users, 206 judgments)
+    → Consensus labels (165 steps)
+    → Reward pairs (115 chosen/rejected; 92 train, 23 eval)
+    → This PRM (95.7% pairwise accuracy)
+    → Score any LLM reasoning chain
+```
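The card does not describe how judgments become consensus labels or how pairs are built. The sketch below shows one plausible approach, purely as an illustration: majority vote per step with ties dropped, then pairing useful steps (chosen) with wasteful steps (rejected) from the same question. All names, the aggregation rule, and the toy data are hypothetical.

```python
from collections import Counter, defaultdict

def consensus_labels(judgments):
    """Majority vote per step; steps whose top two labels tie are dropped (assumed rule)."""
    votes = defaultdict(Counter)
    for step_id, label in judgments:
        votes[step_id][label] += 1
    consensus = {}
    for step_id, counts in votes.items():
        ranked = counts.most_common(2)
        if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
            continue  # tie: no consensus for this step
        consensus[step_id] = ranked[0][0]
    return consensus

def reward_pairs(consensus, step_question):
    """Pair each useful step (chosen) with each wasteful step (rejected) of the same question."""
    by_question = defaultdict(lambda: {"useful": [], "wasteful": []})
    for step_id, label in consensus.items():
        by_question[step_question[step_id]][label].append(step_id)
    return [
        (chosen, rejected)
        for groups in by_question.values()
        for chosen in groups["useful"]
        for rejected in groups["wasteful"]
    ]

# Toy data with hypothetical step ids and votes
judgments = [
    ("s1", "wasteful"), ("s1", "wasteful"), ("s1", "useful"),
    ("s2", "useful"), ("s2", "useful"),
    ("s3", "useful"), ("s3", "wasteful"),  # tie, dropped
    ("s4", "useful"),
]
step_question = {"s1": "q1", "s2": "q1", "s3": "q1", "s4": "q1"}
step_labels = consensus_labels(judgments)
pairs = reward_pairs(step_labels, step_question)
print(step_labels, pairs)
```

The 92 training pairs and 23 eval pairs reported in Results add up to the 115 pairs in the pipeline, which corresponds to an 80/20 split.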
+## Links
+- [ThinkTank Game](https://thinktanklab.quest) — Play and contribute labels
+- [Step Labels Dataset](https://huggingface.co/datasets/vanthienha199/thinktank-reasoning-labels)
+- [Reward Pairs Dataset](https://huggingface.co/datasets/vanthienha199/thinktank-reward-pairs)
+## Citation
 ```bibtex
+@misc{thinktank-prm-2026,
+  title={ThinkTank PRM: A Process Reward Model Trained on Crowdsourced Reasoning Labels},
+  author={Ha Le},
+  year={2026},
+  url={https://huggingface.co/vanthienha199/thinktank-prm-qwen2.5-0.5b}
 }
+```