vanthienha199 committed
Commit 32abb3a (verified)
1 Parent(s): a949a73

Add model card

Files changed (1)
  1. README.md +75 -37
README.md CHANGED
@@ -1,61 +1,99 @@
1
  ---
 
2
  base_model: Qwen/Qwen2.5-0.5B
3
  library_name: peft
4
- model_name: thinktank-prm
5
  tags:
6
- - base_model:adapter:Qwen/Qwen2.5-0.5B
7
- - lora
8
- - reward-trainer
9
- - transformers
10
- - trl
11
- licence: license
 
 
 
 
12
  ---
13
 
14
- # Model Card for thinktank-prm
15
 
16
- This model is a fine-tuned version of [Qwen/Qwen2.5-0.5B](https://huggingface.co/Qwen/Qwen2.5-0.5B).
17
- It has been trained using [TRL](https://github.com/huggingface/trl).
18
 
19
- ## Quick start
20
 
21
- ```python
22
- from transformers import pipeline
23
 
24
- text = "The capital of France is Paris."
25
- rewarder = pipeline(model="None", device="cuda")
26
- output = rewarder(text)[0]
27
- print(output["score"])
28
- ```
 
 
29
 
30
- ## Training procedure
31
 
32
 
 
33
 
34
 
 
35
 
36
- This model was trained with Reward.
 
 
 
 
 
37
 
38
- ### Framework versions
39
 
40
- - PEFT 0.18.1
41
- - TRL: 0.29.1
42
- - Transformers: 5.3.0
43
- - Pytorch: 2.10.0
44
- - Datasets: 4.8.3
45
- - Tokenizers: 0.22.2
 
46
 
47
- ## Citations
48
 
 
 
 
49
 
 
50
 
51
- Cite TRL as:
52
-
53
  ```bibtex
54
- @software{vonwerra2020trl,
55
- title = {{TRL: Transformers Reinforcement Learning}},
56
- author = {von Werra, Leandro and Belkada, Younes and Tunstall, Lewis and Beeching, Edward and Thrush, Tristan and Lambert, Nathan and Huang, Shengyi and Rasul, Kashif and Gallouédec, Quentin},
57
- license = {Apache-2.0},
58
- url = {https://github.com/huggingface/trl},
59
- year = {2020}
60
  }
61
- ```
 
1
  ---
2
+ license: mit
3
  base_model: Qwen/Qwen2.5-0.5B
4
  library_name: peft
 
5
  tags:
6
+ - process-reward-model
7
+ - reasoning
8
+ - reward-model
9
+ - lora
10
+ - test-time-compute
11
+ - ai-efficiency
12
+ pipeline_tag: text-classification
13
+ datasets:
14
+ - vanthienha199/thinktank-reward-pairs
15
+ - vanthienha199/thinktank-reasoning-labels
16
  ---
17
 
18
+ # ThinkTank PRM — Process Reward Model for Reasoning Efficiency
19
 
20
+ **A reward model that scores reasoning steps as useful or wasteful.**
 
21
 
22
+ Trained on crowdsourced human judgments from [ThinkTank](https://thinktanklab.quest), a Game With A Purpose where players identify wasteful steps in AI reasoning chains.
23
 
24
+ ## Results
 
25
 
26
+ | Metric | Value |
27
+ |--------|-------|
28
+ | **Pairwise accuracy** | **95.7%** |
29
+ | Eval loss | 0.071 |
30
+ | Training pairs | 92 |
31
+ | Eval pairs | 23 |
32
+ | Training time | 105 seconds |
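Editorial note on the metric: pairwise accuracy for a reward model is usually the fraction of (chosen, rejected) pairs in which the chosen item receives the higher score. The card does not show its evaluation code, so the sketch below is an assumption about the definition, and `pairwise_accuracy` is a hypothetical helper name. With 23 eval pairs, a reported 95.7% is consistent with 22 of 23 pairs ranked correctly.

```python
from typing import Sequence


def pairwise_accuracy(chosen_scores: Sequence[float],
                      rejected_scores: Sequence[float]) -> float:
    """Fraction of pairs where the chosen step outscores the rejected step.

    Assumed definition; the model card does not include its eval code.
    """
    if len(chosen_scores) != len(rejected_scores):
        raise ValueError("score lists must have the same length")
    if not chosen_scores:
        raise ValueError("need at least one pair")
    wins = sum(c > r for c, r in zip(chosen_scores, rejected_scores))
    return wins / len(chosen_scores)


# Illustrative scores only: 22 correctly ranked pairs and 1 misranked pair.
chosen = [1.0] * 22 + [0.0]
rejected = [0.0] * 22 + [1.0]
accuracy_percent = round(100 * pairwise_accuracy(chosen, rejected), 1)
print(accuracy_percent)
```

With so few eval pairs, a single misranked pair moves the metric by about 4.3 percentage points, which is worth keeping in mind when reading the headline number.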
33
 
34
+ ## Usage
35
 
36
+ ```python
37
+ import torch
38
+ from transformers import AutoModelForSequenceClassification, AutoTokenizer
39
+ from peft import PeftModel
40
+
41
+ # Load
42
+ tokenizer = AutoTokenizer.from_pretrained("vanthienha199/thinktank-prm-qwen2.5-0.5b")
43
+ base = AutoModelForSequenceClassification.from_pretrained("Qwen/Qwen2.5-0.5B", num_labels=1)
44
+ model = PeftModel.from_pretrained(base, "vanthienha199/thinktank-prm-qwen2.5-0.5b")
45
+ model.eval()
46
+
47
+ # Score a reasoning step
48
+ text = "Question: What is 25% of 200?\n\nReasoning step (step 3, calculation): 25% = 0.25. 0.25 * 200 = 50."
49
+ inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=256)
50
+ with torch.no_grad():
51
+     score = model(**inputs).logits.item()
52
+
53
+ print(f"Score: {score:.3f}") # Positive = useful, negative = wasteful
54
+ ```
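Editorial note: the usage snippet scores a single step. A minimal sketch of scoring every step in a chain follows. The prompt template is inferred from the one example string in the snippet and may not match the training format exactly. `format_step` and `score_chain` are hypothetical helpers, and `score_fn` stands in for any callable that maps text to a float.

```python
from typing import Callable, List, Sequence, Tuple


def format_step(question: str, index: int, step_type: str, content: str) -> str:
    """Build the input text; template inferred from the card's single example."""
    return (
        f"Question: {question}\n\n"
        f"Reasoning step (step {index}, {step_type}): {content}"
    )


def score_chain(
    question: str,
    steps: Sequence[Tuple[str, str]],
    score_fn: Callable[[str], float],
) -> List[Tuple[int, str, float]]:
    """Score each (step_type, content) pair; step indices start at 1."""
    results = []
    for index, (step_type, content) in enumerate(steps, start=1):
        text = format_step(question, index, step_type, content)
        results.append((index, step_type, score_fn(text)))
    return results


# Stand-in scorer so the sketch runs without downloading the model.
def dummy_score(text: str) -> float:
    return float(len(text))


chain = [
    ("thinking", "I need to find 25% of 200."),
    ("calculation", "25% = 0.25. 0.25 * 200 = 50."),
]
for index, step_type, score in score_chain("What is 25% of 200?", chain, dummy_score):
    print(index, step_type, score)
```

In practice `score_fn` would wrap the tokenizer and model from the usage snippet, returning `model(**inputs).logits.item()` for each formatted step.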
55
 
56
+ ## Example Scores
57
 
58
+ | Step Type | Content | Score | Label |
59
+ |-----------|---------|-------|-------|
60
+ | thinking | "I need to find 25% of 200..." | -0.33 | WASTEFUL |
61
+ | calculation | "25/100 = 0.25. 0.25 * 200 = 50" | +3.21 | USEFUL |
62
+ | conclusion | "The answer is 50" | +3.25 | USEFUL |
63
+ | verification | "Let me double-check: 200/4 = 50" | +1.08 | USEFUL |
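Editorial note: the labels in this table follow the sign convention from the usage snippet (positive is useful, negative is wasteful). A minimal sketch of that mapping is below. `label_from_score` is a hypothetical helper, and treating a score of exactly zero as wasteful is an assumption the card does not address.

```python
def label_from_score(score: float, threshold: float = 0.0) -> str:
    """Map a raw reward score to a label using a sign threshold.

    Assumption: scores at or below the threshold count as WASTEFUL.
    """
    return "USEFUL" if score > threshold else "WASTEFUL"


# Scores copied from the example table in the model card.
example_scores = {
    "thinking": -0.33,
    "calculation": 3.21,
    "conclusion": 3.25,
    "verification": 1.08,
}
labels = {step: label_from_score(score) for step, score in example_scores.items()}
print(labels)
```

The `threshold` parameter is exposed because a stricter cut-off above zero might suit an application that prefers to prune aggressively; the card itself only describes the zero boundary.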
64
 
65
+ ## Training Details
66
 
67
+ - **Base model:** Qwen/Qwen2.5-0.5B
68
+ - **Method:** LoRA (r=16, alpha=32, dropout=0.1)
69
+ - **Target modules:** q_proj, v_proj + score head
70
+ - **Epochs:** 5
71
+ - **Learning rate:** 1e-4
72
+ - **Hardware:** Apple M4 (MPS), 105 seconds total
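Editorial note: the hyperparameters listed above could be expressed as a PEFT `LoraConfig` roughly as follows. This is a sketch, not the author's training script. Mapping "score head" to `modules_to_save=["score"]` and using the `SEQ_CLS` task type are assumptions based on how sequence-classification heads are usually trained alongside LoRA adapters.

```python
# Values taken from the Training Details list; the keyword names are
# LoraConfig parameters from the PEFT library.
lora_kwargs = dict(
    r=16,
    lora_alpha=32,
    lora_dropout=0.1,
    target_modules=["q_proj", "v_proj"],
    modules_to_save=["score"],  # assumption: trains the classification head fully
    task_type="SEQ_CLS",        # assumption: sequence-classification reward head
)

# LoRA scales the low-rank update by lora_alpha / r.
scaling = lora_kwargs["lora_alpha"] / lora_kwargs["r"]
print(f"LoRA scaling factor: {scaling}")

try:
    from peft import LoraConfig

    lora_config = LoraConfig(**lora_kwargs)
except ImportError:
    # PEFT is not installed; the kwargs above still document the setup.
    lora_config = None
```

The epoch count (5) and learning rate (1e-4) belong to the trainer configuration rather than the LoRA config, so they are not shown here.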
73
 
74
+ ## The Pipeline
75
 
76
+ ```
77
+ ThinkTank GWAP (19 users, 206 judgments)
78
+ Consensus labels (165 steps)
79
+ Reward pairs (115 chosen/rejected)
80
+ This PRM (95.7% accuracy)
81
+ Score any LLM reasoning chain
82
+ ```
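Editorial note: the card does not say how judgments become consensus labels or how labels become reward pairs. One plausible reading is a majority vote per step, followed by pairing each useful step with each wasteful step from the same reasoning chain. The sketch below illustrates that reading only; every name in it is hypothetical and the real pipeline may differ.

```python
from collections import Counter, defaultdict
from itertools import product
from typing import Dict, Iterable, List, Tuple


def consensus_labels(judgments: Iterable[Tuple[str, str]]) -> Dict[str, str]:
    """Majority vote per step over (step_id, label) judgments; ties are dropped."""
    votes: Dict[str, Counter] = defaultdict(Counter)
    for step_id, label in judgments:
        votes[step_id][label] += 1
    consensus = {}
    for step_id, counts in votes.items():
        ranked = counts.most_common(2)
        if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
            continue  # no majority, so skip this step
        consensus[step_id] = ranked[0][0]
    return consensus


def reward_pairs(consensus: Dict[str, str],
                 chain_of: Dict[str, str]) -> List[Tuple[str, str]]:
    """Pair every useful step (chosen) with every wasteful step (rejected)
    from the same chain. Labels must be "useful" or "wasteful"."""
    by_chain = defaultdict(lambda: {"useful": [], "wasteful": []})
    for step_id, label in consensus.items():
        by_chain[chain_of[step_id]][label].append(step_id)
    pairs: List[Tuple[str, str]] = []
    for groups in by_chain.values():
        pairs.extend(product(groups["useful"], groups["wasteful"]))
    return pairs


# Toy data, not drawn from the ThinkTank datasets.
judgments = [
    ("s1", "useful"), ("s1", "useful"), ("s1", "wasteful"),
    ("s2", "wasteful"),
    ("s3", "useful"), ("s3", "wasteful"),  # tie, so s3 is dropped
]
chain_of = {"s1": "c1", "s2": "c1", "s3": "c1"}
labels = consensus_labels(judgments)
pairs = reward_pairs(labels, chain_of)
print(labels, pairs)
```

Dropping tied steps is one way a pipeline could go from 206 judgments to fewer consensus-labelled steps, though the card does not confirm that this is what happened.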
83
 
84
+ ## Links
85
 
86
+ - [ThinkTank Game](https://thinktanklab.quest) — Play and contribute labels
87
+ - [Step Labels Dataset](https://huggingface.co/datasets/vanthienha199/thinktank-reasoning-labels)
88
+ - [Reward Pairs Dataset](https://huggingface.co/datasets/vanthienha199/thinktank-reward-pairs)
89
 
90
+ ## Citation
91
 
 
 
92
  ```bibtex
93
+ @misc{thinktank-prm-2026,
94
+ title={ThinkTank PRM: A Process Reward Model Trained on Crowdsourced Reasoning Labels},
95
+ author={Ha Le},
96
+ year={2026},
97
+ url={https://huggingface.co/vanthienha199/thinktank-prm-qwen2.5-0.5b}
 
98
  }
99
+ ```