scale_gsm8k_qwen3.5-4b

A GRPO experiment from the TinkerRL-Bench experiment suite, fine-tuning Qwen/Qwen3.5-4B on gsm8k.

Training Details

Base model: Qwen/Qwen3.5-4B
Method: GRPO (Group Relative Policy Optimization)
Platform: Tinker API v0.18.1
Task: gsm8k
Seed: 42
LoRA rank: 32
Learning rate: 3e-05
Group size: 8
Steps: 30
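
As a rough illustration of what the group size setting controls, the sketch below computes group-relative advantages the way GRPO is commonly described: each completion's reward is centered on its group's mean and scaled by the group's standard deviation. The function name `group_relative_advantages`, the `eps` constant, and the example rewards are assumptions made for illustration; they are not taken from the TinkerRL-Bench code or the Tinker API.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Center each reward on the group mean and scale by the group std.

    `eps` (an assumed constant) avoids division by zero when every
    reward in the group is identical.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical groups of 8 completions for one prompt (1.0 = correct answer).
mixed_group = [1.0, 0.0, 1.0, 1.0, 0.0, 1.0, 1.0, 1.0]
solved_group = [1.0] * 8

print(group_relative_advantages(mixed_group))   # positive for correct, negative for wrong
print(group_relative_advantages(solved_group))  # all zeros: no learning signal
```

When every completion in a group earns the same reward, all advantages are zero and the group contributes no gradient. That is one plausible reading of the zero-loss steps figure reported under Results, though this card does not define the metric.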

Results

First-5 avg reward: 96.2%
Last-10 avg reward: 85.0%
Peak reward: 100.0%
Zero-loss steps: 57%
Tinker Run ID: 200e4983-3e9d-5176-87d0-c55dd3c73539:train:0

Reward Trace

[
  1.0,
  0.8125,
  1.0,
  1.0,
  1.0,
  0.5,
  0.5,
  0.875,
  0.625,
  0.8125,
  1.0,
  1.0,
  1.0,
  0.625,
  0.875,
  0.25,
  1.0,
  0.875,
  0.6875,
  0.5625,
  0.9375,
  0.5,
  0.9375,
  0.5,
  1.0,
  1.0,
  0.625,
  1.0,
  1.0,
  1.0
]
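
The summary figures under Results can be re-derived from this trace. The snippet below is a minimal sketch that assumes the averages are plain arithmetic means over the first 5 and last 10 steps; the variable names are illustrative only.

```python
# Per-step mean rewards, copied from the trace above (30 steps).
rewards = [
    1.0, 0.8125, 1.0, 1.0, 1.0, 0.5, 0.5, 0.875, 0.625, 0.8125,
    1.0, 1.0, 1.0, 0.625, 0.875, 0.25, 1.0, 0.875, 0.6875, 0.5625,
    0.9375, 0.5, 0.9375, 0.5, 1.0, 1.0, 0.625, 1.0, 1.0, 1.0,
]

first5_avg = sum(rewards[:5]) / 5      # mean of steps 1-5
last10_avg = sum(rewards[-10:]) / 10   # mean of steps 21-30
peak = max(rewards)                    # best single-step reward

print(f"First-5 avg reward: {first5_avg:.4f}")   # 0.9625
print(f"Last-10 avg reward: {last10_avg:.4f}")   # 0.8500
print(f"Peak reward:        {peak:.4f}")         # 1.0000
```

These values agree with the reported 96.2%, 85.0%, and 100.0% up to rounding. The zero-loss steps percentage cannot be recomputed from per-step means alone, since it depends on per-group rewards that the trace does not include.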

Citation

@misc{tinker-rl-bench-2026,
  title={TinkerRL-Bench: A Unified Benchmark for RL Post-Training},
  author={Arvind C R and Sandhya Jeyaraj and Madhu Kumara L and Mohammad Rafi and Dhruva N Murthy and Arumugam K},
  year={2026},
  url={https://github.com/arvindcr4/tinker-rl-lab}
}
