arch_gsm8k_gpt-oss-20b

A GRPO experiment from the TinkerRL-Bench experiment suite.

Training Details

  • Base model: openai/gpt-oss-20b
  • Method: GRPO (Group Relative Policy Optimization)
  • Platform: Tinker API v0.18.1
  • Task: gsm8k (GSM8K grade-school math word problems)
  • Seed: 42
  • LoRA rank: 32
  • Learning rate: 3e-05
  • Group size: 8
  • Steps: 30
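
For readers new to the method, GRPO scores each sampled completion relative to the other completions drawn for the same prompt. The sketch below is a minimal illustration using the commonly described mean/standard-deviation normalization. It is not taken from the TinkerRL-Bench code or the Tinker API; the function name `group_relative_advantages` and the `eps` constant are assumptions made for this example. The sample group has 8 rewards to match the group size listed above.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group: (r - mean) / (std + eps).

    `eps` is an assumed small constant that avoids division by zero
    when every reward in the group is identical.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# One group of 8 completions for a single prompt, with binary correctness rewards.
mixed = group_relative_advantages([1, 1, 0, 1, 0, 1, 1, 1])

# A group in which every completion is correct: all advantages are zero.
uniform = group_relative_advantages([1] * 8)

print(mixed)
print(uniform)
```

When every reward in a group is identical, all advantages are zero and that group contributes no gradient. This may be what the "Zero-loss steps" figure under Results tracks, though the card does not define the term.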

Results

  • First-5 avg reward: 87.5%
  • Last-10 avg reward: 87.5%
  • Peak reward: 100.0%
  • Zero-loss steps: 53%
  • Tinker Run ID: c465d654-a00f-5821-9c11-0e40f0eecf9f:train:0

Reward Trace

[
  0.625,
  0.8125,
  1.0,
  0.9375,
  1.0,
  0.5625,
  0.6875,
  1.0,
  0.5,
  0.9375,
  1.0,
  0.9375,
  1.0,
  0.5,
  0.9375,
  0.5625,
  1.0,
  0.8125,
  0.625,
  1.0,
  1.0,
  0.5,
  0.9375,
  0.5,
  1.0,
  1.0,
  0.8125,
  1.0,
  1.0,
  1.0
]
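
The first three figures under Results can be recomputed from this trace. The snippet below is a small sketch that assumes each entry is the mean reward for one training step, listed in step order; the variable names are illustrative. The trace holds no per-group rewards, so the zero-loss percentage cannot be rederived from it.

```python
# Per-step rewards copied from the Reward Trace above (30 steps).
rewards = [
    0.625, 0.8125, 1.0, 0.9375, 1.0, 0.5625, 0.6875, 1.0, 0.5, 0.9375,
    1.0, 0.9375, 1.0, 0.5, 0.9375, 0.5625, 1.0, 0.8125, 0.625, 1.0,
    1.0, 0.5, 0.9375, 0.5, 1.0, 1.0, 0.8125, 1.0, 1.0, 1.0,
]

first5_avg = sum(rewards[:5]) / 5      # mean of steps 1-5
last10_avg = sum(rewards[-10:]) / 10   # mean of steps 21-30
peak = max(rewards)                    # best single step

print(f"First-5 avg reward: {first5_avg:.1%}")   # 87.5%
print(f"Last-10 avg reward: {last10_avg:.1%}")   # 87.5%
print(f"Peak reward: {peak:.1%}")                # 100.0%
```

The first-5 and last-10 averages are equal, so by this measure the run shows no net change in reward between its start and end.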

Citation

@misc{tinker-rl-bench-2026,
  title={TinkerRL-Bench: A Unified Benchmark for RL Post-Training},
  author={Arvind C R and Sandhya Jeyaraj and Madhu Kumara L and Mohammad Rafi and Dhruva N Murthy and Arumugam K},
  year={2026},
  url={https://github.com/arvindcr4/tinker-rl-lab}
}