--- library_name: transformers datasets: - MBZUAI-Paris/short-maths-verl-rlhf-filtered-134400-samples-traces - MBZUAI-Paris/Frugal-Thinking-RL-Data language: - en - ar - fr - es - pt metrics: - accuracy license: apache-2.0 --- # Frugal-Thinking-30B-A3B: Easy Samples as Length Regularizers in Math RLVR **Paper**: Shorter but not Worse: Frugal Reasoning via Easy Samples as Length Regularizers in Math RLVR **Project page**: https://mbzuai-paris.github.io/Frugal-Thinking **Code** is publicly available on [Github](https://github.com/MBZUAI-Paris/Frugal-Thinking). **Base Model**: [Qwen/Qwen3-30B-A3B-Thinking-2507](https://huggingface.co/Qwen/Qwen3-30B-A3B-Thinking-2507) **Authors**: [Abdelaziz Bounhar](https://scholar.google.com/citations?user=MeyZL18AAAAJ&hl=fr) et al. **License**: Apache 2.0

Reasoning Performance Evaluation

Test-Time Scaling — AIME25 Accuracy

# Overview Frugal-Thinking-30B-A3B is a reasoning-optimized variant of Qwen3-30B-A3B-Thinking-2507 trained via **Reinforcement Learning with Verifiable Rewards (RLVR)** on the [Frugal-Thinking dataset](https://huggingface.co/datasets/MBZUAI-Paris/Frugal-Thinking-RL-Data). It introduces **emergent brevity**: t**he model learns to reason efficiently and generate concise, verifiable mathematical solutions—without any explicit length penalty**. By retaining moderately easy problems during training, Frugal-Thinking implicitly regularizes reasoning length, reducing verbosity while preserving accuracy. # Training Setup | Parameter | Value | |:-----------|:------| | **Algorithm** | Group Relative Policy Optimization (GRPO) | | **Reward function** | Verifiable binary reward (exact match of boxed answer) | | **Context length** | 16k tokens | | **Batch size** | 128 | | **Group size (G)** | 16 | | **Learning rate** | 1e-6 | | **Compute** | 250 H200 GPU-days | | **Framework** | [verl](https://github.com/volcengine/verl) | # Training Stages | Stage | Objective | Source | #Samples | Description | |:------|:---------|:--------|:----------|:-------------| | **Stage 1 – Emergent Brevity** | Implicit length regularization | Internal curated mix of math datasets | 14.2 k | Moderately easy verifiable math problems encourage concise reasoning. | | **Stage 2 – Curriculum RLVR** | Progressive learning on harder problems | Filtered subset of [DeepMath-103k](https://huggingface.co/datasets/ChuGyouk/DeepMath-Filtered-59.9K) | 14.5 k | Gradually harder math problems to improve reasoning depth and coverage. | # Performance Across Benchmarks **Evaluation metrics**: Pass@1 (%) and Efficiency-Adjusted Accuracy **Max generation length**: 42k tokens ## Definition: Efficiency-Adjusted Accuracy (EAA) To compare models jointly on accuracy and brevity, we introduce a new metric named **Efficiency-Adjusted Accuracy** (EAA). EAA penalizes unnecessarily long reasoning chains: $\text{EAA}\gamma = a \times \exp\!\left[-\gamma \cdot \frac{L - L{\min}}{L_{\max} - L_{\min}}\right]$ where a is accuracy, $L$ is average output length, and $γ$ controls how strongly long outputs are penalized ($γ$ = 3 in our experiments). Higher EAA means the model solves tasks efficiently, with fewer tokens for similar accuracy. ## Results on Reasoning Benchmarks (42k-token decoding budget) **Format:** `pass@1 | EAA₃` *(IFEval reports average accuracy instead of pass@1)* | Model | Size | GPQA Diamond | AIME25 | Omni-Hard | GSM Plus | IFEval | MATH-500 | **Average** | |:--|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:| | [Qwen3-30B-A3B-Thinking-2507](https://huggingface.co/Qwen/Qwen3-30B-A3B-Thinking-2507) | 30B | **70.71 \| 43.96** | **86.67 \| 13.93** | 08.09 \| 00.63 | 90.29 \| **90.29** | 41.35 \| **41.35** | **97.80 \| 62.73** | 65.82 \| 42.15 | | [Magistral-Small-2509](https://huggingface.co/mistralai/Magistral-Small-2509) | 24B | 62.63 \| **62.63** | 80.00 \| 20.71 | **53.18 \| 11.41** | 88.86 \| 86.42 | 39.71 \| 30.77 | 96.60 \| 81.77 | 70.16 \| 48.95 | | [Magistral-Small-2507](https://huggingface.co/mistralai/Magistral-Small-2507) | 24B | 57.07 \| 02.84 | 53.33 \| 02.66 | 34.10 \| 03.60 | 81.29 \| 04.05 | 41.75 \| 06.76 | 93.20 \| 04.64 | 60.12 \| 04.09 | | [SmolLM3-3B](https://huggingface.co/HuggingFaceTB/SmolLM3-3B) | 3B | 27.78 \| 11.55 | 30.00 \| 13.36 | 35.26 \| 14.20 | 83.48 \| 79.15 | **71.21 \| 03.55** | 90.80 \| 80.20 | 56.42 \| 33.67 | | [Phi-4-mini-reasoning](https://huggingface.co/microsoft/Phi-4-mini-reasoning) | 4B | 30.30 \| 14.55 | 40.00 \| 15.41 | 32.37 \| 18.39 | 87.10 \| 85.54 | 51.58 \| 22.05 | 90.80 \| 79.84 | 55.36 \| 39.30 | | [Qwen3-30B-A3B-Thinking-2507](https://huggingface.co/Qwen/Qwen3-30B-A3B-Thinking-2507) | 4B | 67.17 \| 28.48 | 73.33 \| 05.93 | 04.62 \| 00.23 | 89.05 \| 81.77 | 38.57 \| 20.79 | 97.60 \| 57.08 | 61.72 \| 32.38 | | — | — | — | — | — | — | — | — | — | | **Frugal-Thinking-30B-A3B-Stage-1 (ours)** | 30B | 70.20 \| 39.14 | 83.33 \| 15.41 | 06.94 \| 00.72 | 90.47 \| 87.79 | 41.65 \| 40.54 | 97.20 \| 73.26 | 64.97 \| 42.80 | | **[Frugal-Thinking-30B-A3B-Stage-2](https://huggingface.co/MBZUAI-Paris/Frugal-Thinking-30B-A3B) (ours)** | 30B | 65.65 \| 33.17 | **86.67 \| 44.60** | 46.24 \| 21.62 | **90.57 \| 75.55** | 42.07 \| 36.92 | 97.40 \| 88.78 | **71.43 \| 50.11** | | **Frugal-Thinking-4B-Stage-1 (ours)** | 4B | 63.64 \| 42.21 | 60.00 \| 46.02 | 35.84 \| 31.54 | 89.24 \| 76.59 | 39.91 \| 22.43 | 95.00 \| 86.30 | 63.94 \| 50.85 | | **[Frugal-Thinking-4B-Stage-2](https://huggingface.co/MBZUAI-Paris/Frugal-Thinking-4B) (ours)** | 4B | **70.20 \| 53.84** | 70.00 \| **70.00** | **47.40 \| 47.40** | 89.00 \| 80.06 | 39.49 \| 23.20 | 95.20 \| **95.20** | 68.55 \| **61.22** | ## Average Reasoning Length | Model | Size | Avg Output Length (tokens) | |:--|:--:|:--:| | Qwen3-30B-A3B-Thinking-2507 | 30B | 9 946 | | Magistral-Small-2509 | 24B | 8 100 | | Magistral-Small-2507 | 24B | 17 116 | | SmolLM3-3B | 3B | 8 338 | | Phi-4-mini-reasoning | 4B | 7 458 | | Qwen3-30B-A3B-Thinking-2507 | 4B | 11 491 | | **Frugal-Thinking-30B-A3B-Stage-1 (ours)** | 30B | 9 537 | | **Frugal-Thinking-30B-A3B-Stage-2 (ours)** | 30B | **7 326** | | **Frugal-Thinking-4B-Stage-1 (ours)** | 4B | 6 270 | | **Frugal-Thinking-4B-Stage-2 (ours)** | 4B | **5 712** | ## Conclusions ➡️ Frugal-Thinking-30B-A3B-Stage 2 outperforms all 4B-class baselines in both accuracy and efficiency, achieving similar performance to the 30B MoE model, and better on average. ➡️ ≈ 50–60 % reduction in reasoning length while preserving or improving performance. ## Intended Use This model is designed for **reasoning-intensive workloads**, with a particular focus on controllable efficiency and verifiability: - **Verifiable mathematical reasoning** and competition-style problem solving (e.g., AIME, GSM, MATH, GPQA) - **Reasoning** under constrained or variable test-time compute budgets - **Efficiency–accuracy trade-off analysis** in RLHF / RLVR and test-time scaling studies - **Reasoning length control and compression**, enabling shorter, more cost-efficient chains of thought - **Benchmarking reasoning robustness** across context lengths and difficulty regimes - **Research on test-time compute allocation**, adaptive reasoning, and frugal inference ⚠️ **Not intended for** (but could be used): - Open-ended creative writing or stylistic generation - Conversational agents requiring rich persona or emotional interaction # Citation If you use this model, please cite: ```bibtex @misc{bounhar2025frugalmath, title={Shorter but not Worse: Frugal Reasoning via Easy Samples as Length Regularizers in Math RLVR}, author={Bounhar, Abdelaziz et al.}, year={2025}, journal={arXiv preprint arXiv:2511.01937} } ```