Fine-tuning Qwen 2.5-1.5B to make structured JSON tool calls using SFT and Reinforcement Learning (GRPO).
Train a small LLM to respond with structured tool calls instead of plain text:
{"name": "get_weather", "arguments": {"location": "Paris"}}
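A minimal sketch of how a response in this format could be checked before dispatching it to a tool. The helper name `parse_tool_call` is hypothetical and is not taken from the training scripts; the only assumption is the shape shown above (a JSON object with a string `name` and an object `arguments`).

```python
import json
from typing import Any, Dict, Optional


def parse_tool_call(text: str) -> Optional[Dict[str, Any]]:
    """Return the tool call as a dict, or None if the text is not a valid call.

    Hypothetical helper: assumes the {"name": ..., "arguments": {...}} shape.
    """
    try:
        obj = json.loads(text.strip())
    except json.JSONDecodeError:
        return None
    # A call must be a JSON object with a string "name" and an object "arguments".
    if not isinstance(obj, dict):
        return None
    if not isinstance(obj.get("name"), str):
        return None
    if not isinstance(obj.get("arguments"), dict):
        return None
    return obj


call = parse_tool_call('{"name": "get_weather", "arguments": {"location": "Paris"}}')
print(call["name"], call["arguments"])
print(parse_tool_call("It is sunny in Paris."))  # plain text is rejected
```

Plain-text answers, JSON arrays, and objects missing either key all come back as `None`, so the caller has a single failure case to handle.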
Three steps: SFT training, GRPO RL training, then evaluation and comparison of the two models.
| Metric | SFT | GRPO | Winner |
|---|---|---|---|
| JSON Valid | 0% | 92% | GRPO |
| Correct Tool | 0% | 50% | GRPO |
| Has Arguments | 0% | 42% | GRPO |
| Clean Output | 0% | 92% | GRPO |
| Avg Quality Score | 0.0 | 0.59 | GRPO |
Key finding: the SFT model answered questions directly in plain text and never used tools. The GRPO model learned to respond with structured JSON tool calls, producing valid JSON on 92% of queries, although it picked the correct tool only half the time. Tested on 12 queries across weather, calculator, search, stocks, translation, and unit conversion.
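The metrics in the table can be approximated with a few per-output checks. The sketch below is an assumption about what each metric means, not the logic in `EXP_STEP3_compare.py`: "JSON Valid" is taken to mean a parseable JSON object appears somewhere in the output, "Clean Output" that the output is nothing but that object, and "Has Arguments" that `arguments` is a non-empty object. The function names are hypothetical, and the quality score is left out because its weighting is not described.

```python
import json
import re
from typing import Dict, List, Tuple


def score_output(text: str, expected_tool: str) -> Dict[str, bool]:
    """Score one model output against the expected tool name (assumed metric definitions)."""
    stripped = text.strip()
    # Take the span from the first "{" to the last "}" so JSON wrapped in prose still counts.
    match = re.search(r"\{.*\}", stripped, flags=re.DOTALL)
    obj = None
    if match:
        try:
            parsed = json.loads(match.group(0))
            if isinstance(parsed, dict):
                obj = parsed
        except json.JSONDecodeError:
            pass
    json_valid = obj is not None
    return {
        "json_valid": json_valid,
        "correct_tool": json_valid and obj.get("name") == expected_tool,
        "has_arguments": json_valid
        and isinstance(obj.get("arguments"), dict)
        and len(obj["arguments"]) > 0,
        # Clean means the JSON object is the entire output, with no surrounding text.
        "clean_output": json_valid and match.group(0) == stripped,
    }


def summarize(samples: List[Tuple[str, str]]) -> Dict[str, float]:
    """Fraction of (output, expected_tool) samples passing each check."""
    rows = [score_output(text, tool) for text, tool in samples]
    return {key: sum(row[key] for row in rows) / len(rows) for key in rows[0]}


samples = [
    ('{"name": "get_weather", "arguments": {"location": "Paris"}}', "get_weather"),
    ("The weather in Paris is sunny.", "get_weather"),
]
print(summarize(samples))
```

With one well-formed call and one plain-text answer, every metric comes out at 0.5, which mirrors how a model that never emits JSON scores 0% across the board.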
!pip install -q transformers datasets peft trl accelerate bitsandbytes
!python EXP_STEP1_sft.py # ~20 min
!python EXP_STEP2_grpo.py # ~30 min
!python EXP_STEP3_compare.py
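For context on the GRPO step: GRPO samples several completions per prompt and scores each one relative to the others in its group, rather than against a learned value function. The sketch below shows only that normalization in plain Python. It is not the code in `EXP_STEP2_grpo.py`; the function name, the example rewards, the use of the population standard deviation, and the `eps` value are all assumptions for illustration.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-4) -> List[float]:
    """Normalize rewards within one group of completions for the same prompt.

    advantage_i = (r_i - mean(r)) / (std(r) + eps)
    eps is an assumed small constant that avoids division by zero
    when every completion in the group receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four completions of one prompt,
# e.g. 1.0 for a valid call with the right tool, 0.0 for plain text.
rewards = [1.0, 0.0, 0.5, 0.0]
for reward, advantage in zip(rewards, group_relative_advantages(rewards)):
    print(f"reward={reward:.1f} advantage={advantage:+.3f}")
```

Completions that beat the group average get a positive advantage and are reinforced; those below it are pushed down. If all completions score the same, every advantage is zero and the group contributes no learning signal.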
| File | Description |
|---|---|
| `EXP_STEP1_sft.py` | SFT training (Colab) |
| `EXP_STEP2_grpo.py` | GRPO RL training (Colab) |
| `EXP_STEP3_compare.py` | Evaluation + comparison |
Qwen2.5-1.5B • QLoRA • GRPO • glaive-function-calling-v2 • trl • peft