thegovind/azure-advisor-sft
SFT + rejection-sampling GRPO on Qwen/Qwen3.5-0.8B.
| iter | train loss | avg gen reward | max-of-N |
|---|---|---|---|
| 1 | 0.0563 | 6.040 | 6.040 |
| 2 | 0.0373 | 5.705 | 5.794 |
| 3 | 0.0328 | 7.211 | 7.211 |

| Stage | Qwen2.5-0.5B | Qwen3.5-0.8B |
|---|---|---|
| Pre-SFT reward / 10 | 0.80 | TBD |
| Post-SFT reward / 10 | 3.43 | 4.69 |
| GRPO iter 3 gen reward | 4.44 | 4.69 |
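
For readers unfamiliar with the metrics in the per-iteration table, here is a minimal sketch of how `avg gen reward` and `max-of-N` could be computed, alongside the rejection-sampling filter and the group-relative advantage used in GRPO-style training. This is not the training code behind the numbers above. It assumes N generations are sampled per prompt and each one receives a scalar reward; every function name and the `threshold` parameter are hypothetical.

```python
from statistics import mean, pstdev


def summarize_group(rewards):
    """Return (average, maximum) reward for one prompt's sampled generations."""
    return mean(rewards), max(rewards)


def iteration_metrics(groups):
    """Average the per-prompt mean and per-prompt max over all prompts.

    `groups` is a list of reward lists, one list per prompt (assumed layout).
    """
    avgs, maxes = zip(*(summarize_group(g) for g in groups))
    return mean(avgs), mean(maxes)


def rejection_sample(generations, rewards, threshold):
    """Keep only generations whose reward is at or above `threshold`.

    The threshold is a hypothetical knob; the source does not state one.
    """
    return [g for g, r in zip(generations, rewards) if r >= threshold]


def group_relative_advantages(rewards, eps=1e-6):
    """Standardize each reward against its own group's mean and std.

    This is the group-relative baseline idea behind GRPO: no learned value
    function, only statistics of the N samples drawn for the same prompt.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]
```

In this sketch, a group with a single sample yields identical average and maximum values by construction, and a group whose rewards are all equal yields zero advantages for every sample.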