Azure Advisor Qwen3.5-0.8B (GRPO)

Qwen/Qwen3.5-0.8B fine-tuned with supervised fine-tuning (SFT) followed by rejection-sampling GRPO (Group Relative Policy Optimization).
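This card does not describe the training loop itself. As a rough illustration of the two ideas named above, the sketch below normalizes rewards within a group of sampled completions (the group-relative advantage used by GRPO-style methods) and keeps only the top-scoring completions (rejection sampling). The function names, the `keep` parameter, and the use of the population standard deviation are assumptions made for illustration; they are not taken from the linked repository.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against its own group: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption, some setups differ
    return [(r - mu) / (sigma + eps) for r in rewards]

def rejection_sample(completions, rewards, keep=1):
    """Keep the `keep` highest-reward completions from one group (hypothetical helper)."""
    ranked = sorted(zip(rewards, completions), key=lambda pair: pair[0], reverse=True)
    return [completion for _, completion in ranked[:keep]]

# Toy group of three completions for a single prompt, with made-up rewards.
completions = ["a", "b", "c"]
rewards = [2.0, 4.0, 6.0]
advantages = group_relative_advantages(rewards)
best = rejection_sample(completions, rewards, keep=1)
```

Normalizing within the group means a completion is rewarded for beating its siblings on the same prompt, not for landing on an easy prompt.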

Hill-climbing

iter   train loss   avg gen reward   max-of-N
1      0.0563       6.040            6.040
2      0.0373       5.705            5.794
3      0.0328       7.211            7.211

Prior 0.5B Baseline

Stage                     Qwen2.5-0.5B   Qwen3.5-0.8B
Pre-SFT reward / 10       0.80           TBD
Post-SFT reward / 10      3.43           4.69
GRPO iter 3 gen reward    4.44           4.69
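The comparison with the 0.5B baseline comes down to per-stage differences. The snippet below is a small sketch that computes them from the figures in the table; the `rewards` dictionary and the `gain` helper are illustrative names, and the TBD cell is represented as `None` and left unfilled.

```python
# Rewards copied from the table above; None marks the TBD cell.
# Each tuple is (Qwen2.5-0.5B, Qwen3.5-0.8B).
rewards = {
    "Pre-SFT": (0.80, None),
    "Post-SFT": (3.43, 4.69),
    "GRPO iter 3": (4.44, 4.69),
}

def gain(old, new):
    """Return new - old rounded to 2 decimals, or None if either value is missing."""
    if old is None or new is None:
        return None
    return round(new - old, 2)

gains = {stage: gain(old, new) for stage, (old, new) in rewards.items()}

for stage, g in gains.items():
    print(stage, "n/a" if g is None else f"{g:+.2f}")
```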

Code

https://github.com/thegovind/azure-advisor-model
