thegovind/azure-advisor-qwen35-0.8b-grpo

Tags: Text Generation, reinforcement-learning

Azure Advisor Qwen3.5-0.8B (GRPO)

Qwen/Qwen3.5-0.8B fine-tuned with SFT followed by rejection-sampling GRPO.
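
As a rough illustration of what "rejection-sampling GRPO" can involve, the sketch below takes a group of N sampled completions for one prompt, computes group-relative advantages (each reward minus the group mean, divided by the group standard deviation, as in GRPO), and keeps only the highest-reward samples for the next fine-tuning round. The function names, the `keep` parameter, and the example rewards are hypothetical; the actual training code is in the repository linked under Code and may differ.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantages: standardize rewards within one prompt's group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps avoids division by zero when every sample gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]

def rejection_sample(completions, rewards, keep=1):
    """Keep the `keep` highest-reward completions from a group (best-of-N)."""
    ranked = sorted(zip(rewards, completions), key=lambda p: p[0], reverse=True)
    return [completion for _, completion in ranked[:keep]]

# Hypothetical rewards on a 0-10 scale for N = 4 samples of a single prompt.
completions = ["answer A", "answer B", "answer C", "answer D"]
rewards = [3.0, 7.0, 5.0, 5.0]

advantages = group_relative_advantages(rewards)
best = rejection_sample(completions, rewards, keep=1)
print(advantages)  # positive for above-average samples, negative below
print(best)
```

In this toy setup the retained completions would serve as supervised targets for the next iteration, which is one plausible reading of the per-iteration train loss reported below.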

Hill-climbing

iter	train loss	avg gen reward	max-of-N
1	0.0563	6.040	6.040
2	0.0373	5.705	5.794
3	0.0328	7.211	7.211
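
The table shows a reward dip at iteration 2 followed by a gain at iteration 3. A minimal sketch of how a hill-climbing loop might pick the checkpoint to keep, assuming selection is by average generation reward (the card does not state the actual criterion), using the values from the table above:

```python
# Values copied from the hill-climbing table.
iterations = [
    {"iter": 1, "train_loss": 0.0563, "avg_gen_reward": 6.040, "max_of_n": 6.040},
    {"iter": 2, "train_loss": 0.0373, "avg_gen_reward": 5.705, "max_of_n": 5.794},
    {"iter": 3, "train_loss": 0.0328, "avg_gen_reward": 7.211, "max_of_n": 7.211},
]

def best_iteration(rows, key="avg_gen_reward"):
    """Return the row with the highest value for `key` (assumed criterion)."""
    return max(rows, key=lambda row: row[key])

# Report the reward change between consecutive iterations.
for prev, cur in zip(iterations, iterations[1:]):
    delta = cur["avg_gen_reward"] - prev["avg_gen_reward"]
    print(f"iter {prev['iter']} -> {cur['iter']}: reward change {delta:+.3f}")

best = best_iteration(iterations)
print("best iteration:", best["iter"])
```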

Comparison with Prior 0.5B Baseline

Stage	Qwen2.5-0.5B	Qwen3.5-0.8B
Pre-SFT reward / 10	0.80	TBD
Post-SFT reward / 10	3.43	4.69
GRPO iter 3 gen reward	4.44	4.69
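
To make the comparison concrete, the snippet below computes the per-stage reward difference between the two models from the table, treating the TBD cell as missing rather than guessing a value. The helper name `reward_gain` is illustrative only.

```python
# Rewards out of 10, copied from the comparison table; None marks the TBD cell.
stages = {
    "Pre-SFT": (0.80, None),
    "Post-SFT": (3.43, 4.69),
    "GRPO iter 3": (4.44, 4.69),
}

def reward_gain(small, large):
    """Difference (0.8B minus 0.5B), or None if either value is missing."""
    if small is None or large is None:
        return None
    return round(large - small, 2)

gains = {stage: reward_gain(*pair) for stage, pair in stages.items()}
for stage, gain in gains.items():
    print(stage, "n/a" if gain is None else f"{gain:+.2f}")
```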

Code

https://github.com/thegovind/azure-advisor-model

Model tree for thegovind/azure-advisor-qwen35-0.8b-grpo

Base model: Qwen/Qwen3.5-0.8B-Base
Finetuned: Qwen/Qwen3.5-0.8B
Adapter: this model