Azure Cloud Solution Architect - Qwen 3.5 0.8B (GRPO LoRA)
A GRPO-trained (Group Relative Policy Optimization) LoRA adapter that gives Qwen 3.5 0.8B structured reasoning capabilities for Azure architecture questions.
This is the second stage of a two-stage pipeline: SFT first taught Azure knowledge, then GRPO taught the model to reason through problems with a structured <REASONING> → <SOLUTION> format.
What This Model Does
- Answers multiple-choice Azure architecture questions with structured reasoning
- Produces output in <REASONING>...</REASONING> and <SOLUTION>...</SOLUTION> format
- References specific Azure services in its reasoning
- Trained with 4 reward signals: format compliance, answer correctness, Azure relevance, reasoning quality
Example Output
Question: Which Azure service handles global load balancing?
A. Azure Load Balancer B. Azure Front Door C. Traffic Manager D. Application Gateway
<REASONING>
Azure Front Door provides global HTTP/HTTPS load balancing with built-in CDN,
WAF, and SSL offloading. It operates at Layer 7 and routes traffic to the
closest healthy backend across regions. Azure Load Balancer is regional (Layer 4),
Traffic Manager is DNS-based (slower failover), and Application Gateway is
regional Layer 7. For global load balancing with low latency, Front Door is ideal.
</REASONING>
<SOLUTION>B</SOLUTION>
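Because the output format is fixed, the answer letter can be pulled out with a small parser. The following is a minimal sketch: the tag names come from this card, but `parse_response` and its regular expressions are illustrative assumptions, not code shipped with the model.

```python
import re

# Tag names come from the model card; the regexes themselves are an assumption.
REASONING_RE = re.compile(r"<REASONING>(.*?)</REASONING>", re.DOTALL)
SOLUTION_RE = re.compile(r"<SOLUTION>\s*([A-D])\s*</SOLUTION>")


def parse_response(text):
    """Return (reasoning, answer_letter); either is None if its tag pair is missing."""
    reasoning = REASONING_RE.search(text)
    solution = SOLUTION_RE.search(text)
    return (
        reasoning.group(1).strip() if reasoning else None,
        solution.group(1) if solution else None,
    )


sample = (
    "<REASONING>\nAzure Front Door provides global HTTP/HTTPS load balancing.\n"
    "</REASONING>\n<SOLUTION>B</SOLUTION>"
)
reasoning, answer = parse_response(sample)
print(answer)
```

Returning `None` for a missing tag, rather than raising, makes it easy to count malformed generations when evaluating a batch.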
Training Details
| Parameter | Value |
|---|---|
| Base Model | unsloth/Qwen3.5-0.8B (with SFT LoRA from stage 1) |
| Method | GRPO with GSPO variant (loss_type=dr_grpo) |
| LoRA Rank | 16 |
| Dataset | thegovind/azure-architecture-grpo-benchmark (200 train / 51 eval) |
| Training Time | ~4 hours 40 minutes on RTX 4090 |
| Steps | 200 |
| Generations per Prompt | 2 |
| Learning Rate | 5e-6 |
| Peak Reward | 5.5 / 7.0 (step 195) |
| Hardware | 1x NVIDIA RTX 4090 (24GB) |
Reward Functions (Rubric)
| Signal | Max Score | What It Measures |
|---|---|---|
| R1 — Format Compliance | +2.0 | Proper <REASONING> and <SOLUTION> XML tags |
| R2 — Answer Correctness | +3.0 | Exact match on A/B/C/D answer letter |
| R3 — Azure Relevance | +1.0 | Mentions relevant Azure services in reasoning |
| R4 — Reasoning Quality | +1.0 | Substantive reasoning (50–500 words) |
| Total | 7.0 | Sum of R1–R4 |
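The rubric above could be implemented roughly as follows. This is a sketch under stated assumptions, not the training code: the card gives only the maximum score per signal, so the half-credit split for R1, the `AZURE_TERMS` keyword list for R3, and the all-or-nothing scoring for R4 are hypothetical choices.

```python
import re

# Hypothetical keyword list for R3; the card does not say how relevance is detected.
AZURE_TERMS = ("azure", "front door", "event hubs", "traffic manager", "cosmos db")


def _between(text, tag):
    """Return the text inside <tag>...</tag>, or None if the pair is absent."""
    match = re.search(rf"<{tag}>(.*?)</{tag}>", text, re.DOTALL)
    return match.group(1).strip() if match else None


def score_response(text, correct_letter):
    """Score one completion against the four-signal rubric (maximum 7.0)."""
    reasoning = _between(text, "REASONING")
    solution = _between(text, "SOLUTION")

    # R1: one point per tag pair is an assumption; the card only gives the +2.0 cap.
    r1 = 1.0 * (reasoning is not None) + 1.0 * (solution is not None)
    # R2: exact match on the answer letter.
    r2 = 3.0 if solution == correct_letter else 0.0
    # R3: any listed Azure term appearing in the reasoning.
    body = (reasoning or "").lower()
    r3 = 1.0 if any(term in body for term in AZURE_TERMS) else 0.0
    # R4: reasoning length within the 50-500 word band from the table.
    words = len((reasoning or "").split())
    r4 = 1.0 if 50 <= words <= 500 else 0.0

    return {"R1": r1, "R2": r2, "R3": r3, "R4": r4, "total": r1 + r2 + r3 + r4}


# A 60-word reasoning block with the correct letter earns every signal.
good = (
    "<REASONING>" + "Azure Front Door routes traffic globally. " * 10
    + "</REASONING><SOLUTION>B</SOLUTION>"
)
print(score_response(good, "B"))
```

Keeping the four signals separate in the returned dictionary makes it possible to see which one is driving the total at a given training step.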
Training Progression
Steps 1-30: Model learns to use <REASONING>/<SOLUTION> tags (R1 improves)
Steps 30-100: Model starts getting correct answers (R2 improves)
Steps 100-200: Reasoning quality and Azure relevance refine (R3, R4 improve)
Peak reward: 5.5/7.0 at step 195
How to Use
!pip install -q --upgrade transformers peft accelerate torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Load the base model, then attach the GRPO LoRA adapter on top of it.
base_model = AutoModelForCausalLM.from_pretrained(
    "unsloth/Qwen3.5-0.8B",
    torch_dtype=torch.float16,
    device_map="auto",
)
model = PeftModel.from_pretrained(base_model, "thegovind/azure-architect-qwen35-0.8b-grpo")
tokenizer = AutoTokenizer.from_pretrained("thegovind/azure-architect-qwen35-0.8b-grpo")

SYSTEM_PROMPT = (
    "You are an expert Azure Cloud Solution Architect. "
    "Provide reasoning in <REASONING></REASONING> tags, "
    "then your answer in <SOLUTION></SOLUTION> tags."
)

question = """Which Azure service is best for real-time fraud detection at scale?
A. Azure Batch
B. Azure Stream Analytics with Event Hubs
C. Azure SQL Database
D. Azure Blob Storage"""

# Build the ChatML prompt by hand: system turn, user turn, then an open assistant turn.
prompt = (
    f"<|im_start|>system\n{SYSTEM_PROMPT}<|im_end|>\n"
    f"<|im_start|>user\n{question}\n\n"
    "Provide reasoning in <REASONING></REASONING> tags "
    "and answer in <SOLUTION></SOLUTION> tags.<|im_end|>\n"
    "<|im_start|>assistant\n"
)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=1024, temperature=0.7, do_sample=True)

# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(output[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))
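When asking several questions, it can help to wrap the prompt construction in a function. The sketch below reproduces the ChatML layout used above; `build_prompt` and `FORMAT_HINT` are hypothetical names introduced for illustration and are not part of `transformers` or `peft`.

```python
SYSTEM_PROMPT = (
    "You are an expert Azure Cloud Solution Architect. "
    "Provide reasoning in <REASONING></REASONING> tags, "
    "then your answer in <SOLUTION></SOLUTION> tags."
)
FORMAT_HINT = (
    "Provide reasoning in <REASONING></REASONING> tags "
    "and answer in <SOLUTION></SOLUTION> tags."
)


def build_prompt(stem, options):
    """Format a question stem and its options as a ChatML prompt (hypothetical helper)."""
    letters = "ABCD"
    if not 2 <= len(options) <= len(letters):
        raise ValueError("expected between 2 and 4 options")
    # Label each option A., B., C., D. on its own line under the stem.
    lines = [stem] + [f"{letter}. {text}" for letter, text in zip(letters, options)]
    question = "\n".join(lines)
    return (
        f"<|im_start|>system\n{SYSTEM_PROMPT}<|im_end|>\n"
        f"<|im_start|>user\n{question}\n\n{FORMAT_HINT}<|im_end|>\n"
        "<|im_start|>assistant\n"
    )


prompt = build_prompt(
    "Which Azure service is best for real-time fraud detection at scale?",
    [
        "Azure Batch",
        "Azure Stream Analytics with Event Hubs",
        "Azure SQL Database",
        "Azure Blob Storage",
    ],
)
print(prompt)
```

The returned string can be passed to the tokenizer exactly like the hand-written `prompt` in the snippet above.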
Two-Stage Pipeline
Stage 1: SFT — "Learn Azure knowledge from 1,678 Q&A pairs"
→ thegovind/azure-architect-qwen35-0.8b
Stage 2: GRPO — "Learn to reason through problems via RL with 4 reward signals"
→ thegovind/azure-architect-qwen35-0.8b-grpo (this model)
Related Models & Resources
| Resource | Link |
|---|---|
| SFT LoRA | thegovind/azure-architect-qwen35-0.8b |
| SFT Merged | thegovind/azure-architect-qwen35-0.8b-merged |
| GRPO LoRA (this) | thegovind/azure-architect-qwen35-0.8b-grpo |
| GRPO Merged | thegovind/azure-architect-qwen35-0.8b-grpo-merged |
| Training Dataset | thegovind/azure-architecture-vqa |
| Benchmark | thegovind/azure-architecture-grpo-benchmark |
What is GRPO?
GRPO (Group Relative Policy Optimization) is a reinforcement learning method for language models. Instead of training a separate "critic" model, as PPO-based RLHF does, it generates multiple answers per question, scores them, and compares each answer's reward with the rest of its group, reinforcing behaviors from better answers and discouraging those from worse ones. We used the GSPO/dr_grpo variant, which is more stable for small models.
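The group-relative comparison can be illustrated with a few lines of arithmetic. This is a simplified sketch of the advantage computation only, not the training loop: `group_advantages` is a hypothetical helper, it uses the population standard deviation for simplicity, and the `scale_by_std=False` branch reflects the mean-only centering described for Dr. GRPO. Whether this run disabled the scaling is not stated in the card.

```python
from statistics import mean, pstdev


def group_advantages(rewards, scale_by_std=True, eps=1e-4):
    """Turn one prompt's group of rewards into per-completion advantages."""
    # The group mean plays the role that a learned critic plays in PPO.
    baseline = mean(rewards)
    centered = [r - baseline for r in rewards]
    if not scale_by_std:
        # Mean-only centering, without dividing by the group spread.
        return centered
    spread = pstdev(rewards)
    return [c / (spread + eps) for c in centered]


# Two generations per prompt, as in the training table: one strong, one weak.
print(group_advantages([5.5, 2.0]))
print(group_advantages([5.5, 2.0], scale_by_std=False))
# Identical rewards carry no learning signal: every advantage is zero.
print(group_advantages([4.0, 4.0]))
```

With only two generations per prompt, the scaled advantages are always close to +1 and -1 whenever the rewards differ, so the size of the reward gap matters only in the unscaled form.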