Azure Cloud Solution Architect - Qwen 3.5 0.8B (GRPO LoRA)

A LoRA adapter trained with GRPO (Group Relative Policy Optimization) that gives Qwen 3.5 0.8B structured reasoning capabilities for Azure architecture questions.

This is the second stage of a two-stage pipeline: SFT first taught Azure knowledge, then GRPO taught the model to reason through problems with a structured <REASONING> → <SOLUTION> format.

What This Model Does

Answers multiple-choice Azure architecture questions with structured reasoning
Produces output in <REASONING>...</REASONING> and <SOLUTION>...</SOLUTION> format
References specific Azure services in its reasoning
Trained with 4 reward signals: format compliance, answer correctness, Azure relevance, reasoning quality

Example Output

Question: Which Azure service handles global load balancing?
A. Azure Load Balancer  B. Azure Front Door  C. Traffic Manager  D. Application Gateway

<REASONING>
Azure Front Door provides global HTTP/HTTPS load balancing with built-in CDN, 
WAF, and SSL offloading. It operates at Layer 7 and routes traffic to the 
closest healthy backend across regions. Azure Load Balancer is regional (Layer 4), 
Traffic Manager is DNS-based (slower failover), and Application Gateway is 
regional Layer 7. For global load balancing with low latency, Front Door is ideal.
</REASONING>
<SOLUTION>B</SOLUTION>
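
Output in this shape can be split into its two parts with a small parser. The sketch below is illustrative only: `parse_answer` and `ParsedAnswer` are hypothetical names that are not part of the model repository, and the sketch assumes the answer is a single letter from A to D.

```python
import re
from typing import NamedTuple, Optional


class ParsedAnswer(NamedTuple):
    reasoning: Optional[str]  # text between the REASONING tags, or None
    solution: Optional[str]   # answer letter A-D, or None


# Hypothetical helper patterns, assuming one REASONING and one SOLUTION block.
_REASONING_RE = re.compile(r"<REASONING>(.*?)</REASONING>", re.DOTALL)
_SOLUTION_RE = re.compile(r"<SOLUTION>\s*([A-D])\s*</SOLUTION>")


def parse_answer(text: str) -> ParsedAnswer:
    """Extract the reasoning text and answer letter; a missing tag yields None."""
    reasoning = _REASONING_RE.search(text)
    solution = _SOLUTION_RE.search(text)
    return ParsedAnswer(
        reasoning=reasoning.group(1).strip() if reasoning else None,
        solution=solution.group(1) if solution else None,
    )


sample = (
    "<REASONING>\nFront Door operates at Layer 7 across regions.\n</REASONING>\n"
    "<SOLUTION>B</SOLUTION>"
)
parsed = parse_answer(sample)
print(parsed.solution)  # B
```

Returning `None` for a missing tag, rather than raising, keeps the helper usable on malformed generations.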

Training Details

Parameter	Value
Base Model	`unsloth/Qwen3.5-0.8B` (with SFT LoRA from stage 1)
Method	GRPO with GSPO variant (`loss_type=dr_grpo`)
LoRA Rank	16
Dataset	thegovind/azure-architecture-grpo-benchmark (200 train / 51 eval)
Training Time	~4 hours 40 minutes on RTX 4090
Steps	200
Generations per Prompt	2
Learning Rate	5e-6
Peak Reward	5.5 / 7.0 (step 195)
Hardware	1x NVIDIA RTX 4090 (24GB)

Reward Functions (Rubric)

Signal	Max Score	What It Measures
R1 — Format Compliance	+2.0	Proper `<REASONING>` and `<SOLUTION>` XML tags
R2 — Answer Correctness	+3.0	Exact match on A/B/C/D answer letter
R3 — Azure Relevance	+1.0	Mentions relevant Azure services in reasoning
R4 — Reasoning Quality	+1.0	Substantive reasoning (50–500 words)
Total	7.0
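
The rubric above can be sketched as a single scoring function. This is not the training code: it assumes all-or-nothing scoring for each signal, and the `AZURE_TERMS` keyword list and the `score_completion` name are hypothetical stand-ins for checks that are not documented here.

```python
import re

# Assumed keyword list for illustration; the real R3 check may differ.
AZURE_TERMS = (
    "front door", "traffic manager", "load balancer",
    "application gateway", "event hubs", "stream analytics",
)


def score_completion(text: str, correct_letter: str) -> dict:
    """Score one completion against the four rubric signals (max total 7.0)."""
    reasoning = re.search(r"<REASONING>(.*?)</REASONING>", text, re.DOTALL)
    solution = re.search(r"<SOLUTION>\s*([A-D])\s*</SOLUTION>", text)
    reasoning_text = reasoning.group(1) if reasoning else ""
    word_count = len(reasoning_text.split())

    scores = {
        # R1: both tag pairs are present and well formed.
        "format": 2.0 if (reasoning and solution) else 0.0,
        # R2: exact match on the answer letter.
        "correctness": 3.0 if (solution and solution.group(1) == correct_letter) else 0.0,
        # R3: the reasoning mentions at least one known Azure service.
        "azure_relevance": 1.0 if any(t in reasoning_text.lower() for t in AZURE_TERMS) else 0.0,
        # R4: the reasoning length falls in the 50-500 word window.
        "reasoning_quality": 1.0 if 50 <= word_count <= 500 else 0.0,
    }
    scores["total"] = sum(scores.values())
    return scores


completion = (
    "<REASONING>Azure Front Door provides global Layer 7 load balancing."
    "</REASONING><SOLUTION>B</SOLUTION>"
)
print(score_completion(completion, "B")["total"])  # 6.0
```

The example scores 6.0 rather than 7.0 because its reasoning is shorter than 50 words, so R4 contributes nothing.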

Training Progression

Steps 1-30:   Model learns to use <REASONING>/<SOLUTION> tags (R1 improves)
Steps 30-100: Model starts getting correct answers (R2 improves)  
Steps 100-200: Reasoning quality and Azure relevance refine (R3, R4 improve)
Peak reward:  5.5/7.0 at step 195

How to Use

!pip install -q --upgrade transformers peft accelerate torch

from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Load the base model in half precision, placing it on the available device(s).
base_model = AutoModelForCausalLM.from_pretrained(
    "unsloth/Qwen3.5-0.8B",
    torch_dtype=torch.float16,
    device_map="auto"
)
# Attach the GRPO LoRA adapter on top of the base model.
model = PeftModel.from_pretrained(base_model, "thegovind/azure-architect-qwen35-0.8b-grpo")
tokenizer = AutoTokenizer.from_pretrained("thegovind/azure-architect-qwen35-0.8b-grpo")

SYSTEM_PROMPT = (
    "You are an expert Azure Cloud Solution Architect. "
    "Provide reasoning in <REASONING></REASONING> tags, "
    "then your answer in <SOLUTION></SOLUTION> tags."
)

question = """Which Azure service is best for real-time fraud detection at scale?
A. Azure Batch
B. Azure Stream Analytics with Event Hubs
C. Azure SQL Database
D. Azure Blob Storage"""

prompt = f"<|im_start|>system\n{SYSTEM_PROMPT}<|im_end|>\n<|im_start|>user\n{question}\n\nProvide reasoning in <REASONING></REASONING> tags and answer in <SOLUTION></SOLUTION> tags.<|im_end|>\n<|im_start|>assistant\n"

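# Tokenize the prompt and move the tensors to the model's device.
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    # Sample up to 1024 new tokens.
    output = model.generate(**inputs, max_new_tokens=1024, temperature=0.7, do_sample=True)
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(output[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))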
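
When asking several questions, the prompt string above can be produced by a small helper. This is a minimal sketch that reproduces the same ChatML layout. `build_prompt` and `FORMAT_REMINDER` are hypothetical names introduced here for illustration.

```python
SYSTEM_PROMPT = (
    "You are an expert Azure Cloud Solution Architect. "
    "Provide reasoning in <REASONING></REASONING> tags, "
    "then your answer in <SOLUTION></SOLUTION> tags."
)

# Reminder appended to every user turn, copied from the prompt above.
FORMAT_REMINDER = (
    "Provide reasoning in <REASONING></REASONING> tags "
    "and answer in <SOLUTION></SOLUTION> tags."
)


def build_prompt(question: str, system_prompt: str = SYSTEM_PROMPT) -> str:
    """Wrap a question in the system/user/assistant ChatML layout."""
    return (
        f"<|im_start|>system\n{system_prompt}<|im_end|>\n"
        f"<|im_start|>user\n{question}\n\n{FORMAT_REMINDER}<|im_end|>\n"
        "<|im_start|>assistant\n"
    )


prompt = build_prompt(
    "Which Azure service stores unstructured objects?\n"
    "A. Azure Blob Storage\nB. Azure SQL Database"
)
print(prompt)
```

The returned string ends with the opening of the assistant turn, so it can be passed to `tokenizer(...)` and `model.generate(...)` in the same way as the hand-written prompt.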

Two-Stage Pipeline

Stage 1: SFT — "Learn Azure knowledge from 1,678 Q&A pairs"
    → thegovind/azure-architect-qwen35-0.8b

Stage 2: GRPO — "Learn to reason through problems via RL with 4 reward signals"  
    → thegovind/azure-architect-qwen35-0.8b-grpo (this model)

Related Models & Resources

Resource	Link
SFT LoRA	thegovind/azure-architect-qwen35-0.8b
SFT Merged	thegovind/azure-architect-qwen35-0.8b-merged
GRPO LoRA (this)	thegovind/azure-architect-qwen35-0.8b-grpo
GRPO Merged	thegovind/azure-architect-qwen35-0.8b-grpo-merged
Training Dataset	thegovind/azure-architecture-vqa
Benchmark	thegovind/azure-architecture-grpo-benchmark

What is GRPO?

GRPO (Group Relative Policy Optimization) is a reinforcement learning method for language models. Instead of relying on a separate "critic" model, as PPO-based RLHF does, it generates multiple answers per question, scores them, and compares each score against the rest of its group, reinforcing behaviors from better answers and discouraging those from worse ones. We used the GSPO/dr_grpo variant, which is more stable for small models.
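
The group-relative comparison can be illustrated with a few lines of arithmetic. The sketch below is a simplification, not the training implementation: `group_advantages` is a hypothetical name, and it uses the population standard deviation, whereas real implementations may differ. Dr. GRPO, as described by its authors, omits the standard-deviation scaling. Whether this training run did so is not stated here, so the scaling is left as a switch.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def group_advantages(
    rewards: Sequence[float],
    scale_by_std: bool = True,
    eps: float = 1e-4,
) -> List[float]:
    """Turn the rewards of one prompt's completions into relative advantages."""
    baseline = mean(rewards)  # the group average stands in for a critic
    centered = [r - baseline for r in rewards]
    if not scale_by_std:
        return centered  # mean-centered only
    spread = pstdev(rewards)  # assumption: population standard deviation
    return [c / (spread + eps) for c in centered]


# Two generations per prompt, as in the training setup above.
print(group_advantages([6.0, 2.0], scale_by_std=False))  # [2.0, -2.0]
```

With only two generations per prompt, the two advantages are always equal in size and opposite in sign. When both completions earn the same reward, both advantages are zero and that prompt contributes no learning signal.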
