Azure Cloud Solution Architect - Qwen 3.5 0.8B (GRPO LoRA)

A GRPO-trained (Group Relative Policy Optimization) LoRA adapter that gives Qwen 3.5 0.8B structured reasoning capabilities for Azure architecture questions.

This is the second stage of a two-stage pipeline: SFT (supervised fine-tuning) first taught the model Azure knowledge, then GRPO taught it to reason through problems in a structured <REASONING>/<SOLUTION> format.

What This Model Does

  • Answers multiple-choice Azure architecture questions with structured reasoning
  • Produces output in <REASONING>...</REASONING> and <SOLUTION>...</SOLUTION> format
  • References specific Azure services in its reasoning
  • Trained with 4 reward signals: format compliance, answer correctness, Azure relevance, reasoning quality

Example Output

Question: Which Azure service handles global load balancing?
A. Azure Load Balancer  B. Azure Front Door  C. Traffic Manager  D. Application Gateway

<REASONING>
Azure Front Door provides global HTTP/HTTPS load balancing with built-in CDN, 
WAF, and SSL offloading. It operates at Layer 7 and routes traffic to the 
closest healthy backend across regions. Azure Load Balancer is regional (Layer 4), 
Traffic Manager is DNS-based (slower failover), and Application Gateway is 
regional Layer 7. For global load balancing with low latency, Front Door is ideal.
</REASONING>
<SOLUTION>B</SOLUTION>
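
The output format above is easy to consume programmatically. The following is a minimal sketch, not part of the released model code: `parse_response` is a hypothetical helper, and the regular expressions assume the tag names and single-letter answers shown in the example.

```python
import re
from typing import Optional, Tuple

# Hypothetical helper; patterns assume the tag names shown in the example output.
_REASONING_RE = re.compile(r"<REASONING>(.*?)</REASONING>", re.DOTALL)
_SOLUTION_RE = re.compile(r"<SOLUTION>\s*([A-D])\s*</SOLUTION>")


def parse_response(text: str) -> Tuple[Optional[str], Optional[str]]:
    """Return (reasoning, answer_letter); either is None if its tag is missing."""
    reasoning = _REASONING_RE.search(text)
    solution = _SOLUTION_RE.search(text)
    return (
        reasoning.group(1).strip() if reasoning else None,
        solution.group(1) if solution else None,
    )


sample = (
    "<REASONING>\nFront Door is a global Layer 7 service.\n</REASONING>\n"
    "<SOLUTION>B</SOLUTION>"
)
reasoning, answer = parse_response(sample)
print(answer)  # B
```

Returning None for a missing tag lets the caller tell a malformed response apart from a wrong answer.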

Training Details

Parameter                Value
Base Model               unsloth/Qwen3.5-0.8B (with SFT LoRA from stage 1)
Method                   GRPO with GSPO variant (loss_type=dr_grpo)
LoRA Rank                16
Dataset                  thegovind/azure-architecture-grpo-benchmark (200 train / 51 eval)
Training Time            ~4 hours 40 minutes on RTX 4090
Steps                    200
Generations per Prompt   2
Learning Rate            5e-6
Peak Reward              5.5 / 7.0 (step 195)
Hardware                 1x NVIDIA RTX 4090 (24GB)

Reward Functions (Rubric)

Signal                    Max Score   What It Measures
R1: Format Compliance     +2.0        Proper <REASONING> and <SOLUTION> XML tags
R2: Answer Correctness    +3.0        Exact match on A/B/C/D answer letter
R3: Azure Relevance       +1.0        Mentions relevant Azure services in reasoning
R4: Reasoning Quality     +1.0        Substantive reasoning (50–500 words)
Total                     7.0
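
To make the rubric concrete, here is one way the four signals could be scored. This is a sketch under stated assumptions, not the reward code used in training: `score_response` and `AZURE_TERMS` are hypothetical, every signal is treated as all-or-nothing, and the keyword list is purely illustrative.

```python
import re
from typing import Dict

# Illustrative keyword list (assumption); the actual relevance check is not published here.
AZURE_TERMS = ("azure", "front door", "traffic manager", "event hubs", "cosmos db")


def score_response(text: str, correct_letter: str) -> Dict[str, float]:
    """Score one completion against the 7-point rubric (all-or-nothing per signal)."""
    reasoning_m = re.search(r"<REASONING>(.*?)</REASONING>", text, re.DOTALL)
    solution_m = re.search(r"<SOLUTION>(.*?)</SOLUTION>", text, re.DOTALL)
    reasoning = reasoning_m.group(1).strip() if reasoning_m else ""
    answer = solution_m.group(1).strip() if solution_m else ""

    scores: Dict[str, float] = {}
    # R1: both tag pairs must be present.
    scores["format"] = 2.0 if reasoning_m and solution_m else 0.0
    # R2: exact match on the answer letter.
    scores["correctness"] = 3.0 if solution_m and answer == correct_letter else 0.0
    # R3: the reasoning mentions at least one Azure term.
    lowered = reasoning.lower()
    scores["relevance"] = 1.0 if any(term in lowered for term in AZURE_TERMS) else 0.0
    # R4: reasoning length falls in the 50-500 word band.
    scores["quality"] = 1.0 if 50 <= len(reasoning.split()) <= 500 else 0.0
    scores["total"] = sum(scores.values())
    return scores


good = "<REASONING>" + " ".join(["Azure"] * 60) + "</REASONING><SOLUTION>B</SOLUTION>"
print(score_response(good, "B")["total"])  # 7.0
```

A real implementation would likely award partial credit (for example, for one of the two tag pairs), which gives the policy a smoother signal early in training.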

Training Progression

Steps 1-30:   Model learns to use <REASONING>/<SOLUTION> tags (R1 improves)
Steps 30-100: Model starts getting correct answers (R2 improves)  
Steps 100-200: Reasoning quality and Azure relevance refine (R3, R4 improve)
Peak reward:  5.5/7.0 at step 195

How to Use

!pip install -q --upgrade transformers peft accelerate torch

from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

base_model = AutoModelForCausalLM.from_pretrained(
    "unsloth/Qwen3.5-0.8B",
    torch_dtype=torch.float16,
    device_map="auto"
)
model = PeftModel.from_pretrained(base_model, "thegovind/azure-architect-qwen35-0.8b-grpo")
tokenizer = AutoTokenizer.from_pretrained("thegovind/azure-architect-qwen35-0.8b-grpo")

SYSTEM_PROMPT = (
    "You are an expert Azure Cloud Solution Architect. "
    "Provide reasoning in <REASONING></REASONING> tags, "
    "then your answer in <SOLUTION></SOLUTION> tags."
)

question = """Which Azure service is best for real-time fraud detection at scale?
A. Azure Batch
B. Azure Stream Analytics with Event Hubs
C. Azure SQL Database
D. Azure Blob Storage"""

prompt = f"<|im_start|>system\n{SYSTEM_PROMPT}<|im_end|>\n<|im_start|>user\n{question}\n\nProvide reasoning in <REASONING></REASONING> tags and answer in <SOLUTION></SOLUTION> tags.<|im_end|>\n<|im_start|>assistant\n"

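# Tokenize the prompt and move the tensors to the model's device
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    # Sample up to 1024 new tokens
    output = model.generate(**inputs, max_new_tokens=1024, temperature=0.7, do_sample=True)
# Decode only the newly generated tokens, skipping the prompt
print(tokenizer.decode(output[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))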
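
When asking several questions, the hand-written prompt string can be wrapped in a small helper. The sketch below reproduces the same ChatML-style layout; `build_prompt` and `FORMAT_REMINDER` are hypothetical names introduced for illustration.

```python
from typing import Dict

SYSTEM_PROMPT = (
    "You are an expert Azure Cloud Solution Architect. "
    "Provide reasoning in <REASONING></REASONING> tags, "
    "then your answer in <SOLUTION></SOLUTION> tags."
)
FORMAT_REMINDER = (
    "Provide reasoning in <REASONING></REASONING> tags "
    "and answer in <SOLUTION></SOLUTION> tags."
)


def build_prompt(question: str, options: Dict[str, str]) -> str:
    """Hypothetical helper: format one multiple-choice question as a ChatML prompt."""
    option_lines = "\n".join(f"{letter}. {text}" for letter, text in options.items())
    user_turn = f"{question}\n{option_lines}\n\n{FORMAT_REMINDER}"
    return (
        f"<|im_start|>system\n{SYSTEM_PROMPT}<|im_end|>\n"
        f"<|im_start|>user\n{user_turn}<|im_end|>\n"
        "<|im_start|>assistant\n"
    )


prompt = build_prompt(
    "Which Azure service is best for real-time fraud detection at scale?",
    {
        "A": "Azure Batch",
        "B": "Azure Stream Analytics with Event Hubs",
        "C": "Azure SQL Database",
        "D": "Azure Blob Storage",
    },
)
print(prompt)
```

The returned string can be passed to the tokenizer exactly like the `prompt` variable in the snippet above.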

Two-Stage Pipeline

Stage 1: SFT — "Learn Azure knowledge from 1,678 Q&A pairs"
    → thegovind/azure-architect-qwen35-0.8b

Stage 2: GRPO — "Learn to reason through problems via RL with 4 reward signals"  
    → thegovind/azure-architect-qwen35-0.8b-grpo (this model)

What is GRPO?

GRPO (Group Relative Policy Optimization) is a reinforcement learning method for language models. Instead of training a separate "critic" model, as PPO-based RLHF does, it generates multiple answers per question, scores each one, and compares them against the group average: behaviors from above-average answers are reinforced and those from below-average answers are discouraged. We used the GSPO/dr_grpo variant (loss_type=dr_grpo), which is more stable for small models.
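
The "compare relatively" step can be sketched in a few lines. This is a simplified illustration, not the trainer's implementation: `group_advantages` is a hypothetical function, the population standard deviation and `eps` value are assumptions, and the `scale_by_std=False` path reflects how the Dr. GRPO variant is usually described (mean-centering without dividing by the standard deviation).

```python
from statistics import mean, pstdev
from typing import List


def group_advantages(
    rewards: List[float], scale_by_std: bool = True, eps: float = 1e-4
) -> List[float]:
    """Hypothetical sketch: advantage of each completion relative to its group."""
    baseline = mean(rewards)  # the group mean replaces a learned critic
    centered = [r - baseline for r in rewards]
    if not scale_by_std:
        return centered  # mean-centered only, as in Dr. GRPO-style losses
    spread = pstdev(rewards)  # population std (assumption)
    return [c / (spread + eps) for c in centered]


# Two generations for one prompt, scored on the 7-point rubric.
rewards = [5.5, 3.0]
print(group_advantages(rewards, scale_by_std=False))  # [1.25, -1.25]
```

With only 2 generations per prompt, a pair that earns identical rewards yields zero advantage for both completions, so that prompt contributes no learning signal at that step.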
