---
license: apache-2.0
language:
- en
tags:
- azure
- cloud-architecture
- qwen3.5
- grpo
- gspo
- reinforcement-learning
- vision-language-model
base_model: unsloth/Qwen3.5-0.8B
datasets:
- thegovind/azure-architecture-grpo-benchmark
- thegovind/azure-architecture-vqa
pipeline_tag: text-generation
---

# Azure Cloud Solution Architect - Qwen 3.5 0.8B (GRPO Merged)

A **fully merged** Qwen 3.5 0.8B model trained with GRPO (Group Relative Policy Optimization) to act as an Azure Cloud Solution Architect with structured reasoning capabilities. The LoRA adapters are merged into the base weights, so the model is ready for deployment with no adapter loading needed.

## What This Model Does

- Answers multiple-choice Azure architecture questions with structured reasoning
- Produces output in `...` and `...` format
- References specific Azure services in its reasoning
- Trained with 4 reward signals: format compliance, answer correctness, Azure relevance, reasoning quality

## Example Output

```
Question: Which Azure service handles global load balancing?
A. Azure Load Balancer
B. Azure Front Door
C. Traffic Manager
D. Application Gateway

Azure Front Door provides global HTTP/HTTPS load balancing with built-in CDN, WAF, and SSL offloading. It operates at Layer 7 and routes traffic to the closest healthy backend across regions. Azure Load Balancer is regional (Layer 4), Traffic Manager is DNS-based (slower failover), and Application Gateway is regional Layer 7. For global load balancing with low latency, Front Door is ideal.

B
```

## Training Details

| Parameter | Value |
|-----------|-------|
| Base Model | `unsloth/Qwen3.5-0.8B` |
| Method | SFT → GRPO with GSPO variant (`loss_type=dr_grpo`), then merged |
| LoRA Rank | 16 |
| SFT Dataset | [thegovind/azure-architecture-vqa](https://huggingface.co/datasets/thegovind/azure-architecture-vqa) (1,678 train examples) |
| GRPO Dataset | [thegovind/azure-architecture-grpo-benchmark](https://huggingface.co/datasets/thegovind/azure-architecture-grpo-benchmark) (200 train / 51 eval) |
| SFT Training | 42.6 min, 210 steps, loss 0.6517 |
| GRPO Training | ~4 hr 40 min, 200 steps, peak reward 5.5/7.0 |
| Hardware | 1x NVIDIA RTX 4090 (24GB) |
| Total Cost | ~$1.88 on vast.ai |

### Reward Functions (Rubric)

| Signal | Max Score | What It Measures |
|--------|-----------|-----------------|
| R1: Format Compliance | +2.0 | Proper `` and `` XML tags |
| R2: Answer Correctness | +3.0 | Exact match on A/B/C/D answer letter |
| R3: Azure Relevance | +1.0 | Mentions relevant Azure services in reasoning |
| R4: Reasoning Quality | +1.0 | Substantive reasoning (50–500 words) |
| **Total** | **7.0** | |

## How to Use

```python
# Install dependencies first (in a notebook cell):
#   !pip install -q --upgrade transformers accelerate torch
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Load the merged weights; no adapter loading is required.
model = AutoModelForCausalLM.from_pretrained(
    "thegovind/azure-architect-qwen35-0.8b-grpo-merged",
    torch_dtype=torch.float16,
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("thegovind/azure-architect-qwen35-0.8b-grpo-merged")

SYSTEM_PROMPT = (
    "You are an expert Azure Cloud Solution Architect. "
    "Provide reasoning in tags, "
    "then your answer in tags."
)

question = """Which Azure service is best for real-time fraud detection at scale?
A. Azure Batch
B. Azure Stream Analytics with Event Hubs
C. Azure SQL Database
D. Azure Blob Storage"""

# Build a ChatML prompt: system turn, user turn, then an open assistant turn.
prompt = f"<|im_start|>system\n{SYSTEM_PROMPT}<|im_end|>\n<|im_start|>user\n{question}\n\nProvide reasoning in tags and answer in tags.<|im_end|>\n<|im_start|>assistant\n"

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=1024, temperature=0.7, do_sample=True)

# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(output[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))
```

## When to Use This vs. the LoRA Version

| Version | Use When |
|---------|----------|
| **This (merged)** | Deployment, inference servers, GGUF conversion, Foundry Local, vLLM |
| **LoRA** | Further fine-tuning, experimentation, saving storage (43 MB vs 1.6 GB) |

## Two-Stage Training Pipeline

```
Stage 1: SFT - "Learn Azure knowledge from 1,678 Q&A pairs"
  → Supervised Fine-Tuning on Azure Architecture Center content

Stage 2: GRPO - "Learn to reason through problems via RL with 4 reward signals"
  → Reinforcement Learning with structured output format

Merge - LoRA adapters merged into base weights for easy deployment
  → This model
```

## Related Models & Resources

| Resource | Link |
|----------|------|
| SFT LoRA | [thegovind/azure-architect-qwen35-0.8b](https://huggingface.co/thegovind/azure-architect-qwen35-0.8b) |
| SFT Merged | [thegovind/azure-architect-qwen35-0.8b-merged](https://huggingface.co/thegovind/azure-architect-qwen35-0.8b-merged) |
| GRPO LoRA | [thegovind/azure-architect-qwen35-0.8b-grpo](https://huggingface.co/thegovind/azure-architect-qwen35-0.8b-grpo) |
| GRPO Merged (this) | [thegovind/azure-architect-qwen35-0.8b-grpo-merged](https://huggingface.co/thegovind/azure-architect-qwen35-0.8b-grpo-merged) |
| Training Dataset | [thegovind/azure-architecture-vqa](https://huggingface.co/datasets/thegovind/azure-architecture-vqa) |
| Benchmark | [thegovind/azure-architecture-grpo-benchmark](https://huggingface.co/datasets/thegovind/azure-architecture-grpo-benchmark) |
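
The four-signal reward rubric from Training Details (R1 to R4, 7.0 points total) can be sketched as a scoring function over a single completion. This is a minimal illustration, not the training code. Several things here are assumptions: the tag names `reasoning` and `answer` (the card's text does not preserve the actual tag names), the keyword list `AZURE_TERMS`, the even split of R1 across the two tag pairs, and the helper names `_extract` and `score_completion`.

```python
import re

# Assumption: hypothetical tag names; the card does not preserve the real ones.
REASONING_TAG = "reasoning"
ANSWER_TAG = "answer"

# Assumption: illustrative keyword list, not the list used during training.
AZURE_TERMS = (
    "azure", "front door", "event hubs", "stream analytics",
    "traffic manager", "application gateway",
)


def _extract(tag, text):
    """Return the stripped content of the first <tag>...</tag> pair, or None."""
    match = re.search(rf"<{tag}>(.*?)</{tag}>", text, flags=re.DOTALL)
    return match.group(1).strip() if match else None


def score_completion(completion, correct_letter):
    """Score one completion against the four rubric signals (max 7.0)."""
    reasoning = _extract(REASONING_TAG, completion)
    answer = _extract(ANSWER_TAG, completion)

    # R1 (max 2.0): assumed split of 1.0 per well-formed tag pair.
    r1 = (1.0 if reasoning is not None else 0.0) + (1.0 if answer is not None else 0.0)

    # R2 (max 3.0): exact match on the answer letter, case-insensitive.
    r2 = 3.0 if answer is not None and answer.upper() == correct_letter.upper() else 0.0

    # R3 (max 1.0): the reasoning mentions at least one Azure-related term.
    lowered = (reasoning or "").lower()
    r3 = 1.0 if any(term in lowered for term in AZURE_TERMS) else 0.0

    # R4 (max 1.0): reasoning length falls in the 50-500 word range.
    n_words = len((reasoning or "").split())
    r4 = 1.0 if 50 <= n_words <= 500 else 0.0

    scores = {
        "format": r1,
        "correctness": r2,
        "azure_relevance": r3,
        "reasoning_quality": r4,
    }
    scores["total"] = r1 + r2 + r3 + r4
    return scores
```

Under these assumptions, a completion with both tag pairs, the correct letter, an Azure service mention, and 50 to 500 words of reasoning reaches the full 7.0. A bare letter with no tags scores 0.0, which is why format compliance effectively gates the other signals in this sketch.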