---
license: apache-2.0
base_model: Qwen/Qwen2.5-3B-Instruct
tags:
- voice-assistant
- agent
- lora
- ssml
- tool-calling
- motivational-interviewing
- whissle
- peft
datasets:
- WhissleAI/whissle-agent-llm-training-data
language:
- en
- hi
library_name: peft
pipeline_tag: text-generation
model-index:
- name: whissle-agent-lora-3b-test
  results:
  - task:
      type: text-generation
      name: Structured Agent Response Generation
    metrics:
    - name: JSON Valid
      type: accuracy
      value: 1.0
    - name: Has SSML
      type: accuracy
      value: 1.0
    - name: Has Prosody/Emotion
      type: accuracy
      value: 0.98
    - name: Has Tool Calls
      type: accuracy
      value: 0.66
    - name: Tool Match vs GT
      type: accuracy
      value: 0.545
    - name: Has MI Codes
      type: accuracy
      value: 1.0
    - name: Voice Appropriate Length
      type: accuracy
      value: 0.96
---

# Whissle Agent LLM: LoRA Adapter for Qwen2.5-3B-Instruct

A **LoRA adapter** fine-tuned on top of [Qwen/Qwen2.5-3B-Instruct](https://huggingface.co/Qwen/Qwen2.5-3B-Instruct) for the **Whissle AI voice assistant** pipeline.

The model converts structured ASR perception (transcript, emotion, intent, and entities) into structured JSON agent responses containing:

- **SSML prosody tags** (emotion, rate, pitch, emphasis, break markers) for TTS
- **Tool calls** (calendar, weather, reminders, search, etc.)
- **Motivational Interviewing (MI) codes** for empathetic, behaviorally-informed responses
- **Turn control** markers for conversation flow management
- **Reasoning** field for chain-of-thought (hidden from the user, used for quality monitoring)

Built by [Whissle](https://whissle.ai) as part of the [PromptingNemo](https://github.com/WhissleAI/PromptingNemo) framework.

## Try It

**[Open the Colab notebook](https://colab.research.google.com/drive/16JaubplO7r_pnlBJosniTZdmZSHMdR4x?usp=sharing)** to test the model on a free T4 GPU, with no local setup needed.
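Once you have a generated response, a quick structural check can catch missing fields before the output reaches TTS or tool execution. The sketch below is illustrative only: `check_agent_response` is a hypothetical helper, not part of the Whissle or PromptingNemo code, and it assumes the five top-level fields listed above plus the `turn_control` values documented in the Fields table of this card.

```python
import json
from typing import Any, Dict, List

# Field names and turn_control values are taken from this model card;
# the helper itself is a hypothetical illustration, not a Whissle API.
REQUIRED_FIELDS = ("turn_control", "reasoning", "response", "tool_calls", "mi_codes_used")
TURN_CONTROL_VALUES = {"RESPOND", "SELF", "YIELD", "INTERRUPT"}


def check_agent_response(payload: Dict[str, Any]) -> List[str]:
    """Return a list of problems; an empty list means the payload looks well-formed."""
    problems = [f"missing field: {name}" for name in REQUIRED_FIELDS if name not in payload]

    turn = payload.get("turn_control")
    if turn is not None and turn not in TURN_CONTROL_VALUES:
        problems.append(f"unexpected turn_control: {turn!r}")

    tool_calls = payload.get("tool_calls", [])
    if not isinstance(tool_calls, list):
        problems.append("tool_calls is not a list")
    else:
        for i, call in enumerate(tool_calls):
            if not isinstance(call, dict) or "tool" not in call:
                problems.append(f"tool_calls[{i}] has no 'tool' name")

    if not isinstance(payload.get("mi_codes_used", []), list):
        problems.append("mi_codes_used is not a list")
    return problems


# Sample payload copied from the Response Format section of this card.
sample = json.loads("""
{
  "turn_control": "RESPOND",
  "reasoning": "Simple on/off request. Confirm action.",
  "response": "Turning off the lamp in the living room.",
  "mi_codes_used": ["GI"],
  "tool_calls": [{"tool": "turn_off", "parameters": {"device": "living_room_lamp"}}]
}
""")
print(check_agent_response(sample))  # prints []
```

Returning a list of problems rather than raising on the first one makes it easier to log every issue for a given sample when spot-checking outputs in bulk.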
## Evaluation Results: Base vs LoRA Fine-tuned

Deltas for percentage metrics are in percentage points (pp).

| Metric | Base (Qwen 2.5 3B) | LoRA Fine-tuned | Delta |
|---|---|---|---|
| **JSON valid** | 96% | 100% | +4 pp |
| **Has reasoning** | 42% | 96% | +54 pp |
| **Has SSML** | 48% | 100% | +52 pp |
| **Has prosody/emotion** | 0% | 98% | +98 pp |
| **Has break tags** | 8% | 90% | +82 pp |
| **Has tool calls** | 2% | 66% | +64 pp |
| **Tool match (vs GT)** | 0.0% | 54.5% | +54.5 pp |
| **Has MI codes** | 82% | 100% | +18 pp |
| **Voice appropriate** | 70% | 96% | +26 pp |
| **Avg response words** | 27.9 | 19.5 | -8.4 (more concise) |
| **Inference time (50 samples)** | 338s | 414s | +22% (slower) |

## Training Details

| Parameter | Value |
|---|---|
| Base model | `Qwen/Qwen2.5-3B-Instruct` |
| Method | QLoRA (PEFT) |
| LoRA rank (r) | 32 |
| LoRA alpha | 64 |
| Target modules | `q_proj`, `k_proj`, `v_proj`, `o_proj` |
| LoRA dropout | 0.05 |
| Epochs | 3 |
| Learning rate | 0.0002 |
| Max sequence length | 2048 |
| Training samples | 5,171 |
| Validation samples | 272 |
| Precision | bf16 |
| Domains | general, finance, sales |
| Hardware | NVIDIA A100 40GB (GKE) |

## How to Use

### Quick Start: LoRA Adapter

```python
import json

import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

BASE_MODEL = "Qwen/Qwen2.5-3B-Instruct"
ADAPTER = "WhissleAI/whissle-agent-lora-3b-test"

# Load the base model first, then attach the LoRA adapter on top of it.
tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    BASE_MODEL, device_map="auto", torch_dtype=torch.bfloat16, trust_remote_code=True
)
model = PeftModel.from_pretrained(model, ADAPTER)
model.eval()

SYSTEM_PROMPT = (
    "You are Lulu, a helpful AI voice assistant by Whissle. "
    "You receive structured perception from the ASR system showing the user's "
    "transcript, emotion, intent, and context. Generate a JSON response with "
    "turn_control, reasoning, response (with SSML prosody tags), tool_calls, "
    "and mi_codes_used. Keep responses under 2 sentences for voice. "
    "Available tools: search_web, set_reminder, check_calendar, send_message, "
    "play_music, get_weather, set_alarm, create_note, make_call, get_directions, "
    "add_to_list, set_timer, translate."
)

# Structured perception as produced by the ASR front end.
perception = {
    "transcript": "What's the weather like in San Francisco?",
    "emotion": "NEUTRAL",
    "speech_act": "QUESTION",
    "generic_intent": "REQUEST",
    "agent_intent": "WEATHER_CHECK",
    "urgency": "LOW",
    "language": "en",
    "domain": "general",
}

# The perception JSON is wrapped in <|perception|> tags inside the user turn.
messages = [
    {"role": "system", "content": SYSTEM_PROMPT},
    {"role": "user", "content": f"<|perception|>\n{json.dumps(perception, indent=2)}\n<|/perception|>"},
]

prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    outputs = model.generate(
        **inputs, max_new_tokens=512, temperature=0.7, do_sample=True, top_p=0.9
    )

# Decode only the newly generated tokens, skipping the prompt.
response = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
agent_response = json.loads(response)
print(json.dumps(agent_response, indent=2))
```

### Merge Adapter Weights (for faster inference)

```python
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-3B-Instruct", torch_dtype=torch.bfloat16, trust_remote_code=True
)
model = PeftModel.from_pretrained(model, "WhissleAI/whissle-agent-lora-3b-test")

# Fold the LoRA weights into the base model and drop the adapter wrappers.
model = model.merge_and_unload()
model.save_pretrained("./whissle-agent-merged")
```

### Domain-Specific System Prompts

The model supports three domains, each with a specialized system prompt:

- **General**: Personal assistant (alarms, weather, reminders, search)
- **Finance**: Collections and payments with MI-adherent empathetic handling
- **Sales**: Consultative selling with objection handling

See the [training code](https://github.com/WhissleAI/PromptingNemo/blob/main/recipes/llm_lora/train_agent_lora.py) for the full
domain-specific system prompts.

## Response Format

The model outputs structured JSON:

```json
{
  "turn_control": "RESPOND",
  "reasoning": "Simple on/off request. Confirm action.",
  "response": "Turning off the lamp in the living room.",
  "mi_codes_used": ["GI"],
  "tool_calls": [
    {
      "tool": "turn_off",
      "parameters": {"device": "living_room_lamp"}
    }
  ]
}
```

### Fields

| Field | Description |
|---|---|
| `turn_control` | One of `RESPOND`, `SELF`, `YIELD`, `INTERRUPT`; controls conversation flow |
| `reasoning` | Chain-of-thought (hidden from the user, used for quality monitoring) |
| `response` | SSML-tagged text for TTS with prosody emotion, rate, pitch, and break markers |
| `tool_calls` | Array of tool invocations, each with a `tool` name and a `parameters` object |
| `mi_codes_used` | Motivational Interviewing behavior codes (AFFIRM, REFLECT, SUPPORT, GI, etc.) |

## Perception Input Format

The model expects ASR perception as a structured JSON block wrapped in `<|perception|>` and `<|/perception|>` tags:

```json
{
  "transcript": "user's speech transcript",
  "emotion": "NEUTRAL|HAPPY|SAD|ANGRY|FEARFUL|SURPRISED|DISGUSTED",
  "speech_act": "QUESTION|COMMAND|STATEMENT|GREETING|FAREWELL",
  "generic_intent": "REQUEST|INFORM|CONFIRM|DENY|GREET",
  "agent_intent": "WEATHER_CHECK|ALARM_SET|REMINDER_SET|...",
  "urgency": "LOW|MEDIUM|HIGH|CRITICAL",
  "language": "en|hi|...",
  "domain": "general|finance|sales",
  "entities": [{"entity": "type", "value": "extracted_value"}],
  "mi_behavior": "DIRECT|REFLECT|AFFIRM|..."
}
```

## Limitations

- Inference is ~22% slower than the base model in the evaluation above (414s vs 338s for 50 samples), attributed to LoRA adapter overhead; use merged weights for production
- Tool match against ground truth is 54.5%, and only 66% of responses include tool calls; complex multi-tool scenarios need improvement
- Trained primarily on English and Hinglish; other languages may produce degraded output
- Break tags are present in 90% of responses; some edge cases may have suboptimal pause timing

## Citation

```bibtex
@misc{whissle-agent-lora-2026,
  title={Whissle Agent LLM: LoRA Fine-tuned Qwen2.5-3B for Voice AI Agents},
  author={Whissle AI},
  year={2026},
  url={https://huggingface.co/WhissleAI/whissle-agent-lora-3b-test},
  note={Part of the PromptingNemo framework}
}
```

## License

Apache 2.0