---
license: apache-2.0
base_model: Qwen/Qwen2.5-3B-Instruct
tags:
- voice-assistant
- agent
- lora
- ssml
- tool-calling
- motivational-interviewing
- whissle
- peft
datasets:
- WhissleAI/whissle-agent-llm-training-data
language:
- en
- hi
library_name: peft
pipeline_tag: text-generation
model-index:
- name: whissle-agent-lora-3b-test
  results:
  - task:
      type: text-generation
      name: Structured Agent Response Generation
    metrics:
    - name: JSON Valid
      type: accuracy
      value: 1.0
    - name: Has SSML
      type: accuracy
      value: 1.0
    - name: Has Prosody/Emotion
      type: accuracy
      value: 0.98
    - name: Has Tool Calls
      type: accuracy
      value: 0.66
    - name: Tool Match vs GT
      type: accuracy
      value: 0.545
    - name: Has MI Codes
      type: accuracy
      value: 1.0
    - name: Voice Appropriate Length
      type: accuracy
      value: 0.96
---
# Whissle Agent LLM — LoRA Adapter for Qwen2.5-3B-Instruct
A **LoRA adapter** fine-tuned on top of [Qwen/Qwen2.5-3B-Instruct](https://huggingface.co/Qwen/Qwen2.5-3B-Instruct) for the **Whissle AI voice assistant** pipeline.
This model converts structured ASR perception (transcript + emotion + intent + entities) into structured JSON agent responses with:
- **SSML prosody tags** (emotion, rate, pitch, emphasis, break markers) for TTS
- **Tool calls** (calendar, weather, reminders, search, etc.)
- **Motivational Interviewing (MI) codes** for empathetic, behaviorally-informed responses
- **Turn control** markers for conversation flow management
- **Reasoning** field for chain-of-thought (hidden from the user, used for quality monitoring)
Built by [Whissle](https://whissle.ai) as part of the [PromptingNemo](https://github.com/WhissleAI/PromptingNemo) framework.
## Try It
**[Open the Colab notebook](https://colab.research.google.com/drive/16JaubplO7r_pnlBJosniTZdmZSHMdR4x?usp=sharing)** to test the model on a free T4 GPU — no local setup needed.
## Evaluation Results — Base vs LoRA Fine-tuned
| Metric | Base (Qwen 2.5 3B) | LoRA Fine-tuned | Delta |
|---|---|---|---|
| **JSON valid** | 96% | 100% | +4 pp |
| **Has reasoning** | 42% | 96% | +54 pp |
| **Has SSML** | 48% | 100% | +52 pp |
| **Has prosody/emotion** | 0% | 98% | +98 pp |
| **Has break tags** | 8% | 90% | +82 pp |
| **Has tool calls** | 2% | 66% | +64 pp |
| **Tool match (vs GT)** | 0.0% | 54.5% | +54.5 pp |
| **Has MI codes** | 82% | 100% | +18 pp |
| **Voice appropriate** | 70% | 96% | +26 pp |
| **Avg response words** | 27.9 | 19.5 | -8.4 (more concise) |
| **Inference time (50 samples)** | 338s | 414s | +22% slower |
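
The card does not include the evaluation script, so the following is only a rough sketch of how format metrics like the ones in the table could be computed from raw model outputs. It assumes prosody and pauses are expressed with the standard SSML `<prosody>` and `<break>` elements; the function names `score_output` and `aggregate` are hypothetical and not part of the Whissle codebase.

```python
import json
import re

# Assumption: prosody and pauses use the standard SSML <prosody> and <break> elements.
PROSODY_RE = re.compile(r"<prosody\b", re.IGNORECASE)
BREAK_RE = re.compile(r"<break\b", re.IGNORECASE)
TAG_RE = re.compile(r"<[^>]+>")

FLAG_NAMES = ("json_valid", "has_reasoning", "has_prosody", "has_break",
              "has_tool_calls", "has_mi_codes")


def score_output(raw: str) -> dict:
    """Compute per-sample format flags for one raw model output string."""
    flags = dict.fromkeys(FLAG_NAMES, False)
    flags["response_words"] = 0
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError:
        return flags
    if not isinstance(obj, dict):
        # Valid JSON that is not an object is treated as a format failure here.
        return flags
    text = str(obj.get("response", ""))
    flags.update(
        json_valid=True,
        has_reasoning=bool(obj.get("reasoning")),
        has_prosody=bool(PROSODY_RE.search(text)),
        has_break=bool(BREAK_RE.search(text)),
        has_tool_calls=bool(obj.get("tool_calls")),
        has_mi_codes=bool(obj.get("mi_codes_used")),
        # Count spoken words only, with markup removed.
        response_words=len(TAG_RE.sub(" ", text).split()),
    )
    return flags


def aggregate(rows: list) -> dict:
    """Fraction of samples for which each flag is true."""
    return {name: sum(row[name] for row in rows) / len(rows) for name in FLAG_NAMES}


good = json.dumps({
    "reasoning": "Greet the user.",
    "response": '<prosody rate="slow">Hello there</prosody><break time="200ms"/>',
    "tool_calls": [],
    "mi_codes_used": ["GI"],
})
rows = [score_output(good), score_output("not json")]
print(aggregate(rows))
```

Metrics such as tool match against ground truth or voice-appropriate length need reference data and thresholds that the card does not specify, so they are left out of this sketch.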
## Training Details
| Parameter | Value |
|---|---|
| Base model | `Qwen/Qwen2.5-3B-Instruct` |
| Method | QLoRA (PEFT) |
| LoRA rank (r) | 32 |
| LoRA alpha | 64 |
| Target modules | `q_proj`, `k_proj`, `v_proj`, `o_proj` |
| LoRA dropout | 0.05 |
| Epochs | 3 |
| Learning rate | 0.0002 |
| Max sequence length | 2048 |
| Training samples | 5,171 |
| Validation samples | 272 |
| Precision | bf16 |
| Domains | general, finance, sales |
| Hardware | NVIDIA A100 40GB (GKE) |
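
As a rough illustration of what the adapter settings in the table imply, the sketch below collects them in a plain dictionary whose keys mirror the `peft.LoraConfig` argument names, then derives the LoRA scaling factor (`lora_alpha / r`) and an estimate of the adapter's parameter count. The layer dimensions are assumptions taken from the base model's published configuration (hidden size 2048, 36 layers, 2 key/value heads of dimension 128) and should be checked against its `config.json`; the card itself does not report a parameter count.

```python
# Adapter hyperparameters from the Training Details table.
lora_hparams = {
    "r": 32,
    "lora_alpha": 64,
    "target_modules": ["q_proj", "k_proj", "v_proj", "o_proj"],
    "lora_dropout": 0.05,
}

# Standard LoRA scales the low-rank update by alpha / r.
scaling = lora_hparams["lora_alpha"] / lora_hparams["r"]

# Assumed base-model dimensions (verify against the base model's config.json).
HIDDEN = 2048          # hidden size
KV_DIM = 2 * 128       # key/value heads * head dimension
NUM_LAYERS = 36

# (in_features, out_features) of each targeted projection under those assumptions.
proj_shapes = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
}

# Each adapted projection adds two matrices: A (r x in) and B (out x r).
per_layer = sum(
    lora_hparams["r"] * (n_in + n_out)
    for name, (n_in, n_out) in proj_shapes.items()
    if name in lora_hparams["target_modules"]
)
adapter_params = per_layer * NUM_LAYERS

print(f"scaling = {scaling}")
print(f"estimated adapter parameters = {adapter_params:,}")
```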
## How to Use
### Quick Start — LoRA Adapter
```python
import json

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

BASE_MODEL = "Qwen/Qwen2.5-3B-Instruct"
ADAPTER = "WhissleAI/whissle-agent-lora-3b-test"

# Load the base model, then attach the LoRA adapter on top of it.
tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    BASE_MODEL, device_map="auto", torch_dtype=torch.bfloat16, trust_remote_code=True
)
model = PeftModel.from_pretrained(model, ADAPTER)
model.eval()

SYSTEM_PROMPT = (
    "You are Lulu, a helpful AI voice assistant by Whissle. "
    "You receive structured perception from the ASR system showing the user's "
    "transcript, emotion, intent, and context. Generate a JSON response with "
    "turn_control, reasoning, response (with SSML prosody tags), tool_calls, "
    "and mi_codes_used. Keep responses under 2 sentences for voice. "
    "Available tools: search_web, set_reminder, check_calendar, send_message, "
    "play_music, get_weather, set_alarm, create_note, make_call, get_directions, "
    "add_to_list, set_timer, translate."
)

# Structured ASR perception for a single user turn.
perception = {
    "transcript": "What's the weather like in San Francisco?",
    "emotion": "NEUTRAL",
    "speech_act": "QUESTION",
    "generic_intent": "REQUEST",
    "agent_intent": "WEATHER_CHECK",
    "urgency": "LOW",
    "language": "en",
    "domain": "general",
}

# The perception JSON goes in the user message, wrapped in perception tags.
messages = [
    {"role": "system", "content": SYSTEM_PROMPT},
    {"role": "user", "content": f"<|perception|>\n{json.dumps(perception, indent=2)}\n<|/perception|>"},
]

prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=512, temperature=0.7, do_sample=True, top_p=0.9)

# Decode only the newly generated tokens, skipping the prompt.
response = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)

# json.loads raises json.JSONDecodeError if the output is not valid JSON.
agent_response = json.loads(response)
print(json.dumps(agent_response, indent=2))
```
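
The Quick Start passes the decoded text straight to `json.loads`, which fails if a sampled generation wraps the JSON in extra text. A minimal, more tolerant parser might look like the sketch below; `extract_json_object` is a hypothetical helper, not part of the Whissle tooling, and it relies only on the standard library's `json.JSONDecoder.raw_decode`.

```python
import json


def extract_json_object(text: str) -> dict:
    """Return the first parsable JSON object found in `text`.

    Raises ValueError if no JSON object can be decoded.
    """
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            # raw_decode parses one JSON value starting at `start` and
            # ignores any trailing text after it.
            obj, _ = decoder.raw_decode(text, start)
            return obj
        except json.JSONDecodeError:
            # Not a valid object at this brace; try the next one.
            start = text.find("{", start + 1)
    raise ValueError("no JSON object found in model output")


raw = 'Here is the reply: {"turn_control": "RESPOND", "tool_calls": []} Done.'
parsed = extract_json_object(raw)
print(parsed["turn_control"])  # RESPOND
```

Callers can catch `ValueError` and retry generation or fall back to a default response, depending on how the surrounding pipeline handles failures.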
### Merge Adapter Weights (for faster inference)
```python
import torch
from transformers import AutoModelForCausalLM
from peft import PeftModel
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-3B-Instruct", torch_dtype=torch.bfloat16, trust_remote_code=True
)
model = PeftModel.from_pretrained(model, "WhissleAI/whissle-agent-lora-3b-test")
model = model.merge_and_unload()
model.save_pretrained("./whissle-agent-merged")
```
### Domain-Specific System Prompts
The model supports three domains, each with a specialized system prompt:
- **General** — Personal assistant (alarms, weather, reminders, search)
- **Finance** — Collections & payments with MI-adherent empathetic handling
- **Sales** — Consultative selling with objection handling
See the [training code](https://github.com/WhissleAI/PromptingNemo/blob/main/recipes/llm_lora/train_agent_lora.py) for the full domain-specific system prompts.
## Response Format
The model outputs structured JSON:
```json
{
  "turn_control": "RESPOND",
  "reasoning": "Simple on/off request. Confirm action.",
  "response": "Turning off the lamp in the living room.",
  "mi_codes_used": ["GI"],
  "tool_calls": [
    {
      "tool": "turn_off",
      "parameters": {"device": "living_room_lamp"}
    }
  ]
}
```
### Fields
| Field | Description |
|---|---|
| `turn_control` | `RESPOND`, `SELF`, `YIELD`, `INTERRUPT` — controls conversation flow |
| `reasoning` | Chain-of-thought (hidden from user, used for quality monitoring) |
| `response` | SSML-tagged text for TTS with prosody emotion, rate, pitch, and break markers |
| `tool_calls` | Array of tool invocations, each with a `tool` name and a `parameters` object |
| `mi_codes_used` | Motivational Interviewing behavior codes (AFFIRM, REFLECT, SUPPORT, GI, etc.) |
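
Downstream code typically needs to check these fields before acting on a response. The sketch below shows one possible way to validate a parsed response and route its tool calls to handlers. It assumes all five fields are always present, which the card does not state explicitly; `validate_response`, `dispatch_tool_calls`, and the `turn_off` handler are hypothetical names used for illustration.

```python
REQUIRED_FIELDS = ("turn_control", "reasoning", "response", "tool_calls", "mi_codes_used")
TURN_CONTROL_VALUES = {"RESPOND", "SELF", "YIELD", "INTERRUPT"}


def validate_response(resp: dict) -> list:
    """Return a list of problems; an empty list means the response looks well-formed."""
    problems = [f"missing field: {name}" for name in REQUIRED_FIELDS if name not in resp]
    if resp.get("turn_control") not in TURN_CONTROL_VALUES:
        problems.append(f"unexpected turn_control: {resp.get('turn_control')!r}")
    for call in resp.get("tool_calls") or []:
        if not isinstance(call, dict) or "tool" not in call:
            problems.append(f"malformed tool call: {call!r}")
    return problems


def dispatch_tool_calls(resp: dict, handlers: dict) -> list:
    """Run each tool call through a caller-supplied handler; skip unknown tools."""
    results = []
    for call in resp.get("tool_calls") or []:
        handler = handlers.get(call["tool"])
        if handler is None:
            results.append((call["tool"], "skipped: no handler registered"))
            continue
        results.append((call["tool"], handler(**call.get("parameters", {}))))
    return results


example = {
    "turn_control": "RESPOND",
    "reasoning": "Simple on/off request. Confirm action.",
    "response": "Turning off the lamp in the living room.",
    "mi_codes_used": ["GI"],
    "tool_calls": [{"tool": "turn_off", "parameters": {"device": "living_room_lamp"}}],
}
handlers = {"turn_off": lambda device: f"{device} is off"}  # hypothetical handler

print(validate_response(example))
print(dispatch_tool_calls(example, handlers))
```

Skipping tools without a registered handler, rather than raising, is one reasonable choice given that generated tool calls do not always match the ground truth.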
## Perception Input Format
The model expects ASR perception as a structured JSON block wrapped between `<|perception|>` and `<|/perception|>` tags:
```json
{
  "transcript": "user's speech transcript",
  "emotion": "NEUTRAL|HAPPY|SAD|ANGRY|FEARFUL|SURPRISED|DISGUSTED",
  "speech_act": "QUESTION|COMMAND|STATEMENT|GREETING|FAREWELL",
  "generic_intent": "REQUEST|INFORM|CONFIRM|DENY|GREET",
  "agent_intent": "WEATHER_CHECK|ALARM_SET|REMINDER_SET|...",
  "urgency": "LOW|MEDIUM|HIGH|CRITICAL",
  "language": "en|hi|...",
  "domain": "general|finance|sales",
  "entities": [{"entity": "type", "value": "extracted_value"}],
  "mi_behavior": "DIRECT|REFLECT|AFFIRM|..."
}
```
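
A small helper can assemble the user message in this format. The sketch below follows the Quick Start (JSON serialized with `indent=2` between the opening and closing perception tags); `build_perception_message` and its default field values are illustrative assumptions, not part of the Whissle pipeline.

```python
import json

KNOWN_DOMAINS = {"general", "finance", "sales"}


def build_perception_message(transcript: str, **fields) -> dict:
    """Build the chat `user` message carrying a perception block.

    Extra keyword arguments (emotion, urgency, entities, ...) override or
    extend the assumed defaults below.
    """
    perception = {
        "transcript": transcript,
        "emotion": "NEUTRAL",   # assumed default
        "language": "en",       # assumed default
        "domain": "general",    # assumed default
    }
    perception.update(fields)
    if perception["domain"] not in KNOWN_DOMAINS:
        raise ValueError(f"unknown domain: {perception['domain']!r}")
    body = json.dumps(perception, indent=2)
    return {"role": "user", "content": f"<|perception|>\n{body}\n<|/perception|>"}


message = build_perception_message(
    "What's the weather like in San Francisco?",
    speech_act="QUESTION",
    agent_intent="WEATHER_CHECK",
    urgency="LOW",
)
print(message["content"])
```

The returned dictionary can be appended after the system prompt in the `messages` list passed to `tokenizer.apply_chat_template`.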
## Limitations
- Inference with the unmerged adapter is ~22% slower than the base model (414s vs 338s over 50 samples) because of LoRA adapter overhead; use merged weights for production
- Tool call accuracy is 54.5% against ground truth; complex multi-tool scenarios need improvement
- Trained primarily on English and Hinglish; other languages may produce degraded output
- Break tags appear in 90% of responses, so some edge cases may have missing or suboptimal pauses
## Citation
```bibtex
@misc{whissle-agent-lora-2026,
  title={Whissle Agent LLM: LoRA Fine-tuned Qwen2.5-3B for Voice AI Agents},
  author={Whissle AI},
  year={2026},
  url={https://huggingface.co/WhissleAI/whissle-agent-lora-3b-test},
  note={Part of the PromptingNemo framework}
}
```
## License
Apache 2.0