Topic 7: Model Strategy & Model Switching Logic
Audit Date: 2026-02-01
Auditor: Agent Antigravity
Scope: LLM Integration Strategy (Real vs Fallback)
1. The Strategy: "Role-Based Cognitive Routing"
The system does not rely on a single model. Instead, it defines Roles and maps each one to the model best suited to it in terms of capability, cost, and latency.
The 4 Primary Roles
| Role | Primary Model | Why? |
|---|---|---|
| SMART_REASONING | llama-3.3-70b-versatile | Best balance of IQ (70B) and Speed. Handles Context Injection & Persona. |
| FAST_CHAT | llama-3.1-8b-instant | Ultra-low latency (<0.5s) for simple replies or stall tactics. |
| STRUCTURED_OUTPUT | openai/gpt-oss-20b | Fine-tuned for JSON Schema. Used for Scam Detection & Extraction. |
| SAFETY_GUARD | openai/gpt-oss-safeguard-20b | Specialized for Prompt Injection detection. |
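
The role table can be expressed as a small lookup. This is a minimal sketch, not the contents of model_registry.py: the `Role` enum, the `PRIMARY_MODELS` dict, and `model_for()` are hypothetical names chosen for illustration, and only the model IDs come from the table.

```python
from enum import Enum


class Role(Enum):
    """The four roles named in the audit (enum itself is hypothetical)."""
    SMART_REASONING = "SMART_REASONING"
    FAST_CHAT = "FAST_CHAT"
    STRUCTURED_OUTPUT = "STRUCTURED_OUTPUT"
    SAFETY_GUARD = "SAFETY_GUARD"


# Role -> primary model ID, copied from the role table.
PRIMARY_MODELS = {
    Role.SMART_REASONING: "llama-3.3-70b-versatile",
    Role.FAST_CHAT: "llama-3.1-8b-instant",
    Role.STRUCTURED_OUTPUT: "openai/gpt-oss-20b",
    Role.SAFETY_GUARD: "openai/gpt-oss-safeguard-20b",
}


def model_for(role: Role) -> str:
    """Return the primary model ID assigned to a role."""
    return PRIMARY_MODELS[role]
```

Callers would then ask for a role rather than a model name, for example `model_for(Role.FAST_CHAT)`, which keeps model IDs in one place.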
2. Model Switching Logic (The Switchboard)
Implemented in app/core/llm_client.py & model_registry.py
The Switchboard automatically re-routes traffic based on 3 triggers:
A. Volume Trigger (Cost/TPM)
- Logic: If the current request's token count > 70% of the model's TPM (Tokens Per Minute) limit.
- Action: Downgrade from 70B (Versatile) -> 8B (Instant) or 17B (Scout).
- Goal: Prevent HTTP 429 Errors and keep the chat alive.
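
A minimal sketch of the volume check, assuming the rule compares a single request's token count against the model's TPM limit. The function name, its parameters, and the choice of the 8B model as the only fallback are assumptions for illustration; the actual code in llm_client.py may differ.

```python
TPM_THRESHOLD_PERCENT = 70  # from the rule above


def apply_volume_trigger(
    model: str,
    request_tokens: int,
    tpm_limit: int,
    fallback: str = "llama-3.1-8b-instant",  # assumed fallback choice
) -> str:
    """Return the fallback model if the request exceeds 70% of the TPM limit.

    Integer arithmetic avoids floating-point rounding at the boundary.
    """
    if request_tokens * 100 > TPM_THRESHOLD_PERCENT * tpm_limit:
        return fallback
    return model
```

With a hypothetical limit of 10,000 TPM, a 7,001-token request would be downgraded, while a 7,000-token request would stay on the original model.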
B. Context Trigger (Overflow)
- Logic: If conversation history > 100k tokens.
- Action: Switch to moonshotai/kimi-k2-instruct (200k Context Window).
- Goal: Prevent "Context Window Exceeded" errors in long fraud sessions.
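
The context check might look like the sketch below. The `estimate_tokens()` helper is a crude stand-in (roughly four characters per token) and not a real tokenizer; it, the message format, and the function names are assumptions for illustration only.

```python
CONTEXT_SWITCH_TOKENS = 100_000  # threshold from the rule above
LONG_CONTEXT_MODEL = "moonshotai/kimi-k2-instruct"


def estimate_tokens(messages: list[dict]) -> int:
    """Crude estimate: about 4 characters per token (assumption, not a tokenizer)."""
    return sum(len(m["content"]) // 4 for m in messages)


def apply_context_trigger(model: str, messages: list[dict]) -> str:
    """Switch to the long-context model once history passes 100k tokens."""
    if estimate_tokens(messages) > CONTEXT_SWITCH_TOKENS:
        return LONG_CONTEXT_MODEL
    return model
```

In practice the estimate would be replaced with the token count reported by the provider or a proper tokenizer, since the heuristic can be off by a wide margin.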
C. Capability Trigger (Strict Mode)
- Logic: If the prompt requires json_schema (Strict Mode) and the current model doesn't support it.
- Action: Force switch to gpt-oss-20b.
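
The capability check can be sketched as a set lookup. The `SUPPORTS_JSON_SCHEMA` set below is hypothetical: the audit only states that gpt-oss-20b is the strict-mode target, so which other models support json_schema is an assumption left to the real registry.

```python
STRICT_JSON_MODEL = "openai/gpt-oss-20b"

# Hypothetical capability set; only this one entry is implied by the audit.
SUPPORTS_JSON_SCHEMA = {"openai/gpt-oss-20b"}


def apply_capability_trigger(model: str, needs_json_schema: bool) -> str:
    """Force the strict-mode model when json_schema is required but unsupported."""
    if needs_json_schema and model not in SUPPORTS_JSON_SCHEMA:
        return STRICT_JSON_MODEL
    return model
```

Requests that do not need strict JSON pass through unchanged, so this check only affects the structured-output path.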
3. Cost Control Strategy
| Strategy | Implementation | Savings |
|---|---|---|
| Prompt Caching | The User Profile & Taxonomy are identical across calls. gpt-oss-20b caches these prefixes. | ~50% |
| Small Model Offloading | Simple "Stall" messages ("Wait...", "Hello?") are routed to 8B. | ~80% |
| Rate Limiter | rate_limiter.py enforces a max budget per session. | 100% (Safety) |
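
A per-session budget could be enforced along these lines. This is a sketch under stated assumptions, not the contents of rate_limiter.py: the audit does not say whether the budget is counted in tokens, requests, or currency, so tokens are assumed here, and the `SessionBudget` class and `try_spend()` method are hypothetical names.

```python
from collections import defaultdict


class SessionBudget:
    """Tracks token spend per session against a fixed cap (assumed unit: tokens)."""

    def __init__(self, max_tokens_per_session: int) -> None:
        self.max_tokens = max_tokens_per_session
        self._spent: dict[str, int] = defaultdict(int)

    def try_spend(self, session_id: str, tokens: int) -> bool:
        """Record the spend and return True, or return False if it would exceed the cap."""
        if self._spent[session_id] + tokens > self.max_tokens:
            return False
        self._spent[session_id] += tokens
        return True
```

A caller would check `try_spend()` before each LLM request and skip or short-circuit the call when it returns False, which is what makes the cap a hard safety limit rather than a soft target.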
4. Truth Table: Reality Check
| Claim | Status | Reality |
|---|---|---|
| "Uses Llama 3.3" | CONFIRMED | Primary model for Persona generation. |
| "Uses OpenAI GPT-4" | FALLBACK ONLY | Mapped in the fallbacks, but the system primarily uses Groq (Llama) for speed. |
| "Auto-Switching" | REAL | The _switchboard() function in llm_client.py handles this logic dynamically. |