sentinel-scam-honeypo / audit /07_Model_Strategy_Switching.md
avinash-rai's picture
Deployment Ready: Fixed scam detection low confidence, added production audit report, optimized throttles
1838600
|
Raw
History Blame
2.67 kB

Topic 7: Model Strategy & Model Switching Logic

Audit Date: 2026-02-01 Auditor: Agent Antigravity Scope: LLM Integration Strategy (Real vs Fallback)


1. The Strategy: "Role-Based Cognitive Routing"

The system does not rely on a single model. Instead, it assigns Roles to optimal models based on capability, cost, and latency.

The 4 Primary Roles

Role Primary Model Why?
SMART_REASONING llama-3.3-70b-versatile Best balance of IQ (70B) and Speed. Handles Context Injection & Persona.
FAST_CHAT llama-3.1-8b-instant Ultra-low latency (<0.5s) for simple replies or stall tactics.
STRUCTURED_OUTPUT openai/gpt-oss-20b Fine-tuned for JSON Schema. Used for Scam Detection & Extraction.
SAFETY_GUARD openai/gpt-oss-safeguard-20b Specialized for Prompt Injection detection.

2. Model Switching Logic (The Switchboard)

Implemented in app/core/llm_client.py & model_registry.py

The Switchboard automatically re-routes traffic based on 3 triggers:

A. Volume Trigger (Cost/TPM)

  • Logic: If the current Request Token count > 70% of the Model's TPM (Tokens Per Minute) limit.
  • Action: Downgrade from 70B (Versatile) -> 8B (Instant) or 17B (Scout).
  • Goal: Prevent HTTP 429 Errors and keep the chat alive.

B. Context Trigger (Overflow)

  • Logic: If Conversation history > 100k tokens.
  • Action: Switch to moonshotai/kimi-k2-instruct (200k Context Window).
  • Goal: Prevent "Context Window Exceeded" errors in long fraud sessions.

C. Capability Trigger (Strict Mode)

  • Logic: If the prompt requires json_schema (Strict Mode) and the current model doesn't support it.
  • Action: Force switch to gpt-oss-20b.

3. Cost Control Strategy

Strategy Implementation Savings
Prompt Caching The User Profile & Taxonomy are identical across calls. gpt-oss-20b caches these prefixes. ~50%
Small Model Offloading Simple "Stall" messages ("Wait...", "Hello?") are routed to 8B. ~80%
Rate Limiter rate_limiter.py enforces a max budget per session. 100% (Safety)

4. Truth Table: Reality Check

Claim Reality Status
"Uses Llama 3.3" CONFIRMED Primary for Persona generation.
"Uses OpenAI GPT-4" FALLBACK ONLY Mapped in fallbacks but primarily uses Groq (Llama) for speed.
"Auto-Switching" REAL _switchboard() function in llm_client.py handles this logic dynamically.