Topic 7: Model Strategy & Model Switching Logic

Audit Date: 2026-02-01
Auditor: Agent Antigravity
Scope: LLM Integration Strategy (Real vs Fallback)

The system does not rely on a single model. Instead, it defines Roles and assigns each one to the model best suited to it in terms of capability, cost, and latency.

| Role | Primary Model | Rationale |
| --- | --- | --- |
| `SMART_REASONING` | `llama-3.3-70b-versatile` | Best balance of reasoning capability (70B parameters) and speed. Handles Context Injection & Persona. |
| `FAST_CHAT` | `llama-3.1-8b-instant` | Ultra-low latency (<0.5s) for simple replies or stall tactics. |
| `STRUCTURED_OUTPUT` | `openai/gpt-oss-20b` | Fine-tuned for JSON Schema. Used for Scam Detection & Extraction. |
| `SAFETY_GUARD` | `openai/gpt-oss-safeguard-20b` | Specialized for Prompt Injection detection. |

Implemented in `app/core/llm_client.py` & `model_registry.py`
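
The role assignments in the table above could be expressed as a small lookup table. This is a minimal sketch, not the actual contents of `model_registry.py`: the `Role` enum, the `MODEL_REGISTRY` dict, and the `model_for` helper are hypothetical names, and only the role labels and model IDs are taken from the table.

```python
from enum import Enum
from typing import Dict


class Role(str, Enum):
    """Role labels as listed in the audit table."""
    SMART_REASONING = "SMART_REASONING"
    FAST_CHAT = "FAST_CHAT"
    STRUCTURED_OUTPUT = "STRUCTURED_OUTPUT"
    SAFETY_GUARD = "SAFETY_GUARD"


# Hypothetical registry: one primary model ID per role (IDs copied from the table).
MODEL_REGISTRY: Dict[Role, str] = {
    Role.SMART_REASONING: "llama-3.3-70b-versatile",
    Role.FAST_CHAT: "llama-3.1-8b-instant",
    Role.STRUCTURED_OUTPUT: "openai/gpt-oss-20b",
    Role.SAFETY_GUARD: "openai/gpt-oss-safeguard-20b",
}


def model_for(role: Role) -> str:
    """Return the primary model ID assigned to a role."""
    return MODEL_REGISTRY[role]
```

Keeping the mapping in one place means callers ask for a role rather than a model name, so a model can be swapped without touching call sites.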

The Switchboard automatically re-routes traffic based on 3 triggers:

Logic: If the current request's token count exceeds 70% of the model's TPM (Tokens Per Minute) limit.
Action: Downgrade from 70B (Versatile) -> 8B (Instant) or 17B (Scout).
Goal: Prevent HTTP 429 errors and keep the chat alive.
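
The rate-limit trigger above can be sketched as a single comparison. This is an illustration under stated assumptions, not the code in `llm_client.py`: the function name `pick_model_for_load` and its parameters are hypothetical, the TPM limit is supplied by the caller rather than taken from any provider's published quota, and only the 8B downgrade path is shown.

```python
def pick_model_for_load(
    request_tokens: int,
    tpm_limit: int,
    primary: str = "llama-3.3-70b-versatile",
    fallback: str = "llama-3.1-8b-instant",
    threshold_percent: int = 70,
) -> str:
    """Return the fallback model when the request would consume more than
    threshold_percent of the primary model's tokens-per-minute limit."""
    if tpm_limit <= 0:
        raise ValueError("tpm_limit must be positive")
    # Integer arithmetic avoids float rounding at the exact 70% boundary.
    if request_tokens * 100 > threshold_percent * tpm_limit:
        return fallback
    return primary
```

With an assumed limit of 10,000 TPM, a 7,000-token request stays on the primary model (exactly 70% does not exceed the threshold), while a 7,001-token request is downgraded.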

Logic: If the prompt requires `json_schema` (Strict Mode) and the current model doesn't support it.
Action: Force switch to `gpt-oss-20b`.
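
The capability trigger above amounts to a membership check before dispatch. The sketch below is hypothetical: `JSON_SCHEMA_MODELS` and `resolve_model` are illustrative names, and the set lists only the one model this audit names as the strict-mode target. Which other models support `json_schema` is not stated here, so the set is an assumption to be filled in from the provider's documentation.

```python
STRUCTURED_MODEL = "openai/gpt-oss-20b"

# Assumption: models known to accept strict json_schema output.
# Only the model named in this audit is listed.
JSON_SCHEMA_MODELS = frozenset({STRUCTURED_MODEL})


def resolve_model(current_model: str, needs_json_schema: bool) -> str:
    """Force a switch to the structured-output model when strict
    json_schema is required and the current model lacks support."""
    if needs_json_schema and current_model not in JSON_SCHEMA_MODELS:
        return STRUCTURED_MODEL
    return current_model
```

Requests that do not need strict output keep their current model, so the switch only costs latency where schema enforcement is actually required.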

| Strategy | Implementation | Savings |
| --- | --- | --- |
| Prompt Caching | The User Profile & Taxonomy are identical across calls. `gpt-oss-20b` caches these prefixes. | ~50% |
| Small Model Offloading | Simple "Stall" messages ("Wait...", "Hello?") are routed to `8B`. | ~80% |
| Rate Limiter | `rate_limiter.py` enforces a max budget per session. | 100% (Safety) |
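
A per-session budget like the one attributed to `rate_limiter.py` could look roughly like this. The audit does not say whether the budget is counted in tokens, requests, or cost, so this sketch assumes tokens; the `SessionBudget` class and its method names are hypothetical.

```python
from collections import defaultdict
from typing import DefaultDict


class SessionBudget:
    """Track token spend per session and refuse requests over the cap."""

    def __init__(self, max_tokens_per_session: int) -> None:
        self.max_tokens = max_tokens_per_session
        self._used: DefaultDict[str, int] = defaultdict(int)

    def try_spend(self, session_id: str, tokens: int) -> bool:
        """Record the spend and return True if it fits the session's
        remaining budget; otherwise leave the total unchanged and return False."""
        if self._used[session_id] + tokens > self.max_tokens:
            return False
        self._used[session_id] += tokens
        return True

    def remaining(self, session_id: str) -> int:
        """Tokens the session may still spend."""
        return self.max_tokens - self._used[session_id]
```

A caller would check `try_spend` before each LLM call and fall back to a canned reply when it returns `False`, which gives a hard ceiling on what any single session can cost.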

| Claim | Reality | Details |
| --- | --- | --- |
| "Uses Llama 3.3" | CONFIRMED | Primary for Persona generation. |
| "Uses OpenAI GPT-4" | FALLBACK ONLY | Mapped in `fallbacks`, but the system primarily uses Groq (Llama) for speed. |
| "Auto-Switching" | REAL | The `_switchboard()` function in `llm_client.py` handles this logic dynamically. |