Commit be00444 by ZeroTsai0308 · verified · 1 Parent(s): 56e0cee

docs: update README for 6-agent architecture with reporting_agent

  1. README.md +44 -22
README.md CHANGED
@@ -23,40 +23,45 @@ short_description: AI SRE Agent for incident analysis and RCA

  # 🔧 SRE Agent: AI-Powered Site Reliability Engineering

- An intelligent, multi-agent SRE system built with [smolagents](https://huggingface.co/docs/smolagents) that can investigate incidents, analyze time-series metrics, perform root cause analysis, parse logs, and generate incident reports.

  ## 🏗️ Architecture

  ```
- ┌────────────────────────────────────────────────────────────────────────┐
- │                        SRE Manager (CodeAgent)                         │
- │        Orchestrates the full incident investigation lifecycle         │
- │             planning_interval=3 for periodic re-assessment             │
- ├──────────────┬─────────────────┬──────────────┬────────────────────────┤
- │ Metrics Agent│ Log Agent       │ RCA Agent    │ Infra Agent            │
- │(ToolCalling) │ (ToolCalling)   │(ToolCalling) │ (ToolCalling)          │
- ├──────────────┼─────────────────┼──────────────┼────────────────────────┤
- │ • Anomaly    │ • Log Parser    │ • Correlator │ • Resource Util.       │
- │   Detector   │ • Log Anomaly   │ • Dependency │ • Health Checker       │
- │ • Forecaster │   Detector      │   Analyzer   │ • Alert Summary        │
- │ • Correlator │ • Pattern       │ • Change     │ • SLO Checker          │
- │ • Stats      │   Extractor     │   Correlation│ • Runbook Search       │
- │              │                 │ • Incident   │                        │
- │              │                 │   Report Gen │                        │
- └──────────────┴─────────────────┴──────────────┴────────────────────────┘
  ```

  ### Design Principles (Literature-Backed)

  | Principle | Source | Implementation |
  |-----------|--------|----------------|
- | Multi-agent collaboration | [OpsAgent](https://arxiv.org/abs/2510.24145) | Manager CodeAgent + 4 specialist ToolCallingAgents |
  | Recursive trace traversal | [AMER-RCL](https://arxiv.org/abs/2601.02732) | ServiceDependencyAnalyzer with BFS topology walk |
  | Fast/Slow detection cascade | [CloudAnoBench](https://arxiv.org/abs/2508.01844) | Z-score (fast) → Isolation Forest (slow) cascade |
  | Statistical pre-filter → LLM | [RCACopilot](https://arxiv.org/abs/2507.03224) | Tools do statistical analysis, LLM interprets results |
  | Anti-stalling planning | [Reasoning Failure Taxonomy](https://arxiv.org/abs/2601.22208) | planning_interval=3, max_steps limits, cross-modal checks |

- ## 🧰 16 Specialized SRE Tools

  ### 📊 Time-Series Analysis (4 tools)
  | Tool | Description |
@@ -73,13 +78,12 @@ An intelligent, multi-agent SRE system built with [smolagents](https://huggingfa
  | `log_anomaly_detector` | Template-based anomaly detection: finds new error patterns and frequency shifts |
  | `log_pattern_extractor` | Extracts error codes, exception types, service names, and key phrases for RCA |

- ### 🔍 Root Cause Analysis (4 tools)
  | Tool | Description |
  |------|-------------|
  | `rca_correlator` | Multi-signal correlation engine: temporal alignment, hypothesis generation, confidence scoring |
  | `service_dependency_analyzer` | Topology analysis with blast radius calculation, SPOFs, and investigation ordering |
  | `change_correlator` | Correlates deployments, config changes, and scaling events with incident timing |
- | `incident_report_generator` | Structured incident reports with timeline, impact assessment, and follow-up items |

  ### ⚙️ Infrastructure & Alerting (5 tools)
  | Tool | Description |
@@ -90,6 +94,14 @@ An intelligent, multi-agent SRE system built with [smolagents](https://huggingfa
  | `resource_utilization` | Pod-level CPU/memory/disk/network metrics with aggregate health scoring |
  | `service_health_checker` | Comprehensive health check: endpoint, dependencies, metrics, deployment, certificates |

  ## 💬 Example Prompts

  Try these to see the agent in action:
@@ -112,6 +124,15 @@ Try these to see the agent in action:
  6. **Log analysis:**
  > "Parse the logs for the payment-service, focusing on ERROR and CRITICAL messages. Are there any error bursts?"

  ## 🔬 Technical Details

  - **Agent Framework:** [smolagents](https://huggingface.co/docs/smolagents) v1.24+
@@ -119,6 +140,7 @@ Try these to see the agent in action:
  - **Anomaly Detection:** Z-score (σ > 3) + Isolation Forest (contamination=0.05) with consensus scoring
  - **Forecasting:** Holt's Exponential Smoothing with configurable confidence intervals
  - **Correlation:** Pearson cross-correlation with lag analysis (max_lag configurable)
  - **Simulated Environment:** All tools support `'auto'` mode with realistic microservice data generation
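
The forecasting bullet above names Holt's exponential smoothing. As a rough illustration of that method (not the repository's implementation), the sketch below uses the standard level and trend update equations. The function name `holt_forecast` and the default `alpha`, `beta`, and `horizon` values are assumptions chosen for the example, and confidence intervals are left out.

```python
def holt_forecast(series, alpha=0.5, beta=0.3, horizon=3):
    """Forecast `horizon` steps ahead with Holt's linear exponential smoothing.

    Hypothetical helper for illustration; parameter defaults are assumptions.
    """
    if len(series) < 2:
        raise ValueError("need at least two observations")
    # Initialise the level with the first point and the trend with the first difference.
    level = series[0]
    trend = series[1] - series[0]
    for value in series[1:]:
        previous_level = level
        # Level: blend the new observation with the previous one-step forecast.
        level = alpha * value + (1 - alpha) * (previous_level + trend)
        # Trend: blend the latest change in level with the previous trend.
        trend = beta * (level - previous_level) + (1 - beta) * trend
    # Extrapolate the final level along the final trend.
    return [level + step * trend for step in range(1, horizon + 1)]
```

On a perfectly linear series the level and trend stay on the line, so the forecast simply continues it; noisy metrics would be smoothed according to `alpha` and `beta`.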

  ## 📚 References
@@ -128,5 +150,5 @@ Try these to see the agent in action:
  - [CloudAnoBench: Context-Aware Anomaly Detection](https://arxiv.org/abs/2508.01844): Fast/slow detection + symbolic verifier
  - [OpsAgent: Self-Evolving Multi-Agent](https://arxiv.org/abs/2510.24145): Training-free data processor + multi-agent RCA
  - [LLM Reasoning Failures for RCA](https://arxiv.org/abs/2601.22208): 16-category failure taxonomy
- - [ITBench: SRE Benchmark](https://arxiv.org/abs/2502.05352): Real-world SRE scenario evaluation
  - [Time-Series Anomaly Detection Survey](https://arxiv.org/abs/2412.20512): Comprehensive method taxonomy
 

  # 🔧 SRE Agent: AI-Powered Site Reliability Engineering

+ An intelligent, multi-agent SRE system built with [smolagents](https://huggingface.co/docs/smolagents) that can investigate incidents, analyze time-series metrics, perform root cause analysis, parse logs, and generate incident reports, executive summaries, and weekly SRE reports.

  ## 🏗️ Architecture

  ```
+ ┌───────────────────────────────────────────────────────────────────────────────┐
+ │                            SRE Manager (CodeAgent)                            │
+ │      Orchestrates the full incident investigation & reporting lifecycle      │
+ │                planning_interval=3 for periodic re-assessment                 │
+ ├──────────┬──────────────┬──────────────┬──────────────┬───────────────────────┤
+ │ Metrics  │ Log Agent    │ RCA Agent    │ Infra Agent  │ Reporting Agent       │
+ │ Agent    │ (ToolCalling)│ (ToolCalling)│ (ToolCalling)│ (ToolCalling)         │
+ │(ToolCall)│              │              │              │                       │
+ ├──────────┼──────────────┼──────────────┼──────────────┼───────────────────────┤
+ │ • Anomaly│ • Log Parser │ • Correlator │ • Resource   │ • Incident Report     │
+ │  Detector│ • Log Anomaly│ • Dependency │   Util.      │ • Executive Summary   │
+ │ • Fore-  │   Detector   │   Analyzer   │ • Health     │ • Weekly Report       │
+ │   caster │ • Pattern    │ • Change     │   Checker    │ • Report Formatter    │
+ │ • Correl.│   Extractor  │   Correlation│ • Alert      │   (MD/JSON/HTML/      │
+ │ • Stats  │              │              │   Summary    │    Slack/Summary)     │
+ │          │              │              │ • SLO Checker│                       │
+ │          │              │              │ • Runbook    │                       │
+ └──────────┴──────────────┴──────────────┴──────────────┴───────────────────────┘
  ```

+ **6 Agents** (1 Manager + 5 Workers) with **19 specialized tools**.
+
  ### Design Principles (Literature-Backed)

  | Principle | Source | Implementation |
  |-----------|--------|----------------|
+ | Multi-agent collaboration | [OpsAgent](https://arxiv.org/abs/2510.24145) | Manager CodeAgent + 5 specialist ToolCallingAgents |
  | Recursive trace traversal | [AMER-RCL](https://arxiv.org/abs/2601.02732) | ServiceDependencyAnalyzer with BFS topology walk |
  | Fast/Slow detection cascade | [CloudAnoBench](https://arxiv.org/abs/2508.01844) | Z-score (fast) → Isolation Forest (slow) cascade |
  | Statistical pre-filter → LLM | [RCACopilot](https://arxiv.org/abs/2507.03224) | Tools do statistical analysis, LLM interprets results |
  | Anti-stalling planning | [Reasoning Failure Taxonomy](https://arxiv.org/abs/2601.22208) | planning_interval=3, max_steps limits, cross-modal checks |
+ | Separation of concerns | Single Responsibility Principle (SRP) | Dedicated reporting agent separated from RCA and remediation |
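
The "BFS topology walk" row above can be pictured with a small sketch. This is not the project's `ServiceDependencyAnalyzer`; the `DEPENDENCIES` map, the service names, and the `blast_radius` function are all hypothetical, and the code only illustrates how a breadth-first walk over inverted dependency edges yields the set of impacted callers, nearest first.

```python
from collections import deque

# Hypothetical topology: each service maps to the services it calls.
DEPENDENCIES = {
    "api-gateway": ["checkout-service", "search-service"],
    "checkout-service": ["payment-service", "inventory-service"],
    "search-service": ["inventory-service"],
    "payment-service": ["postgres"],
    "inventory-service": ["postgres"],
    "postgres": [],
}


def blast_radius(failed, dependencies):
    """Return the services that transitively depend on `failed`, nearest first."""
    # Invert the edges so the walk goes from a dependency to its callers.
    callers = {name: [] for name in dependencies}
    for service, deps in dependencies.items():
        for dep in deps:
            callers.setdefault(dep, []).append(service)

    visited = {failed}
    queue = deque([failed])
    impacted = []
    while queue:
        current = queue.popleft()
        for caller in callers.get(current, []):
            if caller not in visited:
                visited.add(caller)
                impacted.append(caller)
                queue.append(caller)
    return impacted
```

Because the walk is breadth-first, direct callers appear before indirect ones, which is one plausible way to derive an investigation order from the same traversal.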

+ ## 🧰 19 Specialized SRE Tools

  ### 📊 Time-Series Analysis (4 tools)
  | Tool | Description |
 
  | `log_anomaly_detector` | Template-based anomaly detection: finds new error patterns and frequency shifts |
  | `log_pattern_extractor` | Extracts error codes, exception types, service names, and key phrases for RCA |

+ ### 🔍 Root Cause Analysis (3 tools)
  | Tool | Description |
  |------|-------------|
  | `rca_correlator` | Multi-signal correlation engine: temporal alignment, hypothesis generation, confidence scoring |
  | `service_dependency_analyzer` | Topology analysis with blast radius calculation, SPOFs, and investigation ordering |
  | `change_correlator` | Correlates deployments, config changes, and scaling events with incident timing |

  ### ⚙️ Infrastructure & Alerting (5 tools)
  | Tool | Description |
 
  | `resource_utilization` | Pod-level CPU/memory/disk/network metrics with aggregate health scoring |
  | `service_health_checker` | Comprehensive health check: endpoint, dependencies, metrics, deployment, certificates |

+ ### 📄 Reporting (4 tools) - NEW
+ | Tool | Description |
+ |------|-------------|
+ | `incident_report_generator` | Structured incident post-mortem with timeline, RCA, impact assessment, and follow-up items |
+ | `executive_summary` | Concise business-impact summaries for VP/C-level/board audiences |
+ | `sre_weekly_report` | Comprehensive weekly operational reports: SLO trends, incidents, MTTR, alert noise, deployments, capacity forecasting, and AI-generated recommendations |
+ | `report_formatter` | Multi-format output: Markdown, JSON, HTML, Slack mrkdwn, or TL;DR summary |
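
The diff does not show how `report_formatter` is implemented, so the following is only a sketch of the general idea: one structured report rendered into several of the listed formats. The `format_report` function, the report fields (`title`, `summary`, `follow_ups`), and the sample data are hypothetical; HTML and TL;DR output are omitted to keep it short.

```python
import json


def format_report(report, fmt="markdown"):
    """Render a report dict as Markdown, JSON, or Slack mrkdwn (illustrative only)."""
    title = report["title"]
    items = report.get("follow_ups", [])
    if fmt == "json":
        return json.dumps(report, indent=2, sort_keys=True)
    if fmt == "markdown":
        lines = [f"# {title}", "", report["summary"], "", "## Follow-up items"]
        lines += [f"- {item}" for item in items]
        return "\n".join(lines)
    if fmt == "slack":
        # Slack mrkdwn marks bold text with single asterisks and has no heading syntax.
        lines = [f"*{title}*", report["summary"]]
        lines += [f"• {item}" for item in items]
        return "\n".join(lines)
    raise ValueError(f"unsupported format: {fmt}")


# Hypothetical sample report used only to exercise the function.
SAMPLE_REPORT = {
    "title": "payment-service outage",
    "summary": "Checkout requests failed until the faulty deployment was rolled back.",
    "follow_ups": ["Add a canary stage", "Alert on error-rate bursts"],
}
```

Keeping the report as plain data and formatting it last means the same content can feed a wiki page, a chat message, or a machine-readable archive without re-running the analysis.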
+
  ## 💬 Example Prompts

  Try these to see the agent in action:
 
  6. **Log analysis:**
  > "Parse the logs for the payment-service, focusing on ERROR and CRITICAL messages. Are there any error bursts?"

+ 7. **Executive summary:** (NEW)
+ > "Generate an executive summary of the current incident for the VP of Engineering."
+
+ 8. **Weekly SRE report:** (NEW)
+ > "Generate the SRE weekly report for this week, including SLO compliance, incidents, alert noise analysis, and capacity forecasting."
+
+ 9. **Multi-format report:** (NEW)
+ > "Generate an incident report for the payment-service outage and format it as HTML for the internal wiki."
+
  ## 🔬 Technical Details

  - **Agent Framework:** [smolagents](https://huggingface.co/docs/smolagents) v1.24+
 
  - **Anomaly Detection:** Z-score (σ > 3) + Isolation Forest (contamination=0.05) with consensus scoring
  - **Forecasting:** Holt's Exponential Smoothing with configurable confidence intervals
  - **Correlation:** Pearson cross-correlation with lag analysis (max_lag configurable)
+ - **Report Formats:** Markdown, JSON, HTML, Slack mrkdwn, TL;DR summary
  - **Simulated Environment:** All tools support `'auto'` mode with realistic microservice data generation
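
Two of the bullets above, the Z-score stage of anomaly detection and Pearson cross-correlation with lag analysis, can be sketched with the standard library alone. This is an assumption-laden illustration rather than the tools' code: the function names `zscore_anomalies`, `pearson`, and `best_lag` are hypothetical, and the Isolation Forest stage and consensus scoring are not covered.

```python
import math
from statistics import mean, pstdev


def zscore_anomalies(values, threshold=3.0):
    """Return indices whose absolute z-score exceeds `threshold`."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return []  # A constant series has no outliers by this measure.
    return [i for i, v in enumerate(values) if abs(v - mu) / sigma > threshold]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def best_lag(xs, ys, max_lag=5):
    """Return (lag, r) with the largest |r| when comparing xs[t] to ys[t + lag]."""
    best = (0, 0.0)
    for lag in range(max_lag + 1):
        a = xs[: len(xs) - lag]
        b = ys[lag:]
        if len(a) < 3:
            break  # Too few overlapping points to correlate meaningfully.
        r = pearson(a, b)
        if abs(r) > abs(best[1]):
            best = (lag, r)
    return best
```

A positive best lag suggests that changes in the first series tend to precede changes in the second, which is the kind of temporal hint an RCA step could use when ordering hypotheses.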

  ## 📚 References
 
  - [CloudAnoBench: Context-Aware Anomaly Detection](https://arxiv.org/abs/2508.01844): Fast/slow detection + symbolic verifier
  - [OpsAgent: Self-Evolving Multi-Agent](https://arxiv.org/abs/2510.24145): Training-free data processor + multi-agent RCA
  - [LLM Reasoning Failures for RCA](https://arxiv.org/abs/2601.22208): 16-category failure taxonomy
+ - [RCACopilot: Root Cause Analysis](https://arxiv.org/abs/2507.03224): Statistical pre-filter + LLM reasoning
  - [Time-Series Anomaly Detection Survey](https://arxiv.org/abs/2412.20512): Comprehensive method taxonomy