docs: update README for 6-agent architecture with reporting_agent
README.md
@@ -23,40 +23,45 @@ short_description: AI SRE Agent for incident analysis and RCA

# SRE Agent - AI-Powered Site Reliability Engineering

An intelligent, multi-agent SRE system built with [smolagents](https://huggingface.co/docs/smolagents) that can investigate incidents, analyze time-series metrics, perform root cause analysis, parse logs, and generate incident reports, executive summaries, and weekly SRE reports.

## Architecture

```
┌──────────────────────────────────────────────────────────────────────────────┐
│                           SRE Manager (CodeAgent)                            │
│      Orchestrates the full incident investigation & reporting lifecycle      │
│                planning_interval=3 for periodic re-assessment                │
├──────────┬──────────────┬──────────────┬──────────────┬──────────────────────┤
│ Metrics  │ Log Agent    │ RCA Agent    │ Infra Agent  │ Reporting Agent      │
│ Agent    │ (ToolCalling)│ (ToolCalling)│ (ToolCalling)│ (ToolCalling)        │
│(ToolCall)│              │              │              │                      │
├──────────┼──────────────┼──────────────┼──────────────┼──────────────────────┤
│ • Anomaly│ • Log Parser │ • Correlator │ • Resource   │ • Incident Report    │
│  Detector│ • Log Anomaly│ • Dependency │   Util.      │ • Executive Summary  │
│ • Fore-  │   Detector   │   Analyzer   │ • Health     │ • Weekly Report      │
│   caster │ • Pattern    │ • Change     │   Checker    │ • Report Formatter   │
│ • Correl.│   Extractor  │   Correlation│ • Alert      │   (MD/JSON/HTML/     │
│ • Stats  │              │              │   Summary    │    Slack/Summary)    │
│          │              │              │ • SLO Checker│                      │
│          │              │              │ • Runbook    │                      │
└──────────┴──────────────┴──────────────┴──────────────┴──────────────────────┘
```

**6 Agents** (1 Manager + 5 Workers) with **19 specialized tools**.
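
As a rough illustration of the layout above, the agent-to-tool mapping can be written down as plain data and checked against the stated totals. This is only a sketch, not the project's code: `reporting_agent` comes from the commit title and the unmarked tool names from the tool tables, while the other agent keys and every name marked `hypothetical` are placeholders inferred from the diagram labels.

```python
# Agent-to-tool layout transcribed from the architecture diagram.
# Entries marked "hypothetical" are placeholder identifiers, not confirmed names.
WORKER_TOOLS = {
    "metrics_agent": [  # hypothetical agent key
        "anomaly_detector",      # hypothetical
        "forecaster",            # hypothetical
        "correlation_analyzer",  # hypothetical
        "statistics_summary",    # hypothetical
    ],
    "log_agent": [  # hypothetical agent key
        "log_parser",            # hypothetical
        "log_anomaly_detector",
        "log_pattern_extractor",
    ],
    "rca_agent": [  # hypothetical agent key
        "rca_correlator",
        "service_dependency_analyzer",
        "change_correlator",
    ],
    "infra_agent": [  # hypothetical agent key
        "resource_utilization",
        "service_health_checker",
        "alert_summary",         # hypothetical
        "slo_checker",           # hypothetical
        "runbook_lookup",        # hypothetical
    ],
    "reporting_agent": [
        "incident_report_generator",
        "executive_summary",
        "sre_weekly_report",
        "report_formatter",
    ],
}


def summarize(worker_tools):
    """Return (agent count including the manager, total tool count)."""
    agents = 1 + len(worker_tools)  # one manager plus the workers
    tools = sum(len(names) for names in worker_tools.values())
    return agents, tools


def owner_of(tool_name, worker_tools=WORKER_TOOLS):
    """Return the worker that owns a tool, or None if it is unknown."""
    for agent, names in worker_tools.items():
        if tool_name in names:
            return agent
    return None
```

A check like `summarize(WORKER_TOOLS)` is a cheap way to keep the headline numbers in sync with the diagram when a worker or tool is added.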

### Design Principles (Literature-Backed)

| Principle | Source | Implementation |
|-----------|--------|----------------|
| Multi-agent collaboration | [OpsAgent](https://arxiv.org/abs/2510.24145) | Manager CodeAgent + 5 specialist ToolCallingAgents |
| Recursive trace traversal | [AMER-RCL](https://arxiv.org/abs/2601.02732) | ServiceDependencyAnalyzer with BFS topology walk |
| Fast/Slow detection cascade | [CloudAnoBench](https://arxiv.org/abs/2508.01844) | Z-score (fast) → Isolation Forest (slow) cascade |
| Statistical pre-filter → LLM | [RCACopilot](https://arxiv.org/abs/2507.03224) | Tools do statistical analysis, LLM interprets results |
| Anti-stalling planning | [Reasoning Failure Taxonomy](https://arxiv.org/abs/2601.22208) | planning_interval=3, max_steps limits, cross-modal checks |
| Separation of concerns | Single Responsibility Principle (SRP) | Dedicated reporting agent separated from RCA and remediation |
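
The fast/slow cascade in the table can be sketched as a cheap z-score pass that gates a more expensive second detector. The example below is a minimal illustration under stated assumptions: to stay within the standard library it swaps Isolation Forest for a median-absolute-deviation (MAD) check, and the function names, thresholds, and the "flagged by both" consensus rule are hypothetical rather than taken from the project.

```python
import statistics


def zscore_candidates(values, threshold=3.0):
    """Fast path: indices whose absolute z-score exceeds the threshold."""
    mean = statistics.fmean(values)
    stdev = statistics.pstdev(values)
    if stdev == 0:
        return []
    return [i for i, v in enumerate(values) if abs(v - mean) / stdev > threshold]


def mad_outliers(values, threshold=3.5):
    """Slow-path stand-in (NOT Isolation Forest): modified z-score via MAD."""
    median = statistics.median(values)
    mad = statistics.median(abs(v - median) for v in values)
    if mad == 0:
        return []
    return [i for i, v in enumerate(values)
            if 0.6745 * abs(v - median) / mad > threshold]


def cascade_detect(values, slow_detector=mad_outliers):
    """Run the slow detector only when the fast pass finds candidates."""
    candidates = zscore_candidates(values)
    if not candidates:
        return []  # nothing suspicious, so the slow path is skipped
    confirmed = set(slow_detector(values))
    # Consensus (an assumption): keep points flagged by both detectors.
    return [i for i in candidates if i in confirmed]
```

Passing a different `slow_detector` callable is where a real Isolation Forest model would plug in.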

## 19 Specialized SRE Tools

### Time-Series Analysis (4 tools)
| Tool | Description |

@@ -73,13 +78,12 @@

| `log_anomaly_detector` | Template-based anomaly detection - finds new error patterns and frequency shifts |
| `log_pattern_extractor` | Extracts error codes, exception types, service names, and key phrases for RCA |
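
One way to read "template-based anomaly detection" is: mask the variable parts of each message to get a template, then compare template counts between a baseline window and the current window. The sketch below illustrates that idea only; the masking rules, the `ratio` threshold, and the function names are assumptions, not the behaviour of `log_anomaly_detector`.

```python
import re
from collections import Counter

_HEX = re.compile(r"\b0x[0-9a-fA-F]+\b")
_NUMBER = re.compile(r"\d+(?:\.\d+)?")


def to_template(message):
    """Mask hex literals and numbers so similar messages share a template."""
    message = _HEX.sub("<HEX>", message)
    return _NUMBER.sub("<NUM>", message)


def compare_windows(baseline, current, ratio=3.0):
    """Report templates that are new, or at least `ratio` times more frequent.

    Raw counts are compared, so both windows are assumed to cover
    the same duration.
    """
    base = Counter(to_template(m) for m in baseline)
    cur = Counter(to_template(m) for m in current)
    new_templates = sorted(t for t in cur if t not in base)
    shifted = sorted(t for t in cur if t in base and cur[t] >= ratio * base[t])
    return {"new": new_templates, "shifted": shifted}
```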

### Root Cause Analysis (3 tools)
| Tool | Description |
|------|-------------|
| `rca_correlator` | Multi-signal correlation engine - temporal alignment, hypothesis generation, confidence scoring |
| `service_dependency_analyzer` | Topology analysis with blast radius calculation, SPOFs, and investigation ordering |
| `change_correlator` | Correlates deployments, config changes, and scaling events with incident timing |
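
A blast radius calculation over a service topology can be sketched as a breadth-first walk over reversed dependency edges: starting from the failed service, visit everything that depends on it directly or transitively. This is an illustrative sketch, not the `service_dependency_analyzer` implementation; the function name, the input shape, and the sample topology are assumptions.

```python
from collections import deque


def blast_radius(dependencies, failed_service):
    """Map each impacted service to its hop distance from the failure.

    `dependencies` maps a service to the list of services it calls.
    """
    # Invert the edges so we can ask "who depends on X?".
    dependents = {}
    for service, callees in dependencies.items():
        for callee in callees:
            dependents.setdefault(callee, []).append(service)

    hops = {failed_service: 0}
    queue = deque([failed_service])
    while queue:
        current = queue.popleft()
        for upstream in dependents.get(current, []):
            if upstream not in hops:  # first visit is the shortest path in BFS
                hops[upstream] = hops[current] + 1
                queue.append(upstream)

    del hops[failed_service]  # report only the services affected by the failure
    return hops


# Hypothetical topology for illustration.
TOPOLOGY = {
    "api-gateway": ["checkout-service", "catalog-service"],
    "checkout-service": ["payment-service", "inventory-service"],
    "payment-service": ["postgres"],
    "catalog-service": ["postgres"],
    "inventory-service": [],
}
```

Sorting the result by hop count gives one plausible investigation order: closest dependents first.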

### Infrastructure & Alerting (5 tools)
| Tool | Description |

@@ -90,6 +94,14 @@

| `resource_utilization` | Pod-level CPU/memory/disk/network metrics with aggregate health scoring |
| `service_health_checker` | Comprehensive health check: endpoint, dependencies, metrics, deployment, certificates |

### Reporting (4 tools) - NEW
| Tool | Description |
|------|-------------|
| `incident_report_generator` | Structured incident post-mortem with timeline, RCA, impact assessment, and follow-up items |
| `executive_summary` | Concise business-impact summaries for VP/C-level/board audiences |
| `sre_weekly_report` | Comprehensive weekly operational reports: SLO trends, incidents, MTTR, alert noise, deployments, capacity forecasting, and AI-generated recommendations |
| `report_formatter` | Multi-format output: Markdown, JSON, HTML, Slack mrkdwn, or TL;DR summary |
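
Multi-format output of this kind usually comes down to rendering one structured report several ways. The sketch below shows that shape for the five formats listed; the `format_report` name, the `title`/`findings` schema, and the format keys are hypothetical and are not the `report_formatter` tool's actual interface.

```python
import html
import json


def format_report(report, fmt="markdown"):
    """Render a report dict with 'title' and 'findings' in the given format."""
    title = report["title"]
    findings = report["findings"]
    if fmt == "markdown":
        return "\n".join([f"# {title}", ""] + [f"- {item}" for item in findings])
    if fmt == "json":
        return json.dumps(report, indent=2)
    if fmt == "html":
        # Escape user-provided text so it cannot inject markup.
        items = "".join(f"<li>{html.escape(item)}</li>" for item in findings)
        return f"<h1>{html.escape(title)}</h1><ul>{items}</ul>"
    if fmt == "slack":
        # Slack mrkdwn marks bold text with single asterisks.
        return "\n".join([f"*{title}*"] + [f"• {item}" for item in findings])
    if fmt == "summary":
        # TL;DR: title plus the first finding, if there is one.
        return f"{title}: {findings[0]}" if findings else title
    raise ValueError(f"unsupported format: {fmt}")
```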

## Example Prompts

Try these to see the agent in action:

@@ -112,6 +124,15 @@

6. **Log analysis:**
   > "Parse the logs for the payment-service, focusing on ERROR and CRITICAL messages. Are there any error bursts?"

7. **Executive summary:** (NEW)
   > "Generate an executive summary of the current incident for the VP of Engineering."

8. **Weekly SRE report:** (NEW)
   > "Generate the SRE weekly report for this week, including SLO compliance, incidents, alert noise analysis, and capacity forecasting."

9. **Multi-format report:** (NEW)
   > "Generate an incident report for the payment-service outage and format it as HTML for the internal wiki."

## Technical Details

- **Agent Framework:** [smolagents](https://huggingface.co/docs/smolagents) v1.24+

@@ -119,6 +140,7 @@

- **Anomaly Detection:** Z-score (σ > 3) + Isolation Forest (contamination=0.05) with consensus scoring
- **Forecasting:** Holt's Exponential Smoothing with configurable confidence intervals
- **Correlation:** Pearson cross-correlation with lag analysis (max_lag configurable)
- **Report Formats:** Markdown, JSON, HTML, Slack mrkdwn, TL;DR summary
- **Simulated Environment:** All tools support `'auto'` mode with realistic microservice data generation
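
For readers unfamiliar with Holt's method, the point-forecast part can be sketched in a few lines: a smoothed level and a smoothed trend are updated per observation, and the forecast extrapolates the trend. This is a minimal sketch; the function name, the default `alpha`/`beta` values, and the initialization are assumptions, and confidence intervals are left out.

```python
def holt_forecast(values, horizon, alpha=0.5, beta=0.3):
    """Holt's linear (double) exponential smoothing, point forecasts only."""
    if len(values) < 2:
        raise ValueError("need at least two observations")
    # Simple initialization: first value as level, first difference as trend.
    level = values[0]
    trend = values[1] - values[0]
    for y in values[1:]:
        prev_level = level
        level = alpha * y + (1 - alpha) * (level + trend)
        trend = beta * (level - prev_level) + (1 - beta) * trend
    # h-step-ahead forecast extrapolates the last level along the last trend.
    return [level + h * trend for h in range(1, horizon + 1)]
```

On a perfectly linear series the level and trend stay exact, so the forecast simply continues the line; on noisy data `alpha` and `beta` control how quickly each component reacts.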
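
Lag analysis with Pearson correlation can be illustrated by sliding one series against the other and keeping the shift with the strongest correlation. The sketch below assumes two equal-length series; `best_lag` and its sign convention (a positive lag means `y` trails `x`) are hypothetical, and only `max_lag` mirrors a parameter named above.

```python
import math


def pearson(a, b):
    """Pearson correlation of two equal-length sequences (0.0 if constant)."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((p - mean_a) * (q - mean_b) for p, q in zip(a, b))
    var_a = sum((p - mean_a) ** 2 for p in a)
    var_b = sum((q - mean_b) ** 2 for q in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / math.sqrt(var_a * var_b)


def best_lag(x, y, max_lag=5):
    """Return (lag, r) with the largest |r| between x[t] and y[t + lag]."""
    best = (0, 0.0)
    for lag in range(-max_lag, max_lag + 1):
        if lag >= 0:
            a, b = x[:len(x) - lag], y[lag:]
        else:
            a, b = x[-lag:], y[:len(y) + lag]
        if len(a) < 3:
            continue  # too few overlapping points to be meaningful
        r = pearson(a, b)
        if abs(r) > abs(best[1]):
            best = (lag, r)
    return best
```

A strong correlation at a positive lag suggests that changes in `x` precede changes in `y`, which is a useful hint for ordering hypotheses but not proof of causation.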
## References

@@ -128,5 +150,5 @@

- [CloudAnoBench: Context-Aware Anomaly Detection](https://arxiv.org/abs/2508.01844) - Fast/slow detection + symbolic verifier
- [OpsAgent: Self-Evolving Multi-Agent](https://arxiv.org/abs/2510.24145) - Training-free data processor + multi-agent RCA
- [LLM Reasoning Failures for RCA](https://arxiv.org/abs/2601.22208) - 16-category failure taxonomy
- [RCACopilot: Root Cause Analysis](https://arxiv.org/abs/2507.03224) - Statistical pre-filter + LLM reasoning
- [Time-Series Anomaly Detection Survey](https://arxiv.org/abs/2412.20512) - Comprehensive method taxonomy