Agentic AI Incident Resolution System
Incident investigation across 8 diagnostic dimensions took SRE teams hours of manual SSH sessions. A single AI agent with 12 tools and parallel fault tree analysis resolves it in minutes.
The Problem
Every incident followed the same pattern: an SRE would SSH into a server, run diagnostic commands one by one, correlate outputs across systems, then repeat for each hypothesis. OS Resources, capacity, network, middleware, config changes, dependent services, security, scheduled jobs — each dimension investigated sequentially.
This required deep platform-specific knowledge. AIX uses errpt and svmon where RHEL uses journalctl and free. Db2 commands chain with && but never semicolons inside db2 "...". MQ diagnostics pipe through runmqsc. Each platform has its own syntax, failure patterns, and thresholds — knowledge trapped in individual engineers' heads.
No parallel execution, no reuse of investigation patterns, no structured correlation across findings. A single incident touching multiple middleware layers could take hours of manual work.
Architecture
The system runs as three independently deployed services. The SRE Deep Agents service hosts a Slack bot and a single ReAct agent. The SRE Tools API provides a FastAPI backend handling external integrations, vector search, and AI enrichment. The SSH Tool Server manages persistent SSH connection pools for high-performance remote command execution.
Key Decision
We chose a single agent over a supervisor-specialist architecture. A supervisor adds intent classification latency and introduces misrouting risk — if the classifier picks the wrong specialist, the entire response fails. With a single create_react_agent seeing all 12 tools, the LLM's own reasoning handles routing implicitly. The cost is a larger tool namespace per LLM call, but with 12 tools this is well within context limits. Debugging is simpler too — one agent trace vs. supervisor + specialist traces.
Single Agent Design
The agent is a LangGraph create_react_agent with all 12 tools available simultaneously. Intent is detected from message context — an INC number triggers investigation, server names trigger command execution, symptoms trigger troubleshooting, how-to questions trigger documentation search. No separate classifier, no handoffs.
| Tool | Purpose |
|---|---|
| write_todos | Plan multi-step tasks (3+ steps) |
| load_skill | Load platform-specific runbook (RHEL, AIX, Db2, etc.) |
| execute_readonly_command | Run shell commands on remote servers (guardrailed) |
| get_system_report | Detect OS, middleware inventory, disk/memory |
| multi_server_fanout | Same command across up to 5 servers |
| research_documentation | RAG search across IBM/Red Hat/internal docs |
| fetch_jira_issue | Jira issue details |
| fetch_servicenow_incident | ServiceNow incident details |
| get_server_relationships | CMDB upstream/downstream dependencies |
| get_instana_host_events | Last 24h monitoring events |
| check_user_authorization | RBAC verification |
| run_investigation | Launch parallel FTA pipeline (8 branches) |
Every server interaction follows the same workflow: the agent detects the OS, loads platform-specific skills, executes commands from the loaded runbook, and analyzes the results before deciding whether more data is needed.
A pre-model hook (sre_pre_model_hook) runs before every LLM call, injecting 8 conditional context elements: auto-loaded skills, skills catalog, planning instructions, current todos, active skills reminder, think-after-commands guard, documentation guard, and repeated tool call prevention. A post-model hook (fix_text_tool_calls) handles Llama 4 Maverick's tendency to emit tool calls as JSON text rather than structured objects — extracting, converting, deduplicating, and filtering them.
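The post-model repair step can be pictured roughly as follows. This is a minimal sketch, not the actual fix_text_tool_calls implementation: the function name extract_text_tool_calls, the KNOWN_TOOLS subset, the "name"/"parameters" JSON shape, and the generated call IDs are all assumptions made for illustration.

```python
import json
from typing import Any

# Hypothetical subset of the agent's tool names, used to filter out junk.
KNOWN_TOOLS = {"load_skill", "execute_readonly_command", "get_system_report"}


def extract_text_tool_calls(text: str) -> list[dict[str, Any]]:
    """Recover tool calls that the model emitted as JSON text."""
    decoder = json.JSONDecoder()
    calls: list[dict[str, Any]] = []
    seen: set[str] = set()
    pos = 0
    # Scan for every "{" and try to decode one JSON object starting there.
    while (start := text.find("{", pos)) != -1:
        try:
            obj, end = decoder.raw_decode(text, start)
        except json.JSONDecodeError:
            pos = start + 1  # not valid JSON at this brace; keep scanning
            continue
        pos = end
        name = obj.get("name")
        # Assumed shape: arguments live under "parameters" or "arguments".
        args = obj.get("parameters", obj.get("arguments", {}))
        if not isinstance(name, str) or name not in KNOWN_TOOLS:
            continue  # filter: not a call to a known tool
        if not isinstance(args, dict):
            continue  # filter: malformed arguments
        key = json.dumps({"name": name, "args": args}, sort_keys=True)
        if key in seen:
            continue  # deduplicate identical calls
        seen.add(key)
        # Convert to a structured call; the id format is made up here.
        calls.append({"name": name, "args": args, "id": f"call_{len(calls)}"})
    return calls
```

Scanning with raw_decode rather than a regular expression lets nested argument objects parse correctly, and broken fragments are simply skipped.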
Investigation Pipeline
When the agent detects an INC number, it triggers run_investigation — a fully autonomous parallel fault tree analysis pipeline. This is the only part of the system that uses sub-agents; they are internal to the investigation graph and invisible to the outer agent.
The prepare node gathers all context before branching: ServiceNow incident details, Instana monitoring events (last 24h), CMDB relationships, OS detection via system report, and platform-specific skill loading. A circuit breaker skips all branches if the server is unreachable (system report < 500 chars and OS undetectable).
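The circuit breaker condition is simple enough to sketch directly. The function names and the branch ID list below are hypothetical; only the 500-character threshold and the "both conditions" rule come from the description above.

```python
from typing import Optional

MIN_REPORT_CHARS = 500  # threshold stated in the text


def server_unreachable(system_report: str, detected_os: Optional[str]) -> bool:
    """Trip the breaker only when BOTH signals point to an unreachable host."""
    report_too_short = len(system_report) < MIN_REPORT_CHARS
    os_undetectable = not detected_os
    return report_too_short and os_undetectable


def plan_branches(system_report: str, detected_os: Optional[str]) -> list[str]:
    """Return the branch IDs to run, or an empty list if the breaker trips."""
    if server_unreachable(system_report, detected_os):
        return []  # skip all eight branches
    return [f"H{i}" for i in range(1, 9)]
```

Requiring both conditions means a terse but valid report from a known OS still proceeds to investigation.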
Eight parallel branches then execute simultaneously, each a fully independent create_react_agent:
| Branch | Category | Diagnostic Focus |
|---|---|---|
| H1 | OS Resources | vmstat, iostat, top, df, free/svmon, sar |
| H2 | Capacity | Disk usage, large files, swap analysis |
| H3 | Network | netstat/ss, ip addr, ping, traceroute, dig, firewall |
| H4 | Middleware | Logs, connection pools, queue depths, thread dumps |
| H5 | Config Changes | rpm -qa --last, find -newer, config diffs |
| H6 | Dependencies | Upstream/downstream service health checks |
| H7 | Security | Audit logs, failed logins, certificate expiry, SELinux |
| H8 | Scheduled Jobs | Cron logs, TWS conman, scheduled task status |
Each branch executes 2-3 chained command calls, analyzes outputs, and concludes with a status: CONFIRMED, ELIMINATED, or INCONCLUSIVE. A branch_loop_guard limits each to 4 unique commands to prevent infinite loops.
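A loop guard of this kind might look like the sketch below. The class shape and whitespace normalization are assumptions, as is the choice to refuse exact repeats (the text only states the limit of 4 unique commands).

```python
from dataclasses import dataclass, field

MAX_UNIQUE_COMMANDS = 4  # limit stated in the text


@dataclass
class BranchLoopGuard:
    """Tracks the commands one branch has run and enforces the cap."""

    limit: int = MAX_UNIQUE_COMMANDS
    seen: set[str] = field(default_factory=set)

    def allow(self, command: str) -> bool:
        # Collapse whitespace so "df  -h" and "df -h" count as the same command.
        normalized = " ".join(command.split())
        if normalized in self.seen:
            return False  # assumption: a repeat adds no new evidence
        if len(self.seen) >= self.limit:
            return False  # budget of unique commands exhausted
        self.seen.add(normalized)
        return True
```

When allow returns False, the branch would be expected to stop calling tools and conclude with whatever evidence it has.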
Anti-Hallucination
Branch conclusions are built from actual message history, not LLM claims. Commands are extracted from real tool_calls in AI messages, evidence from actual tool result messages. Messages containing "let's assume" or "hypothetical" are filtered out. If no real commands ran but the model "concluded," status is forced to INCONCLUSIVE.
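The following sketch illustrates the idea with plain dictionaries standing in for the agent's message objects. The message shape ("type", "content", "tool_calls") and the function name build_conclusion are assumptions for illustration only.

```python
from typing import Any

SPECULATIVE_MARKERS = ("let's assume", "hypothetical")  # phrases from the text
VALID_STATUSES = {"CONFIRMED", "ELIMINATED", "INCONCLUSIVE"}


def build_conclusion(messages: list[dict[str, Any]], claimed_status: str) -> dict[str, Any]:
    """Rebuild a branch conclusion from what actually happened in the history."""
    commands: list[str] = []
    evidence: list[str] = []
    for msg in messages:
        content = str(msg.get("content", ""))
        if any(marker in content.lower() for marker in SPECULATIVE_MARKERS):
            continue  # drop speculative messages entirely
        if msg.get("type") == "ai":
            # Commands come from real tool calls, never from prose.
            for call in msg.get("tool_calls", []):
                command = call.get("args", {}).get("command")
                if command:
                    commands.append(command)
        elif msg.get("type") == "tool":
            evidence.append(content)  # actual tool output
    status = claimed_status if claimed_status in VALID_STATUSES else "INCONCLUSIVE"
    if not commands:
        status = "INCONCLUSIVE"  # the model "concluded" without running anything
    return {"status": status, "commands": commands, "evidence": evidence}
```

The model's claimed status is kept only when it is one of the three valid values and at least one real command backs it.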
The consolidate node aggregates all 8 branch conclusions, queries the vector store for relevant KB articles, and generates a 13-section RCA report: Executive Summary, Investigation Metadata, Environment Discovery, Fault Tree Summary, Detailed Findings, Correlation Analysis, Five Whys, Root Cause Determination, Recommended Actions, Investigation Gaps, KB References, Manual Intervention, and Lessons Learned.
Command Security
The system implements defense-in-depth with guardrails at two independent levels. Even if the agent-side guardrail is bypassed, the server-side guardrail enforces its own validation — no single point of failure.
Agent-side (3 layers): Blocked binaries denylist (60+ destructive commands) → dangerous shell patterns (30+ regex) → strict allowlist (opt-in, 100+ safe commands). Commands classified as safe, borderline (requires operator approval), or blocked.
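The three agent-side layers can be sketched as one classification function. The lists here are tiny illustrative samples, not the real 60+/30+/100+ entries, and two behaviors are assumptions: only the first binary of the command is inspected, and in non-strict mode a command that is not on the allowlist is treated as borderline.

```python
import re
import shlex

# Illustrative samples only; the real lists are far longer.
BLOCKED_BINARIES = {"rm", "mkfs", "dd", "shutdown", "reboot"}
DANGEROUS_PATTERNS = [
    re.compile(r">\s*/dev/sd"),       # raw writes to block devices
    re.compile(r"\|\s*(sh|bash)\b"),  # piping output into a shell
    re.compile(r";\s*rm\b"),          # destructive command after a separator
]
SAFE_BINARIES = {"df", "uptime", "vmstat", "cat", "ls"}


def classify(command: str, strict_allowlist: bool = False) -> str:
    """Return 'safe', 'borderline', or 'blocked' for a shell command."""
    try:
        tokens = shlex.split(command)
    except ValueError:
        return "blocked"  # unbalanced quotes: refuse rather than guess
    if not tokens:
        return "blocked"
    binary = tokens[0].rsplit("/", 1)[-1]  # strip any leading path
    # Layer 1: blocked binaries denylist.
    if binary in BLOCKED_BINARIES:
        return "blocked"
    # Layer 2: dangerous shell patterns anywhere in the command.
    if any(pattern.search(command) for pattern in DANGEROUS_PATTERNS):
        return "blocked"
    # Layer 3: allowlist; strict mode rejects everything not listed.
    if binary in SAFE_BINARIES:
        return "safe"
    return "blocked" if strict_allowlist else "borderline"
```

A production version would need to inspect every segment of a pipeline or chained command, not just the first token.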
Server-side (4 layers): Binary denylist (always on) → shell patterns (always on) → strict allowlist (opt-in) → LLM Guardian (opt-in, Llama Guard 3). The LLM Guardian classifies commands against 6 safety categories: destructive operations (S1), system state changes (S2), privilege escalation (S3), data exfiltration (S4), malicious payloads (S5), and resource exhaustion (S6).
The tradeoff: agent-side validation adds ~10ms of latency per command, and the LLM Guardian, when enabled, can add more (bounded by its 10s timeout). The ~10ms is negligible compared to SSH execution time, and the defense-in-depth prevents any single bypass from reaching production hosts.
Performance
SSH Connection Pool
The SSH Tool Server replaced per-request ansible-runner invocations with persistent asyncssh connections and channel multiplexing. One TCP connection runs many concurrent commands via separate SSH channels, eliminating per-request handshake overhead.
Figure: connection overhead (ms) and throughput (requests/second) for ansible-runner (legacy) vs. the asyncssh pool (current).
Pool sizing: 5 persistent connections per host × 10 channels per connection = 50 concurrent commands per host. Global cap of 200 connections. Connections recycled hourly (TTL), idle connections evicted after 5 minutes. Background health checker probes every 60 seconds.
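The sizing numbers above can be captured in a small configuration object. The class and field names are hypothetical; the values are the ones quoted in the text, and the "hosts at full pool" figure is simply derived from them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PoolConfig:
    connections_per_host: int = 5
    channels_per_connection: int = 10
    global_max_connections: int = 200
    connection_ttl_s: float = 3600.0      # recycled hourly
    idle_timeout_s: float = 300.0         # evicted after 5 idle minutes
    health_check_interval_s: float = 60.0

    @property
    def max_concurrent_commands_per_host(self) -> int:
        # 5 connections x 10 channels = 50 concurrent commands
        return self.connections_per_host * self.channels_per_connection

    @property
    def hosts_at_full_pool(self) -> int:
        # How many hosts can hold a full pool before hitting the global cap.
        return self.global_max_connections // self.connections_per_host

    def should_close(self, age_s: float, idle_s: float) -> bool:
        """A connection is closed when it is too old or has idled too long."""
        return age_s >= self.connection_ttl_s or idle_s >= self.idle_timeout_s
```

With the defaults, the global cap of 200 connections is reached once 40 hosts each hold a full pool of 5.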
Caching and Budgets
- System report caching: get_system_report results are cached per server FQDN with a 5-minute TTL, preventing redundant calls when 8 investigation branches access the same host
- Tool budgets: write_todos is limited to 2 calls and research_documentation to 3 calls per invocation, which prevents runaway loops
- Ansible lock: asyncio.Lock() serializes SSH Tool Server calls during parallel investigation, preventing concurrent host access conflicts
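The caching and budget mechanisms could be sketched as below. Class names, method names, and the injectable clock are assumptions for illustration; the 5-minute TTL and the per-tool limits are the values given above.

```python
import time
from typing import Callable, Optional


class TTLCache:
    """Per-FQDN cache whose entries expire after ttl_s seconds."""

    def __init__(self, ttl_s: float = 300.0, clock: Callable[[], float] = time.monotonic):
        self._ttl_s = ttl_s
        self._clock = clock  # injectable so expiry can be tested without sleeping
        self._entries: dict[str, tuple[float, str]] = {}

    def put(self, fqdn: str, report: str) -> None:
        self._entries[fqdn] = (self._clock(), report)

    def get(self, fqdn: str) -> Optional[str]:
        entry = self._entries.get(fqdn)
        if entry is None:
            return None
        stored_at, report = entry
        if self._clock() - stored_at >= self._ttl_s:
            del self._entries[fqdn]  # expired: force a fresh report
            return None
        return report


class ToolBudget:
    """Counts tool calls per invocation and refuses calls over the limit."""

    def __init__(self, limits: dict[str, int]):
        self._limits = limits
        self._used: dict[str, int] = {}

    def try_consume(self, tool: str) -> bool:
        limit = self._limits.get(tool)  # None means the tool is unbudgeted
        used = self._used.get(tool, 0)
        if limit is not None and used >= limit:
            return False
        self._used[tool] = used + 1
        return True
```

A fresh ToolBudget would be created for each agent invocation, for example ToolBudget({"write_todos": 2, "research_documentation": 3}).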
Skills System
Skills are platform-specific SRE runbooks stored as SKILL.md files. The system uses progressive disclosure to manage context: the catalog shows only skill names and descriptions (~100 tokens per skill) until the agent explicitly calls load_skill, which reads the full content (~8K tokens) and caches it in agent state.
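Progressive disclosure can be illustrated with a small registry. The class name, the read_skill callback (which would read a SKILL.md file in practice), and the dictionary standing in for agent state are all hypothetical.

```python
from typing import Callable


class SkillRegistry:
    """Shows only names and descriptions until a skill is explicitly loaded."""

    def __init__(self, descriptions: dict[str, str], read_skill: Callable[[str], str]):
        self._descriptions = descriptions
        self._read_skill = read_skill      # e.g. reads the skill's SKILL.md
        self._loaded: dict[str, str] = {}  # stands in for agent state

    def catalog(self) -> str:
        # Cheap summary that is always safe to place in the prompt.
        return "\n".join(
            f"- {name}: {desc}" for name, desc in sorted(self._descriptions.items())
        )

    def load_skill(self, name: str) -> str:
        if name not in self._descriptions:
            raise KeyError(f"unknown skill: {name}")
        if name not in self._loaded:
            # Full content is read once, then served from the cache.
            self._loaded[name] = self._read_skill(name)
        return self._loaded[name]
```

Only the catalog cost is paid up front; the larger runbook body enters the context when the agent asks for it.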
12 skills in total: 11 platform skills (RHEL, AIX, Windows Server, Db2, MQ, WebSphere, TWS, Spectrum Protect, MSSQL, DataStage, Instana) plus a universal response_formatting skill (auto-loaded).
Each skill contains:
- Prohibited commands with 3 tiers: blocked, restricted, deceptive
- Diagnostic commands mapped to each FTA branch (H1-H8)
- Performance thresholds (normal vs. concerning metrics)
- Error codes and platform-specific failure patterns
- Cross-product correlation (e.g., Db2 lock → WAS thread hang → MQ queue backup)
The agent adapts command syntax based on detected OS: AIX uses df -g (not -h), errpt (not journalctl), svmon (not free), lssrc (not systemctl). Db2 commands chain with && and never use semicolons inside db2 "...". MQ diagnostics pipe through runmqsc.
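In the system described here this knowledge lives in the skill runbooks, but the mapping can be pictured as a lookup table. The dictionary layout, the purpose keys, and the chain_db2 helper are hypothetical, and SAMPLE is a placeholder database name.

```python
# Command equivalents named in the text, keyed by a made-up "purpose" label.
OS_COMMANDS = {
    "rhel": {"disk": "df -h", "errors": "journalctl", "memory": "free", "services": "systemctl"},
    "aix": {"disk": "df -g", "errors": "errpt", "memory": "svmon", "services": "lssrc"},
}


def command_for(os_name: str, purpose: str) -> str:
    """Pick the platform-appropriate command for a diagnostic purpose."""
    return OS_COMMANDS[os_name.lower()][purpose]


def chain_db2(statements: list[str]) -> str:
    """Chain Db2 CLP statements with && and reject semicolons inside them."""
    for statement in statements:
        if ";" in statement:
            raise ValueError(f"semicolon not allowed inside db2 \"...\": {statement!r}")
    return " && ".join(f'db2 "{statement}"' for statement in statements)
```

For example, chain_db2(["connect to SAMPLE", "list applications"]) produces a single command line in which the second statement runs only if the connect succeeds.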
Results
Investigation Speed:
- 8 parallel diagnostic branches vs. sequential manual SSH sessions
- Full 13-section RCA report generated autonomously
- Prepare → branch → consolidate pipeline completes while an SRE would still be on the second hypothesis
Safety:
- 7-layer combined guardrail (3 agent-side + 4 server-side) with no single point of failure
- Read-only enforcement on all remote commands
- Anti-hallucination: conclusions built from actual tool outputs, not LLM claims
Platform Coverage:
- 12 skills (11 platform-specific plus response_formatting) covering 3 OS families (RHEL, AIX, Windows)
- 8+ middleware types (Db2, MQ, WAS, TWS, Spectrum Protect, MSSQL, DataStage, Instana)
- Progressive skill loading keeps context under control (~100 tokens catalog vs. ~8K per loaded skill)
Command Execution:
- 100x throughput improvement: asyncssh pool (1,000+ system req/s) vs. ansible-runner (5-10 req/s)
- Under 10ms connection overhead per command vs. 1-3s per handshake
- Zero temp disk I/O, no remote Python dependency
Technology Stack
AI and Orchestration: watsonx.ai (Llama 4 Maverick 17B-128E), LangGraph, LangChain
Vector Search: FAISS (local, sub-millisecond), Milvus (distributed, gRPC+TLS), semantic caching (0.87 cosine threshold, 24h TTL)
Command Execution: asyncssh (persistent pools, channel multiplexing), pywinrm (Windows/NTLM)
Integrations: ServiceNow (ITSM/CMDB), Instana (APM), Jira, Slack (Socket Mode)
Infrastructure: FastAPI, OpenShift (Kubernetes), systemd, SQLite/PostgreSQL
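The semantic caching entry under Vector Search (0.87 cosine threshold, 24h TTL) can be illustrated with a small in-memory sketch. This is not the production cache: the class, the linear scan, and the choice to treat a similarity equal to the threshold as a hit are assumptions, and real embeddings would come from an embedding model rather than hand-written vectors.

```python
import math
import time
from typing import Callable, Optional, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class SemanticCache:
    """Returns a cached answer when a query embedding is close to a stored one."""

    def __init__(self, threshold: float = 0.87, ttl_s: float = 24 * 3600,
                 clock: Callable[[], float] = time.monotonic):
        self._threshold = threshold
        self._ttl_s = ttl_s
        self._clock = clock
        self._entries: list[tuple[float, list[float], str]] = []

    def put(self, embedding: Sequence[float], answer: str) -> None:
        self._entries.append((self._clock(), list(embedding), answer))

    def get(self, embedding: Sequence[float]) -> Optional[str]:
        now = self._clock()
        # Drop entries older than the TTL before searching.
        self._entries = [e for e in self._entries if now - e[0] < self._ttl_s]
        best_score, best_answer = 0.0, None
        for _, stored, answer in self._entries:
            score = cosine(embedding, stored)
            if score >= self._threshold and score > best_score:
                best_score, best_answer = score, answer
        return best_answer
```

A hit skips the vector search and enrichment work entirely, which is the point of caching near-duplicate questions.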
Key Design Decisions
| Decision | Rationale | Tradeoff |
|---|---|---|
| Single agent over supervisor | Eliminates intent classification latency and misrouting. LLM sees all 12 tools and picks the right ones. Simpler debugging. | Larger tool namespace per call, but 12 tools is within limits |
| run_investigation as a tool | Outer agent stays simple: it calls one tool and relays the report. Branch agents are internal and invisible. | Investigation pipeline is opaque to the outer agent |
| Dual guardrail (agent + server) | Agent catches issues early (saves network round-trip). Server is authoritative — prevents bypass. | ~10ms added latency per command (negligible vs. SSH time) |
| asyncssh persistent pools | Replaces 1-3s ansible-runner handshake with under 10ms channel open. Single Uvicorn worker preserves in-process pool. | Single worker limits vertical scaling; horizontal scaling via multiple hosts |
| Progressive skill disclosure | 12 skills × ~8K tokens = ~96K if all loaded. Catalog at ~100 tokens/skill, loaded on-demand. | Extra tool call per skill load; agent must decide which skills to load |
| No checkpointer in Slack mode | Full thread history fetched from Slack each invocation. Avoids state drift between bot memory and user-visible thread. | Full thread fetch adds latency; no cross-thread memory |