Prateek Bose

AI Solutions Architecture

Agentic AI Incident Resolution System

Incident investigation across 8 diagnostic dimensions took SRE teams hours of manual SSH sessions. A single AI agent with 12 tools and parallel fault tree analysis resolves it in minutes.

The Problem

Every incident followed the same pattern: an SRE would SSH into a server, run diagnostic commands one by one, correlate outputs across systems, then repeat for each hypothesis. The eight dimensions (OS resources, capacity, network, middleware, config changes, dependent services, security, scheduled jobs) were each investigated sequentially.

This required deep platform-specific knowledge. AIX uses errpt and svmon where RHEL uses journalctl and free. Db2 commands chain with && but never semicolons inside db2 "...". MQ diagnostics pipe through runmqsc. Each platform has its own syntax, failure patterns, and thresholds — knowledge trapped in individual engineers' heads.

No parallel execution, no reuse of investigation patterns, no structured correlation across findings. A single incident touching multiple middleware layers could take hours of manual work.

Architecture

The system runs as three independently deployed services. The SRE Deep Agents service hosts a Slack bot and a single ReAct agent. The SRE Tools API provides a FastAPI backend handling external integrations, vector search, and AI enrichment. The SSH Tool Server manages persistent SSH connection pools for high-performance remote command execution.

Key Decision

We chose a single agent over a supervisor-specialist architecture. A supervisor adds intent classification latency and introduces misrouting risk — if the classifier picks the wrong specialist, the entire response fails. With a single create_react_agent seeing all 12 tools, the LLM's own reasoning handles routing implicitly. The cost is a larger tool namespace per LLM call, but with 12 tools this is well within context limits. Debugging is simpler too — one agent trace vs. supervisor + specialist traces.

Single Agent Design

The agent is a LangGraph create_react_agent with all 12 tools available simultaneously. Intent is detected from message context — an INC number triggers investigation, server names trigger command execution, symptoms trigger troubleshooting, how-to questions trigger documentation search. No separate classifier, no handoffs.

| Tool | Purpose |
| --- | --- |
| write_todos | Plan multi-step tasks (3+ steps) |
| load_skill | Load platform-specific runbook (RHEL, AIX, Db2, etc.) |
| execute_readonly_command | Run shell commands on remote servers (guardrailed) |
| get_system_report | Detect OS, middleware inventory, disk/memory |
| multi_server_fanout | Same command across up to 5 servers |
| research_documentation | RAG search across IBM/Red Hat/internal docs |
| fetch_jira_issue | Jira issue details |
| fetch_servicenow_incident | ServiceNow incident details |
| get_server_relationships | CMDB upstream/downstream dependencies |
| get_instana_host_events | Last 24h monitoring events |
| check_user_authorization | RBAC verification |
| run_investigation | Launch parallel FTA pipeline (8 branches) |

Every server interaction follows the same workflow. The agent detects the OS, loads platform-specific skills, executes commands from the loaded runbook, and analyzes results before deciding whether more data is needed.

A pre-model hook (sre_pre_model_hook) runs before every LLM call, injecting 8 conditional context elements: auto-loaded skills, skills catalog, planning instructions, current todos, active skills reminder, think-after-commands guard, documentation guard, and repeated tool call prevention. A post-model hook (fix_text_tool_calls) handles Llama 4 Maverick's tendency to emit tool calls as JSON text rather than structured objects — extracting, converting, deduplicating, and filtering them.
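
The text-to-structured conversion performed by the post-model hook can be sketched roughly as follows. This is a minimal illustration, not the actual fix_text_tool_calls implementation: the function name extract_text_tool_calls, the KNOWN_TOOLS subset, and the assumed JSON shape (a "name" key plus "arguments" or "parameters") are all assumptions made for the example.

```python
import json

# Hypothetical subset of the agent's tool names, used to filter out unknown calls.
KNOWN_TOOLS = {"load_skill", "execute_readonly_command", "get_system_report"}


def extract_text_tool_calls(text: str) -> list:
    """Return structured tool calls found as JSON objects inside free text."""
    decoder = json.JSONDecoder()
    calls, seen = [], set()
    pos = text.find("{")
    while pos != -1:
        try:
            obj, end = decoder.raw_decode(text, pos)
        except json.JSONDecodeError:
            pos = text.find("{", pos + 1)  # not valid JSON here; try the next brace
            continue
        pos = text.find("{", end)  # resume scanning after this object
        name = obj.get("name")
        args = obj.get("arguments", obj.get("parameters", {}))
        if not isinstance(name, str) or name not in KNOWN_TOOLS:
            continue  # filter: not a call to a known tool
        if not isinstance(args, dict):
            continue  # filter: malformed arguments
        key = (name, json.dumps(args, sort_keys=True))
        if key in seen:
            continue  # deduplicate identical calls
        seen.add(key)
        calls.append({"name": name, "args": args})
    return calls
```

Scanning with raw_decode rather than a regular expression lets the sketch handle nested argument objects, which a flat brace-matching pattern would miss.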

Investigation Pipeline

When the agent detects an INC number, it triggers run_investigation — a fully autonomous parallel fault tree analysis pipeline. This is the only part of the system that uses sub-agents; they are internal to the investigation graph and invisible to the outer agent.

The prepare node gathers all context before branching: ServiceNow incident details, Instana monitoring events (last 24h), CMDB relationships, OS detection via system report, and platform-specific skill loading. A circuit breaker skips all branches if the server is unreachable (system report < 500 chars and OS undetectable).
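
The circuit-breaker condition described above amounts to a two-part check. A minimal sketch, in which the function name and the representation of an undetectable OS as None or an empty string are assumptions:

```python
from typing import Optional

MIN_REPORT_CHARS = 500  # threshold stated in the text


def should_skip_branches(system_report: str, detected_os: Optional[str]) -> bool:
    """True when the server looks unreachable: short report AND no detected OS."""
    report_too_short = len(system_report) < MIN_REPORT_CHARS
    os_undetectable = not detected_os
    return report_too_short and os_undetectable
```

Requiring both conditions means a terse but valid report from a known OS still proceeds to the branches.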

Eight parallel branches then execute simultaneously, each a fully independent create_react_agent:

| Branch | Category | Diagnostic Focus |
| --- | --- | --- |
| H1 | OS Resources | vmstat, iostat, top, df, free/svmon, sar |
| H2 | Capacity | Disk usage, large files, swap analysis |
| H3 | Network | netstat/ss, ip addr, ping, traceroute, dig, firewall |
| H4 | Middleware | Logs, connection pools, queue depths, thread dumps |
| H5 | Config Changes | rpm -qa --last, find -newer, config diffs |
| H6 | Dependencies | Upstream/downstream service health checks |
| H7 | Security | Audit logs, failed logins, certificate expiry, SELinux |
| H8 | Scheduled Jobs | Cron logs, TWS conman, scheduled task status |
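
The fan-out across the eight branches can be illustrated with plain asyncio. The real pipeline is a LangGraph graph whose branches are create_react_agent instances; the run_branch placeholder and the choice to downgrade a crashed branch to INCONCLUSIVE are assumptions made for this sketch.

```python
import asyncio

BRANCHES = {
    "H1": "OS Resources", "H2": "Capacity", "H3": "Network", "H4": "Middleware",
    "H5": "Config Changes", "H6": "Dependencies", "H7": "Security",
    "H8": "Scheduled Jobs",
}


async def run_branch(branch_id: str, category: str) -> dict:
    """Placeholder for a branch agent; a real branch would run diagnostics."""
    await asyncio.sleep(0)
    return {"branch": branch_id, "category": category, "status": "INCONCLUSIVE"}


async def run_all_branches(runner=run_branch) -> list:
    """Run every branch concurrently; a failed branch does not sink the others."""
    ids = list(BRANCHES)
    results = await asyncio.gather(
        *(runner(b, BRANCHES[b]) for b in ids), return_exceptions=True
    )
    conclusions = []
    for branch_id, result in zip(ids, results):
        if isinstance(result, Exception):
            # Assumption: an errored branch is reported as INCONCLUSIVE.
            result = {"branch": branch_id, "category": BRANCHES[branch_id],
                      "status": "INCONCLUSIVE", "error": str(result)}
        conclusions.append(result)
    return conclusions
```

Calling asyncio.run(run_all_branches()) returns one conclusion per branch in H1 to H8 order, ready for a consolidation step.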

Each branch executes 2-3 chained command calls, analyzes outputs, and concludes with a status: CONFIRMED, ELIMINATED, or INCONCLUSIVE. A branch_loop_guard limits each to 4 unique commands to prevent infinite loops.
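
A loop guard of this kind can be sketched as a small counter over unique commands. The class below is illustrative only: the whitespace normalization and the decision to reject exact repeats are assumptions, since the text states only the limit of 4 unique commands.

```python
MAX_UNIQUE_COMMANDS = 4  # limit stated in the text


class BranchLoopGuard:
    """Tracks the unique commands a branch has run and enforces a cap."""

    def __init__(self, limit: int = MAX_UNIQUE_COMMANDS):
        self.limit = limit
        self.seen = set()

    def allow(self, command: str) -> bool:
        normalized = " ".join(command.split())  # collapse whitespace differences
        if normalized in self.seen:
            return False  # assumption: a repeat yields no new evidence
        if len(self.seen) >= self.limit:
            return False  # budget exhausted; the branch should conclude
        self.seen.add(normalized)
        return True
```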

Anti-Hallucination

Branch conclusions are built from actual message history, not LLM claims. Commands are extracted from real tool_calls in AI messages, evidence from actual tool result messages. Messages containing "let's assume" or "hypothetical" are filtered out. If no real commands ran but the model "concluded," status is forced to INCONCLUSIVE.
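
The evidence-first rule above can be sketched as a pass over the message history. The message shape used here (dicts with "role", "content", and "tool_calls") is a simplification assumed for the example, not the actual LangChain message classes, and build_conclusion is a hypothetical name.

```python
SPECULATIVE_MARKERS = ("let's assume", "hypothetical")
VALID_STATUSES = {"CONFIRMED", "ELIMINATED", "INCONCLUSIVE"}


def build_conclusion(messages: list, claimed_status: str) -> dict:
    """Derive a branch conclusion from what actually ran, not from model claims."""
    commands, evidence = [], []
    for msg in messages:
        content = msg.get("content") or ""
        if any(marker in content.lower() for marker in SPECULATIVE_MARKERS):
            continue  # drop speculative messages entirely
        if msg.get("role") == "ai":
            for call in msg.get("tool_calls", []):
                command = call.get("args", {}).get("command")
                if command:
                    commands.append(command)
        elif msg.get("role") == "tool" and content:
            evidence.append(content)
    status = claimed_status if claimed_status in VALID_STATUSES else "INCONCLUSIVE"
    if not commands:
        status = "INCONCLUSIVE"  # nothing ran, so no conclusion is trusted
    return {"status": status, "commands": commands, "evidence": evidence}
```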

The consolidate node aggregates all 8 branch conclusions, queries the vector store for relevant KB articles, and generates a 13-section RCA report: Executive Summary, Investigation Metadata, Environment Discovery, Fault Tree Summary, Detailed Findings, Correlation Analysis, Five Whys, Root Cause Determination, Recommended Actions, Investigation Gaps, KB References, Manual Intervention, and Lessons Learned.

Command Security

The system implements defense-in-depth with guardrails at two independent levels. Even if the agent-side guardrail is bypassed, the server-side guardrail enforces its own validation — no single point of failure.

Agent-side (3 layers): Blocked binaries denylist (60+ destructive commands) → dangerous shell patterns (30+ regex) → strict allowlist (opt-in, 100+ safe commands). Commands classified as safe, borderline (requires operator approval), or blocked.
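
The three agent-side layers can be sketched as a short classification function. The lists below are tiny illustrative subsets, not the real 60+ binaries, 30+ patterns, or 100+ allowlisted commands, and the rule that an unlisted command is borderline unless strict mode is on is an assumption about how the layers combine.

```python
import re
import shlex

BLOCKED_BINARIES = {"rm", "mkfs", "dd", "shutdown", "reboot"}  # illustrative subset
DANGEROUS_PATTERNS = [
    re.compile(r">\s*/dev/sd"),       # writing to a raw disk device
    re.compile(r"\|\s*(sh|bash)\b"),  # piping output into a shell
    re.compile(r";\s*rm\b"),          # chained delete
]
SAFE_BINARIES = {"df", "uptime", "vmstat", "cat", "ls"}  # illustrative subset


def classify(command: str, strict_allowlist: bool = False) -> str:
    """Return 'safe', 'borderline', or 'blocked' for a shell command."""
    try:
        tokens = shlex.split(command)
    except ValueError:
        return "blocked"  # unbalanced quotes: refuse rather than guess
    if not tokens:
        return "blocked"
    binary = tokens[0].rsplit("/", 1)[-1]  # strip any leading path
    if binary in BLOCKED_BINARIES:                          # layer 1: denylist
        return "blocked"
    if any(p.search(command) for p in DANGEROUS_PATTERNS):  # layer 2: patterns
        return "blocked"
    if binary in SAFE_BINARIES:                             # layer 3: allowlist
        return "safe"
    return "blocked" if strict_allowlist else "borderline"
```

Under these assumptions, classify("lsof -i") returns "borderline" by default and "blocked" when strict_allowlist=True.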

Server-side (4 layers): Binary denylist (always on) → shell patterns (always on) → strict allowlist (opt-in) → LLM Guardian (opt-in, Llama Guard 3). The LLM Guardian classifies commands against 6 safety categories: destructive operations (S1), system state changes (S2), privilege escalation (S3), data exfiltration (S4), malicious payloads (S5), and resource exhaustion (S6).

The tradeoff: dual guardrails add ~10ms latency per command (agent-side validation) plus potential LLM Guardian latency (10s timeout) when enabled. This is negligible compared to SSH execution time, and the defense-in-depth prevents any single bypass from reaching production hosts.

Performance

SSH Connection Pool

The SSH Tool Server replaced per-request ansible-runner invocations with persistent asyncssh connections and channel multiplexing. One TCP connection runs many concurrent commands via separate SSH channels, eliminating per-request handshake overhead.

[Chart: connection overhead (ms) and throughput (requests/second) for ansible-runner (legacy) vs. the asyncssh pool (current).]

Pool sizing: 5 persistent connections per host × 10 channels per connection = 50 concurrent commands per host. Global cap of 200 connections. Connections recycled hourly (TTL), idle connections evicted after 5 minutes. Background health checker probes every 60 seconds.
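
The per-host limit that follows from this sizing can be illustrated with an asyncio semaphore. This sketch only models the concurrency cap; it does not open SSH connections, and the HostSlots class and the sleep standing in for remote execution are assumptions for the example.

```python
import asyncio

CONNECTIONS_PER_HOST = 5
CHANNELS_PER_CONNECTION = 10
HOST_CAPACITY = CONNECTIONS_PER_HOST * CHANNELS_PER_CONNECTION  # 50


class HostSlots:
    """Caps in-flight commands for one host and records the observed peak."""

    def __init__(self, capacity: int = HOST_CAPACITY):
        self._sem = asyncio.Semaphore(capacity)
        self.in_flight = 0
        self.peak = 0

    async def run(self, command: str) -> str:
        async with self._sem:  # wait for a free channel slot
            self.in_flight += 1
            self.peak = max(self.peak, self.in_flight)
            try:
                await asyncio.sleep(0.01)  # stand-in for remote execution
                return f"ran: {command}"
            finally:
                self.in_flight -= 1


async def demo(n_commands: int = 120) -> int:
    slots = HostSlots()
    await asyncio.gather(*(slots.run(f"cmd-{i}") for i in range(n_commands)))
    return slots.peak
```

Submitting more commands than the capacity queues the excess rather than opening more channels, which is the behavior the sizing formula implies.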

Caching and Budgets

  • System report caching: get_system_report results cached per server FQDN with 5-minute TTL, preventing redundant calls when 8 investigation branches access the same host
  • Tool budgets: write_todos limited to 2 calls, research_documentation to 3 calls per invocation — prevents runaway loops
  • Ansible lock: asyncio.Lock() serializes SSH Tool Server calls during parallel investigation, preventing concurrent host access conflicts
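
The system report cache can be sketched as a small TTL map keyed by FQDN. The class name, the injectable clock, and the compute callback are assumptions made so the example is testable; a concurrent version would also need a per-key lock so that eight branches arriving at once trigger only one fetch.

```python
import time


class TTLCache:
    """Caches one value per key for a fixed time-to-live."""

    def __init__(self, ttl_seconds: float = 300.0, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}  # key -> (expiry time, value)

    def get_or_compute(self, key, compute):
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]  # still fresh: reuse the cached value
        value = compute(key)  # missing or expired: fetch again
        self._entries[key] = (now + self._ttl, value)
        return value
```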

Skills System

Skills are platform-specific SRE runbooks stored as SKILL.md files. The system uses progressive disclosure to manage context: the catalog shows only skill names and descriptions (~100 tokens per skill) until the agent explicitly calls load_skill, which reads the full content (~8K tokens) and caches it in agent state.
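
Progressive disclosure can be sketched as a store that exposes a cheap catalog and reads full content only on request. The SkillStore class and the read_skill callback are hypothetical; in the real system the content comes from SKILL.md files and the cache lives in agent state.

```python
class SkillStore:
    """Shows short descriptions up front; loads full skill text on demand."""

    def __init__(self, descriptions: dict, read_skill):
        self.descriptions = descriptions  # skill name -> one-line description
        self._read_skill = read_skill     # callable: skill name -> full text
        self.loaded = {}                  # cache of full skill content

    def catalog(self) -> str:
        """Compact listing that is cheap to include in every prompt."""
        return "\n".join(
            f"- {name}: {desc}" for name, desc in sorted(self.descriptions.items())
        )

    def load_skill(self, name: str) -> str:
        if name not in self.descriptions:
            raise KeyError(f"unknown skill: {name}")
        if name not in self.loaded:
            self.loaded[name] = self._read_skill(name)  # read once, then cache
        return self.loaded[name]
```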

12 skills in total: 11 platform skills (RHEL, AIX, Windows Server, Db2, MQ, WebSphere, TWS, Spectrum Protect, MSSQL, DataStage, Instana) plus a universal response_formatting skill (auto-loaded).

Each skill contains:

  • Prohibited commands with 3 tiers: blocked, restricted, deceptive
  • Diagnostic commands mapped to each FTA branch (H1-H8)
  • Performance thresholds (normal vs. concerning metrics)
  • Error codes and platform-specific failure patterns
  • Cross-product correlation (e.g., Db2 lock → WAS thread hang → MQ queue backup)

The agent adapts command syntax based on detected OS: AIX uses df -g (not -h), errpt (not journalctl), svmon (not free), lssrc (not systemctl). Db2 commands chain with && and never use semicolons inside db2 "...". MQ diagnostics pipe through runmqsc.
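
The syntax adaptation above can be pictured as a lookup keyed by task and detected OS, plus a helper that enforces the Db2 chaining rule. In the real system this knowledge lives in the skill runbooks rather than in code; the task names, function names, and the SAMPLE database name in the usage note are assumptions for this sketch.

```python
# Command pairs taken from the text; flags beyond those shown are omitted.
COMMANDS_BY_OS = {
    "disk_usage": {"rhel": "df -h", "aix": "df -g"},
    "error_log": {"rhel": "journalctl", "aix": "errpt"},
    "memory": {"rhel": "free", "aix": "svmon"},
    "services": {"rhel": "systemctl", "aix": "lssrc"},
}


def command_for(task: str, detected_os: str) -> str:
    """Pick the platform-appropriate command for a diagnostic task."""
    per_os = COMMANDS_BY_OS[task]
    key = detected_os.strip().lower()
    if key not in per_os:
        raise ValueError(f"no {task} command known for OS {detected_os!r}")
    return per_os[key]


def chain_db2(*statements: str) -> str:
    """Chain Db2 statements with &&, rejecting semicolons inside db2 "..."."""
    for statement in statements:
        if ";" in statement:
            raise ValueError("semicolons are not allowed inside db2 \"...\"")
    return " && ".join(f'db2 "{s}"' for s in statements)
```

For example, chain_db2("connect to SAMPLE", "list applications") produces db2 "connect to SAMPLE" && db2 "list applications".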

Results

Investigation Speed:

  • 8 parallel diagnostic branches vs. sequential manual SSH sessions
  • Full 13-section RCA report generated autonomously
  • Prepare → branch → consolidate pipeline completes while an SRE would still be on the second hypothesis

Safety:

  • 7-layer combined guardrail (3 agent-side + 4 server-side) with no single point of failure
  • Read-only enforcement on all remote commands
  • Anti-hallucination: conclusions built from actual tool outputs, not LLM claims

Platform Coverage:

  • 12 skills (11 platform-specific plus response_formatting) covering 3 OS families (RHEL, AIX, Windows)
  • 8+ middleware types (Db2, MQ, WAS, TWS, Spectrum Protect, MSSQL, DataStage, Instana)
  • Progressive skill loading keeps context under control (~100 tokens catalog vs. ~8K per loaded skill)

Command Execution:

  • 100x or greater throughput improvement: asyncssh pool (1,000+ req/s system-wide) vs. ansible-runner (5-10 req/s)
  • Under 10ms connection overhead per command vs. 1-3s per handshake
  • Zero temp disk I/O, no remote Python dependency

Technology Stack

AI and Orchestration: watsonx.ai (Llama 4 Maverick 17B-128E), LangGraph, LangChain

Vector Search: FAISS (local, sub-millisecond), Milvus (distributed, gRPC+TLS), semantic caching (0.87 cosine threshold, 24h TTL)

Command Execution: asyncssh (persistent pools, channel multiplexing), pywinrm (Windows/NTLM)

Integrations: ServiceNow (ITSM/CMDB), Instana (APM), Jira, Slack (Socket Mode)

Infrastructure: FastAPI, OpenShift (Kubernetes), systemd, SQLite/PostgreSQL

Key Design Decisions

| Decision | Rationale | Tradeoff |
| --- | --- | --- |
| Single agent over supervisor | Eliminates intent classification latency and misrouting. LLM sees all 12 tools and picks the right ones. Simpler debugging. | Larger tool namespace per call, but 12 tools is within limits |
| run_investigation as a tool | Outer agent stays simple: it calls one tool and relays the report. Branch agents are internal and invisible. | Investigation pipeline is opaque to the outer agent |
| Dual guardrail (agent + server) | Agent catches issues early (saves network round-trip). Server is authoritative and prevents bypass. | ~10ms added latency per command (negligible vs. SSH time) |
| asyncssh persistent pools | Replaces 1-3s ansible-runner handshake with under 10ms channel open. Single Uvicorn worker preserves in-process pool. | Single worker limits vertical scaling; horizontal scaling via multiple hosts |
| Progressive skill disclosure | 12 skills × ~8K tokens = ~96K if all loaded. Catalog at ~100 tokens/skill, loaded on-demand. | Extra tool call per skill load; agent must decide which skills to load |
| No checkpointer in Slack mode | Full thread history fetched from Slack each invocation. Avoids state drift between bot memory and user-visible thread. | Full thread fetch adds latency; no cross-thread memory |