Prateek Bose

AI Solutions Architecture

Agentic AI Incident Resolution System

An enterprise AI platform that automates incident resolution workflows using watsonx Orchestrate, vector databases, and intelligent automation tools.

Overview

The system is an SRE assistant that coordinates specialized tools for automated incident resolution through a lightweight orchestration layer.

Core Approach: watsonx Orchestrate handles routing and flow control; external tools perform the computationally heavy tasks

Architecture

Orchestration: watsonx Orchestrate provides intent classification, tool routing, and flow control

Tools: MCP servers and REST services handle retrieval, planning, execution, and verification

System Components

1. Intake & Classification

Extracts incident context (service, environment, severity, symptoms, error codes) and routes the incident to the appropriate tools
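
The extraction step can be sketched as follows. This is a minimal illustration that assumes incidents arrive as free text; the `IncidentContext` type, the `ENVIRONMENTS` and `SEVERITIES` vocabularies, and the error-code pattern are hypothetical and not taken from the actual system.

```python
import re
from dataclasses import dataclass, field
from typing import List, Optional

# Hypothetical vocabularies; a real deployment would likely load these from the CMDB.
ENVIRONMENTS = ("production", "staging", "dev")
SEVERITIES = ("sev1", "sev2", "sev3", "sev4")


@dataclass
class IncidentContext:
    service: Optional[str] = None
    environment: Optional[str] = None
    severity: Optional[str] = None
    error_codes: List[str] = field(default_factory=list)
    symptoms: str = ""


def extract_context(text: str, known_services: List[str]) -> IncidentContext:
    """Pull structured fields out of a free-text incident description."""
    lowered = text.lower()
    ctx = IncidentContext(symptoms=text.strip())
    # First known service, environment, and severity mentioned in the text (substring match).
    ctx.service = next((s for s in known_services if s.lower() in lowered), None)
    ctx.environment = next((e for e in ENVIRONMENTS if e in lowered), None)
    ctx.severity = next((s for s in SEVERITIES if s in lowered), None)
    # Illustrative pattern: 3-5 letters, 3-5 digits, one trailing letter (e.g. SQL0911N).
    ctx.error_codes = re.findall(r"\b[A-Z]{3,5}\d{3,5}[A-Z]\b", text)
    return ctx
```

Plain substring matching keeps the sketch short; a production classifier would more plausibly rely on the LLM or on CMDB lookups to avoid false matches.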

2. Hybrid Retrieval

Primary: Milvus vector search across IBM documentation (top_k=5, threshold=0.82)

Fallback: FAISS for internal SOPs and automation catalog
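
A minimal sketch of the primary/fallback logic, under two assumptions that the description above does not state: scores are similarities where higher is better, and the fallback is used only when no primary hit clears the threshold. The two search backends are passed in as plain callables so that no particular Milvus or FAISS client API is implied; `retrieve`, `Hit`, and `SearchFn` are hypothetical names.

```python
from typing import Callable, List, Tuple

Hit = Tuple[str, float]                      # (document id, similarity score)
SearchFn = Callable[[str, int], List[Hit]]   # (query, top_k) -> hits


def retrieve(query: str, primary: SearchFn, fallback: SearchFn,
             top_k: int = 5, threshold: float = 0.82) -> Tuple[str, List[Hit]]:
    """Return (source, hits), using the fallback when no primary hit clears the threshold."""
    hits = [h for h in primary(query, top_k) if h[1] >= threshold]
    if hits:
        return "primary", hits
    # Assumption: fallback results are returned unfiltered, capped at top_k.
    return "fallback", fallback(query, top_k)[:top_k]
```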

3. Intelligent Planning

Agentic RAG validates preconditions, analyzes Instana data, and requests approvals for risky operations

4. Safe Execution

Ansible automation with mandatory dry-runs, RBAC enforcement, and automatic rollback capability
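
The dry-run-then-apply sequence might look like the sketch below. It assumes the `ansible-playbook` CLI is on the PATH and uses its `--check` and `--diff` flags for the dry run; the function names, the injected `runner`, and the returned status strings are hypothetical. RBAC is assumed to be enforced by the Ansible platform itself and is not modeled here.

```python
import subprocess
from typing import Callable, List, Sequence

Runner = Callable[[Sequence[str], int], int]   # (argv, timeout_s) -> exit code


def ansible_cmd(playbook: str, inventory: str, check: bool = False) -> List[str]:
    """Build an ansible-playbook command line; --check requests a dry run."""
    cmd = ["ansible-playbook", "-i", inventory, playbook]
    if check:
        cmd += ["--check", "--diff"]
    return cmd


def run_subprocess(argv: Sequence[str], timeout_s: int) -> int:
    """Default runner: execute the command and return its exit code."""
    try:
        return subprocess.run(list(argv), timeout=timeout_s).returncode
    except subprocess.TimeoutExpired:
        return 124  # same exit code that timeout(1) uses for a timed-out command


def safe_execute(playbook: str, rollback: str, inventory: str,
                 runner: Runner = run_subprocess, timeout_s: int = 300) -> str:
    """Dry-run first, then apply; run the rollback playbook if the apply fails."""
    if runner(ansible_cmd(playbook, inventory, check=True), timeout_s) != 0:
        return "dry_run_failed"
    if runner(ansible_cmd(playbook, inventory), timeout_s) == 0:
        return "applied"
    runner(ansible_cmd(rollback, inventory), timeout_s)
    return "rolled_back"
```

Injecting the runner keeps the gate logic testable without a live Ansible installation.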

Hybrid RAG Strategy

HyDE + BM25 Retrieval:

  1. Parse incident symptoms and apply platform/environment filters
  2. Generate hypothetical documents (HyDE) and encode them as embeddings
  3. Dual retrieval: HyDE embedding search plus BM25 keyword search
  4. Fuse results, validate preconditions, and re-rank
  5. Select best template based on composite score
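
The fusion in step 4 is not specified further here. One common choice is reciprocal rank fusion (RRF), sketched below as an assumption rather than as the method actually used. Each input is a list of template ids ordered best-first; `k=60` is the constant commonly used with RRF.

```python
from collections import defaultdict
from typing import Dict, List


def reciprocal_rank_fusion(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Fuse several best-first rankings of document ids into one ranking."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id for a deterministic order.
    return sorted(scores, key=lambda d: (-scores[d], d))


hyde_hits = ["t1", "t2", "t3"]   # hypothetical ids from the HyDE embedding search
bm25_hits = ["t2", "t4", "t1"]   # hypothetical ids from the BM25 keyword search
fused = reciprocal_rank_fusion([hyde_hits, bm25_hits])
```

RRF uses only ranks, so the incomparable score scales of embedding similarity and BM25 never have to be normalized against each other.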

Selection Criteria:

  • Platform/environment match (mandatory)
  • Preconditions satisfied (all must be true)
  • Risk level (prefer lowest)
  • Verification coverage (prefer highest)
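
These criteria can be read as two hard filters followed by two preferences. The sketch below assumes a lexicographic ordering (risk first, then coverage) in place of a weighted composite score; the `Template` fields and the `RISK_ORDER` mapping are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

RISK_ORDER = {"low": 0, "medium": 1, "high": 2}


@dataclass
class Template:
    name: str
    platform: str
    environment: str
    preconditions: Dict[str, bool]   # precondition name -> currently satisfied?
    risk: str                        # "low", "medium", or "high"
    verification_coverage: float     # fraction of health checks covered, 0.0 to 1.0


def select_template(candidates: List[Template], platform: str,
                    environment: str) -> Optional[Template]:
    """Apply the mandatory filters, then prefer lowest risk and highest coverage."""
    eligible = [
        t for t in candidates
        if t.platform == platform
        and t.environment == environment
        and all(t.preconditions.values())
    ]
    if not eligible:
        return None
    return min(eligible, key=lambda t: (RISK_ORDER[t.risk], -t.verification_coverage))
```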

Technology Stack

AI & Orchestration:

  • watsonx Orchestrate for agent coordination
  • watsonx.ai for LLM inference and planning
  • Milvus for vector similarity search (10M+ documents)
  • FAISS for internal SOPs and policies

Automation & Monitoring:

  • Ansible Automation Platform with RBAC
  • Instana APM for health verification
  • ServiceNow for ITSM/CMDB integration

Supported Platforms:

  • Operating Systems: RHEL, AIX, Windows Server
  • IBM Middleware: Db2, WebSphere, MQ, Tivoli, DataStage

Safety & Guardrails

Execution Controls:

  • Mandatory dry-run before implementation
  • RBAC enforcement at Ansible layer
  • Explicit approval for destructive operations
  • Production environment gates
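
A minimal sketch of how the approval decision could be expressed. It assumes that the production gate means every production change needs sign-off and that only low-risk, non-destructive work outside production runs unattended; the function name and its parameters are hypothetical.

```python
def requires_approval(risk: str, environment: str, destructive: bool) -> bool:
    """Return True when a human must approve before execution."""
    if destructive:
        return True                      # explicit approval for destructive operations
    if environment == "production":
        return True                      # assumed meaning of the production gate
    return risk in ("medium", "high")    # only low-risk work proceeds without approval
```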

Reliability:

  • Configurable timeouts (default 300s)
  • Idempotent playbook design
  • Mandatory rollback templates
  • Automated verification gates

Security:

  • mTLS for inter-service communication
  • Vault integration for credentials
  • Audit logging for all actions

Performance Targets

Response Times & Availability:

  • Retrieval operations: Less than 10s (P99)
  • Automated resolutions: Less than 20 min MTTR
  • System availability: 99.9%

Scalability:

  • 100 concurrent incidents
  • 1000 requests/minute per tenant
  • Milvus sharding for 10M+ documents
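
One way to enforce the per-tenant request limit is a sliding-window counter, sketched below. This is an assumption about the mechanism, which is not described here; `TenantRateLimiter` is a hypothetical name, and the caller supplies the timestamp so the behavior stays deterministic.

```python
from collections import defaultdict, deque
from typing import Deque, Dict


class TenantRateLimiter:
    """Sliding-window limiter: at most `limit` requests per `window_s` seconds per tenant."""

    def __init__(self, limit: int = 1000, window_s: float = 60.0) -> None:
        self.limit = limit
        self.window_s = window_s
        self._events: Dict[str, Deque[float]] = defaultdict(deque)

    def allow(self, tenant: str, now: float) -> bool:
        """Record and admit the request if the tenant is under its limit at time `now`."""
        events = self._events[tenant]
        # Drop timestamps that have aged out of the window.
        while events and now - events[0] >= self.window_s:
            events.popleft()
        if len(events) >= self.limit:
            return False
        events.append(now)
        return True
```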

Workflow Example

Incident Resolution Flow:

  1. Submit incident with context and symptoms
  2. Retrieve similar past incidents from vector database
  3. Plan remediation strategy using AI reflection
  4. Request approval for medium- and high-risk operations
  5. Execute dry-run, then actual playbook
  6. Verify health metrics via Instana
  7. Rollback automatically if degradation detected
  8. Update ServiceNow ticket with results
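
The flow above can be sketched as a small driver that calls one function per stage. The stage names, the `steps` mapping, and the `risk` key on the incident are hypothetical; each stage is assumed to return True on success, and `verify` is assumed to return False when degradation is detected.

```python
from typing import Callable, Dict, List


def resolve_incident(incident: Dict, steps: Dict[str, Callable[[Dict], bool]]) -> List[str]:
    """Run the resolution flow and return the names of the stages that ran."""
    trail: List[str] = []

    def run(name: str) -> bool:
        trail.append(name)
        return steps[name](incident)

    ok = run("retrieve") and run("plan")
    # Approval is only requested for medium- and high-risk operations.
    if ok and incident.get("risk", "low") != "low":
        ok = run("approve")
    ok = ok and run("dry_run") and run("execute")
    # Roll back only when verification reports degradation.
    if ok and not run("verify"):
        run("rollback")
    # The ticket is updated however the flow ended.
    run("update_ticket")
    return trail
```

The returned trail doubles as a simple audit record of which stages were reached.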

Key Features

Intelligent Context:

  • Historical incident similarity search
  • Platform-aware template selection
  • Environment-specific preconditions

Autonomous Operation:

  • Auto-resolve for low-risk incidents
  • Approval workflows for production
  • 24/7 incident response capability

Complete Audit Trail:

  • All actions logged for compliance
  • Execution artifacts persisted
  • ServiceNow integration for tickets

Results

Operational Impact:

  • Reduced MTTR for common incidents
  • Automated handling of repetitive tasks
  • 24/7 autonomous response capability

Safety & Reliability:

  • Approval gates prevent unauthorized changes
  • Validation ensures remediation effectiveness
  • Rollback minimizes risk of failures

Governance:

  • Complete audit trail for compliance
  • RBAC integration with enterprise identity
  • Automated documentation generation

Technologies Used

AI/ML: watsonx.ai, watsonx Orchestrate, Milvus, FAISS

Automation: Ansible Automation Platform, Python

Monitoring: Instana APM

Integration: ServiceNow, Model Context Protocol (MCP)

Infrastructure: Docker, Kubernetes, RHEL, AIX, Windows