Incident response in Kubernetes environments follows a predictable pattern: an alert fires, an on-call engineer wakes up, spends 15 minutes understanding the context, runs through a mental (or written) runbook, executes remediation steps, and writes a postmortem. Most of these steps are automatable — yet most teams still handle them manually, losing precious minutes during every incident.
At Kustode, we built an incident response platform that automates the detection-to-resolution pipeline for our 12+ microservice healthcare platform. The system uses ML for incident classification and severity assessment, configurable runbooks for automated remediation, and intelligent escalation when automation isn't confident enough to act alone.
The platform is designed as a pipeline where each stage adds intelligence to raw alerts:
```
 Alert Sources                   Ingestion             ML Orchestrator
┌─────────────┐               ┌──────────────┐      ┌──────────────────┐
│ Prometheus  │───webhook────▶│  Normalizer  │─────▶│ Classification   │
│ AlertManager│               │ + Validator  │      │ Severity         │
│ Loki        │               └──────────────┘      │ Root Cause       │
│ Custom      │                                     │ Runbook Selection│
└─────────────┘                                     └────────┬─────────┘
                                                             │
                                                    ┌────────▼─────────┐
                                                    │  Runbook Engine  │
                                                    │  K8s actions     │
                                                    │  HTTP calls      │
                                                    │  Scripts         │
                                                    │  Notifications   │
                                                    │  Manual approval │
                                                    └────────┬─────────┘
                                                             │
                                                    ┌────────▼─────────┐
                                                    │   Integrations   │
                                                    │ Slack │ PagerDuty│
                                                    │ K8s  │ Prometheus│
                                                    └──────────────────┘
```
Alerts arrive from heterogeneous sources — Prometheus AlertManager webhooks, Loki log-based alerts, and custom application health checks. The ingestion layer normalizes them into a common schema.
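A minimal sketch of that normalization step, assuming the AlertManager webhook payload shape (the Incident field names here are illustrative, not Kustode's actual schema):

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class Incident:
    """Common schema every alert source is normalized into (illustrative fields)."""
    source: str                  # "alertmanager", "loki", "custom", ...
    service: str                 # affected service, e.g. "billing-api"
    title: str                   # human-readable summary
    labels: dict = field(default_factory=dict)  # original alert labels, kept for the ML features
    received_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

def normalize_alertmanager(payload: dict) -> list[Incident]:
    """Flatten one AlertManager webhook payload into Incident records."""
    incidents = []
    for alert in payload.get("alerts", []):
        labels = alert.get("labels", {})
        incidents.append(Incident(
            source="alertmanager",
            service=labels.get("service", "unknown"),
            title=alert.get("annotations", {}).get("summary", labels.get("alertname", "")),
            labels=labels,
        ))
    return incidents
```

Each source gets its own normalizer; everything downstream sees only the common record.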
Once normalized, incidents pass through an ML orchestrator that makes three decisions:
Classification — What type of incident is this? Categories include infrastructure (node failure, disk pressure), application (OOM kill, crash loop, high latency), data (pipeline failure, ingestion delay), and security (unauthorized access, certificate expiry). Classification uses an XGBoost model trained on historical incident data, with features extracted from the alert content, affected service, time of day, and recent deployment history.
Severity assessment — How urgent is this? Severity considers blast radius (how many services/tenants are affected), customer impact (is this blocking billing operations?), and historical escalation patterns (incidents like this one usually get escalated within 15 minutes).
Root cause hypothesis — What likely caused this? For well-known patterns (e.g., OOM kills after a deployment, high latency correlated with database connection pool exhaustion), the system generates a root cause hypothesis with confidence score. Low-confidence hypotheses are presented as suggestions rather than conclusions.
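Putting the three decisions together, the orchestrator's routing logic can be sketched as a confidence gate (the threshold value and all names here are illustrative):

```python
from dataclasses import dataclass

# Illustrative threshold: below it, the hypothesis is a suggestion, not a trigger
CONFIDENCE_THRESHOLD = 0.8

@dataclass
class Triage:
    classification: str   # e.g. "application.high_latency"
    severity: str         # e.g. "high"
    hypothesis: str       # root-cause guess, e.g. "connection pool exhaustion"
    confidence: float     # model confidence in the hypothesis, 0..1

def route(triage: Triage) -> str:
    """Run a runbook automatically only when the model is confident enough."""
    if triage.confidence >= CONFIDENCE_THRESHOLD:
        return f"auto-run runbook for {triage.classification}"
    # Low-confidence hypotheses are presented to humans as suggestions
    return f"escalate with suggestion: {triage.hypothesis}"
```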
The runbook engine is the remediation layer. Runbooks are defined as YAML configurations with typed steps:
```yaml
name: high-latency-database-connection-pool
trigger:
  classification: application.high_latency
  service_pattern: "claims-*|billing-*"
  condition: "metrics.p99_latency > 2000ms"
steps:
  - type: prometheus_query
    query: 'pg_stat_activity_count{service="{{ service }}"}'
    save_as: active_connections
  - type: conditional
    condition: "active_connections > pool_max * 0.9"
    if_true:
      - type: kubernetes
        action: rollout_restart
        target: "deployment/{{ service }}"
        namespace: "{{ namespace }}"
      - type: notification
        channel: slack
        message: "Restarted {{ service }} due to connection pool exhaustion"
    if_false:
      - type: notification
        channel: slack
        message: "High latency on {{ service }} but connection pool is healthy. Manual investigation needed."
        escalate: true
  - type: prometheus_query
    query: 'http_request_duration_seconds{service="{{ service }}", quantile="0.99"}'
    wait: 120s
    assert: "value < 1000"
    on_failure:
      - type: notification
        channel: pagerduty
        severity: high
```
Runbook steps support multiple action types: Kubernetes actions (such as rollout_restart), Prometheus queries with assertions, conditionals on previously gathered values, HTTP calls, scripts, notifications (Slack, PagerDuty), and manual approval gates.
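A simplified dispatcher for these typed steps might look like the following sketch. The handler set, helper names, and the one-comparison condition mini-language are all illustrative; a real engine would call the Kubernetes and Prometheus APIs and audit every action:

```python
import re

def render(template: str, ctx: dict) -> str:
    """Expand {{ var }} placeholders from the incident context."""
    return re.sub(r"\{\{\s*(\w+)\s*\}\}",
                  lambda m: str(ctx.get(m.group(1), m.group(0))), template)

def eval_condition(expr: str, ctx: dict) -> bool:
    """Evaluate a simple 'name OP number' condition against the context."""
    name, op, value = expr.split()
    left, right = float(ctx[name]), float(value)
    return left > right if op == ">" else left < right

def run_steps(steps: list[dict], ctx: dict, actions: list[str]) -> None:
    """Execute runbook steps in order, recording each action for the audit log."""
    for step in steps:
        kind = step["type"]
        if kind == "kubernetes":
            actions.append(f"k8s:{step['action']} {render(step['target'], ctx)}")
        elif kind == "notification":
            actions.append(f"notify:{step['channel']} {render(step['message'], ctx)}")
        elif kind == "conditional":
            branch = step["if_true"] if eval_condition(step["condition"], ctx) else step["if_false"]
            run_steps(branch, ctx, actions)
```

Keeping every handler behind a single dispatch point is what makes the audit trail and the manual-approval gate straightforward to enforce.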
Every automated response feeds back into the ML models. When a runbook successfully resolves an incident, that pattern is reinforced. When automation fails and a human intervenes, the system learns from the human's actions to improve future recommendations. Postmortem data (tagged by the on-call engineer) provides ground truth for retraining classification and severity models.
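One way to capture that feedback loop is to log each incident's outcome as a labeled training example; the record shape and field names below are illustrative:

```python
import json
from datetime import datetime, timezone

def record_outcome(incident_id: str, predicted: str, final_label: str,
                   auto_resolved: bool, path: str = "training_log.jsonl") -> dict:
    """Append one ground-truth record for the next retraining run.

    final_label comes from the on-call engineer's postmortem tags and
    corrects the model's prediction when the two disagree.
    """
    record = {
        "incident_id": incident_id,
        "predicted_classification": predicted,
        "final_classification": final_label,
        "auto_resolved": auto_resolved,
        "recorded_at": datetime.now(timezone.utc).isoformat(),
    }
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")
    return record
```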
The incident response platform doesn't operate in isolation. At Kustode, it's paired with Vigil — our continuous testing and chaos engineering framework — which proactively creates incidents through controlled fault injection.
This integration means we're constantly exercising the incident response pipeline. Every chaos test validates that alerts fire correctly, classification is accurate, runbooks execute properly, and notifications reach the right people.
Pure runbook automation (without ML classification) requires exact-match alert rules: "if alert X fires, run runbook Y." This breaks down quickly in a microservices environment, where the same underlying failure surfaces as differently shaped alerts across services.
ML classification provides the soft matching and contextual awareness that rigid rules can't. The runbook engine provides the deterministic, auditable remediation that pure ML can't guarantee. Together, they cover the spectrum from "this looks like X" (ML) to "if X, do these exact steps" (runbooks).
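The trigger-matching half of that pairing, as seen in the runbook's service_pattern field, can be sketched with glob patterns plus the ML classification (a simplification of the real matcher):

```python
from fnmatch import fnmatch

def matches_trigger(trigger: dict, classification: str, service: str) -> bool:
    """A runbook is a candidate when the ML classification matches exactly and
    the service matches any of the '|'-separated glob patterns."""
    if trigger["classification"] != classification:
        return False
    return any(fnmatch(service, pat) for pat in trigger["service_pattern"].split("|"))
```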
The platform is approximately 85% complete and demo-ready. The core incident CRUD, webhook ingestion, runbook execution, Kubernetes integration, and Slack notifications are fully functional. ML classification is operational with optional escalation when confidence is low.
In practice, the system handles the routine incidents automatically (pod restarts, connection pool exhaustion, disk pressure remediation) while escalating novel failures to humans with rich diagnostic context. The goal isn't to eliminate on-call — it's to ensure that when a human is paged, they have all the context they need and the routine remediation has already been attempted.