Incident response in Kubernetes environments follows a predictable pattern: an alert fires, an on-call engineer wakes up, spends 15 minutes understanding the context, runs through a mental (or written) runbook, executes remediation steps, and writes a postmortem. Most of these steps are automatable — yet most teams still handle them manually, losing precious minutes during every incident.
At Kustode, we built an incident response platform that automates the detection-to-resolution pipeline for our 12+ microservice healthcare platform. The system uses ML for incident classification and severity assessment, configurable runbooks for automated remediation, and intelligent escalation when automation isn't confident enough to act alone.
The platform is designed as a pipeline where each stage adds intelligence to raw alerts:
```
 Alert Sources                   Ingestion             ML Orchestrator
┌─────────────┐               ┌──────────────┐      ┌──────────────────┐
│ Prometheus  │───webhook────▶│  Normalizer  │─────▶│ Classification   │
│ AlertManager│               │ + Validator  │      │ Severity         │
│ Loki        │               └──────────────┘      │ Root Cause       │
│ Custom      │                                     │ Runbook Selection│
└─────────────┘                                     └────────┬─────────┘
                                                             │
                                                    ┌────────▼─────────┐
                                                    │  Runbook Engine  │
                                                    │  K8s actions     │
                                                    │  HTTP calls      │
                                                    │  Scripts         │
                                                    │  Notifications   │
                                                    │  Manual approval │
                                                    └────────┬─────────┘
                                                             │
                                                    ┌────────▼─────────┐
                                                    │   Integrations   │
                                                    │ Slack │ PagerDuty│
                                                    │ K8s  │ Prometheus│
                                                    └──────────────────┘
```
Alerts arrive from heterogeneous sources — Prometheus AlertManager webhooks, Loki log-based alerts, and custom application health checks. The ingestion layer normalizes them into a common schema.
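A minimal sketch of that normalization step, assuming the AlertManager webhook payload shape (the Incident field names here are illustrative, not Kustode's actual schema):

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class Incident:
    """Common schema every alert source is normalized into (illustrative fields)."""
    source: str                  # "alertmanager", "loki", "custom", ...
    service: str                 # affected service, e.g. "billing-api"
    title: str                   # human-readable summary
    labels: dict = field(default_factory=dict)  # original alert labels, kept for the ML features
    received_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

def normalize_alertmanager(payload: dict) -> list[Incident]:
    """Flatten one AlertManager webhook payload into Incident records."""
    incidents = []
    for alert in payload.get("alerts", []):
        labels = alert.get("labels", {})
        incidents.append(Incident(
            source="alertmanager",
            service=labels.get("service", "unknown"),
            title=alert.get("annotations", {}).get("summary", labels.get("alertname", "")),
            labels=labels,
        ))
    return incidents
```

Each source gets its own normalizer; everything downstream sees only the common record.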
Once normalized, incidents pass through an ML orchestrator that makes three decisions:
Classification — What type of incident is this? Categories include infrastructure (node failure, disk pressure), application (OOM kill, crash loop, high latency), data (pipeline failure, ingestion delay), and security (unauthorized access, certificate expiry). Classification uses an XGBoost model trained on historical incident data, with features extracted from the alert content, affected service, time of day, and recent deployment history.
Severity assessment — How urgent is this? Severity considers blast radius (how many services/tenants are affected), customer impact (is this blocking billing operations?), and historical escalation patterns (incidents like this one usually get escalated within 15 minutes).
Root cause hypothesis — What likely caused this? For well-known patterns (e.g., OOM kills after a deployment, high latency correlated with database connection pool exhaustion), the system generates a root cause hypothesis with confidence score. Low-confidence hypotheses are presented as suggestions rather than conclusions.
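Putting the three decisions together, the orchestrator's routing logic can be sketched as a confidence gate (the threshold value and all names here are illustrative):

```python
from dataclasses import dataclass

# Illustrative threshold: below it, the hypothesis is a suggestion, not a trigger
CONFIDENCE_THRESHOLD = 0.8

@dataclass
class Triage:
    classification: str   # e.g. "application.high_latency"
    severity: str         # e.g. "high"
    hypothesis: str       # root-cause guess, e.g. "connection pool exhaustion"
    confidence: float     # model confidence in the hypothesis, 0..1

def route(triage: Triage) -> str:
    """Run a runbook automatically only when the model is confident enough."""
    if triage.confidence >= CONFIDENCE_THRESHOLD:
        return f"auto-run runbook for {triage.classification}"
    # Low-confidence hypotheses are presented to humans as suggestions
    return f"escalate with suggestion: {triage.hypothesis}"
```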
The runbook engine is the remediation layer. Runbooks are defined as YAML configurations with typed steps:
```yaml
name: high-latency-database-connection-pool
trigger:
  classification: application.high_latency
  service_pattern: "claims-*|billing-*"
  condition: "metrics.p99_latency > 2000ms"
steps:
  - type: prometheus_query
    query: 'pg_stat_activity_count{service="{{ service }}"}'
    save_as: active_connections
  - type: conditional
    condition: "active_connections > pool_max * 0.9"
    if_true:
      - type: kubernetes
        action: rollout_restart
        target: "deployment/{{ service }}"
        namespace: "{{ namespace }}"
      - type: notification
        channel: slack
        message: "Restarted {{ service }} due to connection pool exhaustion"
    if_false:
      - type: notification
        channel: slack
        message: "High latency on {{ service }} but connection pool is healthy. Manual investigation needed."
        escalate: true
  - type: prometheus_query
    query: 'http_request_duration_seconds{service="{{ service }}", quantile="0.99"}'
    wait: 120s
    assert: "value < 1000"
    on_failure:
      - type: notification
        channel: pagerduty
        severity: high
```
Runbook steps support multiple action types: Kubernetes actions (such as rollout_restart), Prometheus queries with assertions, conditionals on previously gathered values, HTTP calls, scripts, notifications (Slack, PagerDuty), and manual approval gates.
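A simplified dispatcher for these typed steps might look like the following sketch. The handler set, helper names, and the one-comparison condition mini-language are all illustrative; a real engine would call the Kubernetes and Prometheus APIs and audit every action:

```python
import re

def render(template: str, ctx: dict) -> str:
    """Expand {{ var }} placeholders from the incident context."""
    return re.sub(r"\{\{\s*(\w+)\s*\}\}",
                  lambda m: str(ctx.get(m.group(1), m.group(0))), template)

def eval_condition(expr: str, ctx: dict) -> bool:
    """Evaluate a simple 'name OP number' condition against the context."""
    name, op, value = expr.split()
    left, right = float(ctx[name]), float(value)
    return left > right if op == ">" else left < right

def run_steps(steps: list[dict], ctx: dict, actions: list[str]) -> None:
    """Execute runbook steps in order, recording each action for the audit log."""
    for step in steps:
        kind = step["type"]
        if kind == "kubernetes":
            actions.append(f"k8s:{step['action']} {render(step['target'], ctx)}")
        elif kind == "notification":
            actions.append(f"notify:{step['channel']} {render(step['message'], ctx)}")
        elif kind == "conditional":
            branch = step["if_true"] if eval_condition(step["condition"], ctx) else step["if_false"]
            run_steps(branch, ctx, actions)
```

Keeping every handler behind a single dispatch point is what makes the audit trail and the manual-approval gate straightforward to enforce.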
Every automated response feeds back into the ML models. When a runbook successfully resolves an incident, that pattern is reinforced. When automation fails and a human intervenes, the system learns from the human's actions to improve future recommendations. Postmortem data (tagged by the on-call engineer) provides ground truth for retraining classification and severity models.
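One way to capture that feedback loop is to log each incident's outcome as a labeled training example; the record shape and field names below are illustrative:

```python
import json
from datetime import datetime, timezone

def record_outcome(incident_id: str, predicted: str, final_label: str,
                   auto_resolved: bool, path: str = "training_log.jsonl") -> dict:
    """Append one ground-truth record for the next retraining run.

    final_label comes from the on-call engineer's postmortem tags and
    corrects the model's prediction when the two disagree.
    """
    record = {
        "incident_id": incident_id,
        "predicted_classification": predicted,
        "final_classification": final_label,
        "auto_resolved": auto_resolved,
        "recorded_at": datetime.now(timezone.utc).isoformat(),
    }
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")
    return record
```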
The incident response platform doesn't operate in isolation. At Kustode, it's paired with Vigil — our continuous testing and chaos engineering framework — which proactively creates incidents through controlled fault injection.
This integration means we're constantly exercising the incident response pipeline. Every chaos test validates that alerts fire correctly, classification is accurate, runbooks execute properly, and notifications reach the right people.
Pure runbook automation (without ML classification) requires exact-match alert rules: "if alert X fires, run runbook Y." This breaks down quickly in a microservices environment, where the same underlying failure surfaces as differently shaped alerts across services.
ML classification provides the soft matching and contextual awareness that rigid rules can't. The runbook engine provides the deterministic, auditable remediation that pure ML can't guarantee. Together, they cover the spectrum from "this looks like X" (ML) to "if X, do these exact steps" (runbooks).
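The trigger-matching half of that pairing, as seen in the runbook's service_pattern field, can be sketched with glob patterns plus the ML classification (a simplification of the real matcher):

```python
from fnmatch import fnmatch

def matches_trigger(trigger: dict, classification: str, service: str) -> bool:
    """A runbook is a candidate when the ML classification matches exactly and
    the service matches any of the '|'-separated glob patterns."""
    if trigger["classification"] != classification:
        return False
    return any(fnmatch(service, pat) for pat in trigger["service_pattern"].split("|"))
```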
The platform is approximately 85% complete and demo-ready. The core incident CRUD, webhook ingestion, runbook execution, Kubernetes integration, and Slack notifications are fully functional. ML classification is operational with optional escalation when confidence is low.
In practice, the system handles the routine incidents automatically (pod restarts, connection pool exhaustion, disk pressure remediation) while escalating novel failures to humans with rich diagnostic context. The goal isn't to eliminate on-call — it's to ensure that when a human is paged, they have all the context they need and the routine remediation has already been attempted.