Healthcare AI agents are making clinical decisions, processing insurance claims, and handling patient data — but the security infrastructure hasn't kept up. Research shows a 94.4% prompt injection success rate against medical LLMs (Yoo et al., JAMA 2025), and 259 million individuals were affected by healthcare data breaches in 2024. Static keyword filters aren't enough for agentic AI with tools, memory, and multi-step reasoning.
AEGIS (Adversarial Enforcement and Guardrail Interception System) is the security proxy I built to address this gap: a real-time interceptor that sits between users and healthcare AI agents, enforcing HIPAA-aligned security policies with sub-100ms latency.
While building SENTINEL (our MCP reasoning auditor at Kustode), security benchmarking revealed a 52.3% red-tier security probe rate — meaning more than half of adversarial inputs were bypassing basic safeguards. Three attack categories dominated: prompt injection (direct and indirect), jailbreaks, and PHI-extraction attempts.
Existing defenses fail because they operate at the wrong layer. Content filters catch keywords but miss semantic attacks. Single-model wrappers are vulnerable to the same attacks as the models they protect. AEGIS takes a different approach: defense-in-depth with three independent layers, each using a different detection methodology.
AEGIS operates as a reverse proxy, intercepting all traffic between users and healthcare AI agents:
```
User/Client ──→ [ AEGIS PROXY ] ──→ AI Agent
                       │
   ┌────────────────┬────────────────┬────────────────┐
   │ L1: Input      │ L2: Semantic   │ L3: Output     │
   │ Classifier     │ Policy Engine  │ Sanitizer      │
   │ (ONNX)         │ (Hardened LLM) │ (NER+Regex)    │
   └────────────────┴────────────────┴────────────────┘
                       │
          Audit Log + Metrics + Alerts
```
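The three layers behave as an interceptor chain: each stage can allow, block, or hand off to the next, slower stage. A minimal Go sketch of that chain — the `Layer`/`Verdict` names and the trivial stand-in classifier are illustrative, not AEGIS's actual API:

```go
package main

import "fmt"

// Verdict is the outcome a layer assigns to a request or response.
type Verdict int

const (
	Allow    Verdict = iota
	Block            // reject immediately
	Escalate         // hand off to the next, slower layer
)

// Layer is one stage of the defense-in-depth chain.
type Layer interface {
	Name() string
	Inspect(payload string) Verdict
}

// emptyBodyLayer is a trivial stand-in for L1's ONNX classifier:
// it blocks empty payloads and allows everything else.
type emptyBodyLayer struct{}

func (emptyBodyLayer) Name() string { return "L1-classifier" }
func (emptyBodyLayer) Inspect(p string) Verdict {
	if len(p) == 0 {
		return Block
	}
	return Allow
}

// runChain short-circuits on the first Block verdict, returning
// which layer blocked the payload.
func runChain(layers []Layer, payload string) (string, Verdict) {
	for _, l := range layers {
		if v := l.Inspect(payload); v == Block {
			return l.Name(), Block
		}
	}
	return "", Allow
}

func main() {
	layers := []Layer{emptyBodyLayer{}}
	if name, v := runChain(layers, ""); v == Block {
		fmt.Println("blocked by", name)
	}
	if _, v := runChain(layers, "what is the usual metformin dosage?"); v == Allow {
		fmt.Println("forwarded to agent")
	}
}
```

The short-circuit ordering matters for latency: the cheap ONNX classifier runs on every request, while the expensive LLM judge only runs on escalation.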
Every incoming request passes through a DistilBERT model fine-tuned for five-class classification: benign, direct injection, indirect injection, jailbreak, and PHI extraction attempts. The model was trained on 21,643 samples from Tensor Trust, HackAPrompt, deepset/prompt-injections, and synthetic healthcare-specific attack data. Medical benign samples come from MedQA-USMLE (12,723 questions) and PubMedQA (1,000 questions).
The model is exported to ONNX format and runs in the Go proxy via ONNX Runtime bindings, achieving 45.4ms p50 latency — well under the 100ms budget. Critically, the model produces calibrated confidence scores (ECE: 0.0038), enabling confidence-gated routing:
When Layer 1 confidence falls in the ambiguous range, Layer 2 activates a hardened LLM (OpenAI, Anthropic, or Ollama) with a locked system prompt, no tool access, and no memory. This LLM performs semantic policy checks — understanding the intent behind a request, not just its surface features. It answers: "Is this request consistent with the agent's authorized scope and the organization's security policies?"
The locked configuration prevents the auditor itself from being compromised. It has no ability to call tools, access external data, or remember previous interactions — it's a stateless, single-purpose judge.
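Confidence-gated routing reduces to a small decision function over Layer 1's output. A sketch of that logic — the 0.90 cutoff is an assumed, illustrative threshold; the post doesn't publish AEGIS's actual values:

```go
package main

import "fmt"

// route decides what to do with a request given L1's predicted class
// and its calibrated confidence. The 0.90 threshold is illustrative;
// AEGIS's real cutoffs are tuned against the validation split.
func route(predictedClass string, confidence float64) string {
	switch {
	case predictedClass != "benign" && confidence >= 0.90:
		return "block" // high-confidence attack: reject immediately
	case predictedClass == "benign" && confidence >= 0.90:
		return "forward" // high-confidence benign: pass to the agent
	default:
		return "escalate-to-L2" // ambiguous: ask the hardened LLM judge
	}
}

func main() {
	fmt.Println(route("jailbreak", 0.97)) // block
	fmt.Println(route("benign", 0.95))    // forward
	fmt.Println(route("benign", 0.62))    // escalate-to-L2
}
```

This is why calibration matters: if the classifier's confidence scores were overconfident, the ambiguous band would be crossed too rarely and genuine edge cases would never reach the Layer 2 judge.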
After the AI agent generates a response, Layer 3 scans the output for PHI leakage across the 18 HIPAA identifier types and six exfiltration channels.
This is intentionally the simplest layer — fast regex patterns that catch what the ML model and LLM auditor might miss. Defense-in-depth means each layer compensates for the others' blind spots.
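A sketch of what this layer looks like in Go, covering a small illustrative subset of the 18 identifier types — the patterns here are simplified stand-ins, not AEGIS's production rules:

```go
package main

import (
	"fmt"
	"regexp"
	"sort"
)

// A small, illustrative subset of the 18 HIPAA identifier types.
// Production patterns would be far more exhaustive and locale-aware.
var phiPatterns = map[string]*regexp.Regexp{
	"ssn":   regexp.MustCompile(`\b\d{3}-\d{2}-\d{4}\b`),
	"phone": regexp.MustCompile(`\b\d{3}[-.]\d{3}[-.]\d{4}\b`),
	"email": regexp.MustCompile(`\b[\w.+-]+@[\w-]+\.[\w.]+\b`),
	"mrn":   regexp.MustCompile(`\bMRN[:\s]*\d{6,10}\b`),
}

// scanPHI returns the identifier types detected in an agent response,
// sorted for deterministic output.
func scanPHI(output string) []string {
	var hits []string
	for name, re := range phiPatterns {
		if re.MatchString(output) {
			hits = append(hits, name)
		}
	}
	sort.Strings(hits)
	return hits
}

func main() {
	fmt.Println(scanPHI("Patient MRN: 00482913, call 555-867-5309."))
	// → [mrn phone]
}
```

Compiled RE2 patterns like these run in linear time over the response, which is what keeps the simplest layer also the cheapest one.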
We evaluated AEGIS across all three layers using held-out test data (the 10% test split of the full corpus) plus additional adversarial red-teaming:
| Metric | Result | Target |
|---|---|---|
| Attack Success Rate (ASR) | 0.25% | ≤ 10% |
| Benign False Positive Rate | 0.30% | < 1% |
| PHI Leak Rate | 0.00% | < 1% |
| L1 Latency (p50) | 45.4ms | < 100ms |
| Expected Calibration Error | 0.0038 | < 0.05 |
| Overall Accuracy | 99.49% | ≥ 95% |
The 0.25% ASR means that out of 400 adversarial test inputs, exactly one bypassed all three layers. The 0.30% benign FPR means legitimate clinical queries almost never get incorrectly blocked — critical for a system sitting in the hot path of patient care workflows.
The ML model was trained in Python (PyTorch + HuggingFace Transformers), but inference runs in Go. This isn't an arbitrary technology choice — it's driven by the proxy use case: Go compiles to a single static binary, keeps tail latency predictable without a Python runtime in the request path, and handles many concurrent connections cheaply.
The classifier needed to understand both adversarial AI attacks and legitimate medical language. A model trained only on security datasets would flag clinical terminology as suspicious. A model trained only on medical data would miss subtle injection patterns.
The training corpus combines adversarial samples from Tensor Trust, HackAPrompt, and deepset/prompt-injections (plus synthetic healthcare-specific attacks) with benign medical queries from MedQA-USMLE and PubMedQA.
The 80/10/10 train/val/test split ensures the model generalizes across both domains. Fine-tuning DistilBERT (rather than training from scratch) preserves the model's general language understanding while learning the attack/benign boundary specific to healthcare.
AEGIS is the security complement to SENTINEL. Where SENTINEL audits reasoning quality (is the agent's evidence actually reliable?), AEGIS enforces security boundaries (is the request trying to compromise the agent?). Together, they form two layers of a defense-in-depth strategy for production healthcare AI.
The current AEGIS implementation demonstrates that real-time, multi-layer AI security is feasible within strict latency budgets, and further extensions are planned.
The core insight: healthcare AI security requires purpose-built infrastructure, not bolted-on filters. When AI agents are making decisions about patient care and billing, the security proxy needs to be as rigorously engineered as the agents themselves.