Healthcare AI agents are making clinical decisions, processing insurance claims, and handling patient data — but the security infrastructure hasn't kept up. Research shows a 94.4% prompt injection success rate against medical LLMs (Yoo et al., JAMA 2025), and 259 million individuals were affected by healthcare data breaches in 2024. Static keyword filters aren't enough for agentic AI with tools, memory, and multi-step reasoning.

AEGIS (Adversarial Enforcement and Guardrail Interception System) is the security proxy I built to address this gap: a real-time interceptor that sits between users and healthcare AI agents, enforcing HIPAA-aligned security policies with sub-100ms latency.

The Problem: Why Healthcare AI Agents Need a New Security Model

While building SENTINEL (our MCP reasoning auditor at Kustode), security benchmarking revealed a 52.3% red-tier security probe rate — meaning more than half of adversarial inputs were bypassing basic safeguards. Three categories of attack dominate:

  • Prompt injection — Direct, indirect, and jailbreak attacks that hijack agent behavior. An attacker can embed instructions in a patient referral document that cause the AI to override its safety guidelines.
  • PHI exfiltration — Adversarial prompts that trick agents into including Protected Health Information in outputs, URLs, markdown links, base64-encoded text, or tool arguments.
  • Policy violation — Agents acting outside their authorized scope: a billing agent providing clinical advice, or an eligibility checker accessing claims data it shouldn't see.

Existing defenses fail because they operate at the wrong layer. Content filters catch keywords but miss semantic attacks. Single-model wrappers are vulnerable to the same attacks as the models they protect. AEGIS takes a different approach: defense-in-depth with three independent layers, each using a different detection methodology.

Architecture: Three Layers of Defense

AEGIS operates as a reverse proxy, intercepting all traffic between users and healthcare AI agents:

User/Client → [AEGIS PROXY] → AI Agent
                    │
      ┌─────────────┼───────────────┐
      │  L1: Input   │  L2: Semantic  │  L3: Output   │
      │  Classifier  │  Policy Engine │  Sanitizer     │
      │  (ONNX)      │  (Hardened LLM)│  (NER+Regex)   │
      └──────────────┴────────────────┴────────────────┘
                    │
              Audit Log + Metrics + Alerts

Layer 1: ML-Based Input Classification (Go + ONNX Runtime)

Every incoming request passes through a DistilBERT model fine-tuned for five-class classification: benign, direct injection, indirect injection, jailbreak, and PHI extraction attempts. The model was trained on 21,643 samples from Tensor Trust, HackAPrompt, deepset/prompt-injections, and synthetic healthcare-specific attack data. Medical benign samples come from MedQA-USMLE (12,723 questions) and PubMedQA (1,000 questions).

The model is exported to ONNX format and runs in the Go proxy via ONNX Runtime bindings, achieving 45.4ms p50 latency — well under the 100ms budget. Critically, the model produces calibrated confidence scores (ECE: 0.0038), enabling confidence-gated routing:

  • ≥ 0.85 confidence → Auto-proceed (logged for audit)
  • 0.60–0.85 → Hold-and-notify (30-second human review window)
  • < 0.60 → Block and escalate immediately
  • PHI-related operations → All thresholds multiplied by 1.5×

Layer 2: Semantic Policy Auditor (Hardened LLM)

When Layer 1 confidence falls in the ambiguous range, Layer 2 activates a hardened LLM (OpenAI, Anthropic, or Ollama) with a locked system prompt, no tool access, and no memory. This LLM performs semantic policy checks — understanding the intent behind a request, not just its surface features. It answers: "Is this request consistent with the agent's authorized scope and the organization's security policies?"

The locked configuration prevents the auditor itself from being compromised. It has no ability to call tools, access external data, or remember previous interactions — it's a stateless, single-purpose judge.

Layer 3: Output Sanitizer (Regex + Pattern Detection)

After the AI agent generates a response, Layer 3 scans the output for PHI leakage across 18 HIPAA identifier types and six exfiltration channels:

  • Direct text inclusion (SSNs, MRNs, dates of birth)
  • URL parameter encoding
  • Markdown link embedding
  • Base64-encoded payloads
  • Code block hiding
  • Tool argument smuggling

This is intentionally the simplest layer — fast regex patterns that catch what the ML model and LLM auditor might miss. Defense-in-depth means each layer compensates for the others' blind spots.

Evaluation Results

We evaluated AEGIS across all three layers using held-out test data (10% split from the training corpus) plus additional adversarial red-teaming:

Metric Result Target
Attack Success Rate (ASR) 0.25% ≤ 10%
Benign False Positive Rate 0.30% < 1%
PHI Leak Rate 0.00% < 1%
L1 Latency (p50) 45.4ms < 100ms
Expected Calibration Error 0.0038 < 0.05
Overall Accuracy 99.49% ≥ 95%

The 0.25% ASR means that out of 400 adversarial test inputs, exactly one bypassed all three layers. The 0.30% benign FPR means legitimate clinical queries almost never get incorrectly blocked — critical for a system sitting in the hot path of patient care workflows.

Why Go + ONNX Instead of Python?

The ML model was trained in Python (PyTorch + HuggingFace Transformers), but inference runs in Go. This isn't an arbitrary technology choice — it's driven by the proxy use case:

  • Latency budget — The proxy sits in the request path. Every millisecond of overhead is felt by the clinician or billing specialist waiting for an AI response. Go's goroutine model handles concurrent requests without the GIL bottleneck.
  • Deployment simplicity — A single static binary with the ONNX model embedded. No Python runtime, no dependency management, no conda environments on production nodes.
  • Memory efficiency — ONNX Runtime in Go uses significantly less memory per request than equivalent PyTorch inference, enabling deployment on smaller nodes alongside the agents it protects.

Training Data: Balancing Healthcare and Security Domains

The classifier needed to understand both adversarial AI attacks and legitimate medical language. A model trained only on security datasets would flag clinical terminology as suspicious. A model trained only on medical data would miss subtle injection patterns.

The training corpus combines:

  • Attack data (8,920 samples): Tensor Trust, HackAPrompt, deepset/prompt-injections, jailbreak-classification, ChatGPT-Jailbreak-Prompts, plus synthetic indirect injection and PHI extraction attempts
  • Benign medical data (12,723 samples): MedQA-USMLE questions covering clinical reasoning, plus 1,000 PubMedQA biomedical questions

The 80/10/10 train/val/test split ensures the model generalizes across both domains. Fine-tuning DistilBERT (rather than training from scratch) preserves the model's general language understanding while learning the attack/benign boundary specific to healthcare.

From SENTINEL to AEGIS: Defense-in-Depth for AI Agents

AEGIS is the security complement to SENTINEL. Where SENTINEL audits reasoning quality (is the agent's evidence actually reliable?), AEGIS enforces security boundaries (is the request trying to compromise the agent?). Together, they form two layers of a defense-in-depth strategy for production healthcare AI:

  • AEGIS → Stops malicious inputs from reaching the agent
  • SENTINEL → Audits the agent's reasoning before actions execute
  • Kustode platform → Multi-tenant isolation and RBAC at the data layer

What's Next

The current AEGIS implementation demonstrates that real-time, multi-layer AI security is feasible within strict latency budgets. Planned extensions include:

  • Continuous learning from production traffic (adversarial examples that reach Layer 2 feed back into Layer 1 retraining)
  • Integration with SENTINEL's reasoning audit trail for end-to-end security + quality observability
  • Open-sourcing the ONNX model and Go proxy for the healthcare AI community

The core insight: healthcare AI security requires purpose-built infrastructure, not bolted-on filters. When AI agents are making decisions about patient care and billing, the security proxy needs to be as rigorously engineered as the agents themselves.