Introduce

Researching Safe, Agentic AI for Regulated & Unregulated Industries| From applied ML research to systems that ship safely at scale

Applied AI Researcher & Engineer • Agentic AI, ML & Safe Agents • Regulated & Unregulated Industries

I research how agentic AI and ML can operate safely and verifiably across both regulated and unregulated industries — and translate that research into production systems that are reliable, observable, and compliant by design.

View Projects Contact Me

8+

Years of
Experience

30+

projects completed on
different technologies

About

Summary

I'm an applied AI researcher and engineer (MSc, Data Science) focused on agentic AI, ML, and safe agents for both regulated and unregulated industries. At Kustode, I turn that research into production: a multi-tenant platform that serves as my testbed for studying how autonomous agents can act safely, verifiably, and within strict compliance constraints. With 8+ years of hands-on production experience across ML platforms, GPU clusters, and distributed systems, I combine rigorous engineering with research into reliable, safe, and observable AI.

Currently open to opportunities. I'm looking for a role where I can fully apply and build — agentic AI & ML platforms for regulated and unregulated industries, safe-agent systems, and the reliability and observability foundations they run on. Let's talk →

What I Do

I architect cloud-native platforms that balance reliability, performance, and cost. From orchestrating Kubernetes at scale to building ML training infrastructure and observability stacks, I focus on systems that grow with your business—without ballooning your cloud bill or waking your on-call.

Technical DNA

Cloud: AWS, GCP, deep serverless expertise
Orchestration: Kubernetes, Docker Swarm at scale
IaC: Terraform, automated deployments that eliminate human error
Languages: Python, Golang for robust automation and tooling
DevOps: Jenkins, GitOps, reliable CI/CD pipelines
Data & ML Infra: Platforms for complex ML workflows

Beyond the Code

Proactive observability that prevents user impact
30–50% cloud cost reductions via intelligent resource management
Self-service platforms that enable developer velocity
Active open-source contribution and community leadership

Experience

Roles & Impact

May 2025 – May 2026 · 1 yr 1 mo

Applied AI Researcher & Platform Engineer

Kustode

Researching and building agentic AI, ML, and safe-agent systems for regulated and unregulated industries — on a production-grade, multi-tenant platform.

Distributed Systems & Architecture

Designed event-driven microservices architecture with 12+ services, async messaging (SNS/SQS patterns), and clean API contracts.
Implemented multi-tenant isolation at database layer using PostgreSQL with row-level security and foreign key constraints — zero cross-tenant data leakage.
Built real-time data pipelines processing external EDI transactions with webhook ingestion, idempotent processing, and automatic state reconciliation.
Designed resilient integration patterns: retry with exponential backoff, circuit breakers, dead-letter queues, and graceful degradation.

Observability & SRE

Instrumented end-to-end distributed tracing with OpenTelemetry — trace context propagation across HTTP, async tasks, and database calls.
Built Grafana dashboards with Prometheus metrics: service health, latency histograms, error budgets, and business KPIs.
Established SLO framework (99.9% availability, p95 latency targets) with automated alerting and on-call runbooks.
Reduced MTTR by 60% through structured logging (structlog), correlation IDs, and queryable traces in Jaeger.

Security Engineering

Implemented zero-trust auth: JWT with refresh token rotation, RBAC with permission-based access control, and session management.
Built defense-in-depth: WAF rules, rate limiting, input validation, SQL injection prevention, and audit logging.
Automated security scanning in CI/CD: static analysis (Bandit), secret detection (Gitleaks), dependency audits.
Designed secrets management with Ansible Vault — no plaintext credentials in repos or containers.

Infrastructure & Platform

Provisioned AWS infrastructure (EC2, RDS, S3, CloudWatch, SNS) with Ansible playbooks for reproducible multi-environment deployments.
Built CI/CD pipelines: GitHub Actions with parallel testing, lint gates, security checks, and blue-green deployments.
Achieved 40% p95 latency reduction via query optimization, connection pooling (asyncpg), and N+1 elimination.
Delivered 25% infrastructure cost savings through right-sizing, reserved capacity, and efficient resource utilization.

Stack: Python/FastAPI, React/Next.js, PostgreSQL, Redis, AWS, Docker, Nginx, Ansible, OpenTelemetry, Grafana, Prometheus, Jaeger

May 2023 – Jun 2024 · 1 yr 2 mos

Research Software Engineer

EcoHealth Alliance

Built ML infrastructure and GPU compute platforms for research workloads.

ML & GPU Infrastructure

Deployed and optimized LLMs on GPU clusters — improved training throughput by 35% through batch tuning and mixed-precision training.
Built model serving infrastructure with auto-scaling, health checks, and A/B deployment capabilities.
Implemented GPU resource scheduling with fair-share policies and utilization monitoring.

Cloud & Automation

Architected AWS serverless platform (Lambda, Step Functions, S3) — reduced operational costs by 30%.
Automated infrastructure with Terraform + GitOps — cut provisioning time from days to minutes.
Built containerized HPC workflows on Kubernetes — improved reproducibility and reduced analysis bottlenecks by 50%.

Stack: AWS (Lambda, EC2, S3, SageMaker), Kubernetes, Docker, Terraform, Python, PyTorch, CUDA

May 2022 – Jun 2023 · 1 yr 2 mos

Site Reliability Engineer

Sportserve

Owned reliability for high-throughput streaming infrastructure processing millions of events/day on GCP.

Reliability & Performance

Maintained 99.9% uptime and 99.95% data integrity across distributed streaming services handling peak loads of 50K+ events/sec.
Optimized BigQuery pipelines and Pub/Sub consumers — reduced query costs by 30% and p99 latency by 40%.
Implemented capacity planning models using historical traffic patterns to prevent over-provisioning.

Observability & Incident Response

Built unified observability stack: GCP Monitoring, Prometheus, OpenTelemetry with custom dashboards for real-time anomaly detection.
Led incident response and blameless postmortems — reduced repeat incidents by 50% through systematic action items.
Implemented SLO-based alerting with error budgets — cut alert noise by 60% while improving signal quality.

Stack: GCP (BigQuery, Pub/Sub, GKE, Cloud Functions), Kubernetes, Prometheus, Grafana, OpenTelemetry, Python, Go

Aug 2021 – May 2022 · 10 mos

DevOps Tech Lead

Sarami

Led platform modernization from monolith to microservices.

Platform Modernization

Led monolith-to-microservices migration on AWS — reduced deployment time by 60% and improved fault isolation.
Built GitOps CI/CD pipeline (GitLab CI + Terraform + Ansible) — accelerated releases from weekly to daily deployments.
Implemented infrastructure as code with versioned, peer-reviewed changes and automated rollbacks.

Observability

Deployed Prometheus + Grafana stack with custom alerting rules — reduced MTTR by 40%.
Implemented centralized logging (ELK) with structured formats and correlation IDs.

Stack: AWS (ECS, RDS, ElastiCache), Docker, Terraform, Ansible, GitLab CI, Prometheus, Grafana, ELK

Aug 2019 – May 2022 · 2 yrs 10 mos

System Engineer / Data Engineer

ICIPE — International Centre of Insect Physiology and Ecology

Built and operated HPC infrastructure and data pipelines for compute-intensive scientific workloads.

HPC & Compute Infrastructure

Architected hybrid HPC platform (AWS + on-prem) with Slurm scheduler — scaled to 500+ concurrent jobs.
Optimized batch processing pipelines — reduced workflow execution time by 50% through parallelization and resource tuning.
Implemented container orchestration (Docker, Singularity, Kubernetes) for reproducible scientific workflows.

CI/CD & Automation

Built CI/CD pipelines (Jenkins, CircleCI) with automated testing, container builds, and deployment gates.
Implemented Prometheus/Grafana monitoring — achieved 90%+ uptime with proactive alerting.
Automated infrastructure provisioning with Terraform — reduced setup time from weeks to hours.

Data Engineering

Built ETL pipelines processing TB-scale datasets with SQL-based transformations and data quality checks.
Designed ML pipelines for real-time data analysis with model versioning and experiment tracking.

Stack: AWS, Slurm, Kubernetes, Docker, Singularity, Jenkins, Terraform, Prometheus, Grafana, PostgreSQL, Python

Education

2024 - 2026

MSc in Data Science

Saint Peter's University

2013 - 2018

BSc in Computer Technology

Jomo Kenyatta University of Agriculture and Technology

Certifications

View verified credentials on Credly →

Expertise

Technical Capabilities

Cloud Strategy & Scale

Design and govern cloud foundations (AWS/GCP) aligned to Well‑Architected principles for reliability, performance, and cost control.

Multiple Projects

Container Platforms

Operate resilient Kubernetes/container platforms with policy, security, and golden paths for application teams.

Multiple Projects

Observability & SLOs

End‑to‑end telemetry (logs/metrics/traces) and SLO programs for clear ownership, faster diagnosis, and proactive reliability.

Multiple Projects

Data Platforms

Build data foundations for analytics/ML with governance, lineage, and scalable pipelines.

Multiple Projects

Delivery Automation

Standardize CI/CD with security and compliance built‑in to accelerate releases safely.

Multiple Projects

Distributed & HPC

Enable high‑throughput compute and storage for intensive and scientific workloads.

Multiple Projects

ML & GPU Enablement

Operationalize training/inference on GPUs with capacity planning and cost governance.

Multiple Projects

Infrastructure as Code

Codify infrastructure and policy for repeatability, auditability, and speed.

Multiple Projects

Capabilities

Technical Foundations

GPU & AI Infrastructure

GPU Resource Management, CUDA, Performance Optimization
Kubernetes GPU Operator, Resource Scheduling
Machine Learning Infrastructure, Model Deployment

Automation Languages

Python, Java, Golang, JavaScript, R, Rust

Container & Orchestration Platforms

Docker, Singularity, Kubernetes, Docker Swarm

Infrastructure as Code

Terraform, Ansible, Helm

Cloud Foundations

AWS, GCP, Kubernetes, Terraform
Infrastructure as Code, GitOps
High Availability Architecture

Delivery Tooling

Jenkins, CircleCI, GitHub Actions

Observability Stack

Grafana, Prometheus, ELK Stack, Open Telemetry, Honeycomb

Distributed Systems

Distributed Computing, Scalable Architecture
Message Queues, Event-Driven Systems
Consensus Protocols, Eventual Consistency

Data Stores

PostgreSQL, MongoDB, MySQL, Cassandra

Compute & Version Control

Linux, Windows, GitHub, GitLab

Delivery & Collaboration

Asana, Jira, Confluence, Opsgenie

Research

Research & Open Source

Completed Work

VGAC — Predictable GPU Scheduling

Production deployment on EKS, tracking 12,000+ job events

AUROC: 0.969, ECE: 0.005 — well-calibrated predictions
Predicts job start times for GPU clusters (Slurm/Kubernetes)
Live at vgac.cloud

LLM Observability for Log Analysis

Schema-strict log intelligence with statistical validation

Calibration metrics, McNemar's test, bootstrap CIs
Confidence-gated alerting to reduce noise
Cost-aware inference tracking

Current Research Areas

Agentic AI & Safe Agents: verifiable, policy‑bounded autonomous agents for regulated and unregulated industries
ML in Regulated & Unregulated Industries: compliant, auditable ML for high-stakes workflows
AI Governance & Observability: reasoning audits, calibration (ECE/Brier), confidence‑gated alerting
ML/GPU Cluster Efficiency: under‑utilization detection, wait‑time risk modeling, right‑sizing
eBPF & Systems Telemetry: low‑overhead kernel/network tracing for AI workloads

Open Source

agent-secure (SENTINEL) — AI reasoning observatory & governance layer for MCP agents (Go)
llm-observability — LLM-powered log analysis with calibration
Calibrated-Queue-Delay-Prediction — GPU queue wait-time forecasting
k8s-debug-tui — Terminal UI for Kubernetes debugging
gpu-ml-framework — GPU cluster ML infrastructure

Speaking

Talks, Events & Community

Speaker Profile on Sessionize

Upcoming Events

HBMA — Healthcare Business Management Association

Virtual Webinar | July 22, 2026 · 2 PM ET

Presenting

AI Agents Are Making Healthcare Decisions. Who's Watching the Reasoning?

Detecting and preventing silent AI failure in healthcare RCM — the "Pattern Delta" signature (stale policy + incomplete retrieval), mid-reasoning interception, and a three-tier confidence gating system (auto-proceed / human review / block). Drawn from production systems and 150 benchmarked prior-authorization, claims, and denial cases. Eligible for 1 CHBME credit.

MCP Dev Summit North America

New York, NY | April 2–3, 2026

Accepted

From 60 Minutes To 60 Seconds: Production MCP Workflows for Healthcare Billing

How we integrated MCP into Kustode's multi-tenant RCM platform to automate denial management, prior authorization, and claim intervention — multi-tenant MCP architecture with PHI isolation and security patterns for regulated environments.

ISS 2026 — Improving Scientific Software Conference

NCAR, Boulder, CO | April 6–9, 2026

Accepted + Travel Support

VGAC: Building a GPU Cluster Observability Platform with Predictive Queue Intelligence

Open-source observability platform that transforms passive GPU cluster monitoring into proactive scheduling intelligence through calibrated queue delay predictions. Deployed on production EKS with AUC-ROC 0.756 and ECE 0.077.

PlatformCon 2026

Virtual | 2026

Accepted

When AI Ships Faster Than Humans Can Review: AI-Native CI/CD Pipelines

The shift from AI-assisted to AI-native CI/CD — automated quality gates, AI code review with feedback loops, deploy safety, and metrics-driven governance across 12+ microservices at Kustode.

Past Talks

Learning Behavioral Signals for ML Cluster Efficiency

Google NY SRE Tech Talks (Dec 16, 2025). Using standard cluster logs to predict under‑utilization and queue risk — toward proactive, data‑driven reliability.

SRE for Streaming AI: Building Resilient Platforms to Combat Model Drift

Data Streaming Summit (Virtual) 2025 — Applying SRE principles to detect and respond to model drift in real time across streaming architectures.

KCD NYC Roundtable — Platform Engineering and Observability

From Chaos to Clarity: Platform Engineering & Observability (KCD NYC)

KCD New York roundtable recap on correlating telemetry, reducing alert fatigue, and using IDPs and LLMs to improve signal and standardization.

Ambassadorships

Agentic AI Foundation

Ambassador · 2026

Advancing safe, open, and responsible agentic AI across the community.

RenderATL

Ambassador · 2026

Atlanta, GA · Aug 12–13, 2026 — championing the Silicon South tech community.

Portfolio

Selected Impact

SENTINEL — AI Reasoning Observatory for MCP Agent Governance

Pre-decision reasoning auditor for AI agents. Intercepts decisions before they execute, auditing evidence quality at the signal level — secured by Solo.io's agentgateway at the MCP protocol layer.

Hackathon submission: Secure & Govern MCP — aihackathon.dev

Caught 89%-confident agent making 23%-accurate decisions
Payer-specific drift detection (Aetna: 84% → 44%)
MCP tool-level RBAC via agentgateway + CEL policies
Sponsor integrations: Datadog, Braintrust, Cleric, ElevenLabs

AEGIS — Real-Time Security Proxy for Healthcare AI Agents

Three-layer defense proxy (ML classifier + hardened LLM auditor + output sanitizer) that protects healthcare AI agents from prompt injection and PHI exfiltration. Built in Go with ONNX Runtime for sub-100ms inference.

Being integrated into the Kustode production platform for protecting MCP-based healthcare billing agents.

0.25% attack success rate (target: ≤ 10%)
0.00% PHI leakage across 18 HIPAA identifier types
45.4ms p50 latency — fits in the request path
99.49% overall accuracy with ECE 0.0038 calibration
Trained on 21,643 samples (Tensor Trust, HackAPrompt, MedQA-USMLE)

VGAC — Predictable Scheduling for GPU Clusters

Product and research initiative that learns cluster behavior (Slurm/Kubernetes) to predict job start times, improve utilization, and enable data‑driven capacity planning.

Results at a glance:

Reliable start‑time predictions (e.g., ±15 min) reduce “when will it run?” uncertainty
Higher experiment velocity with clear best‑time submission guidance
Improved GPU utilization via pattern‑aware scheduling insights
Faster capacity decisions with visibility into queue dynamics and bottlenecks

Proactive Model Quality Monitoring

Introduced a drift‑aware monitoring capability that reduces undetected model decay and accelerates response time with adaptive thresholds and real‑time dashboards.

Outcome: Faster incident detection; improved ML service reliability.

MTTR reduced for model incidents
Early‑warning signals on data drift
Dashboards for leadership visibility

Executive Observability: Log Intelligence

Delivered structured, confidence‑scored incident intelligence from raw logs to reduce triage time and elevate operational decision‑making.

Outcome: Shorter MTTR, clearer executive signal, and cost‑aware analysis.

Triage time reduced with structured alerts
Confidence‑gated notifications to limit noise
Cost/performance tracked for governance

Resilient Data Platform Foundation

Established a fault‑tolerant data substrate leveraging consistent hashing and replication for predictable scale and uptime under node churn.

Outcome: Higher availability and predictable performance at scale.

Graceful degradation under failures
Consistent performance at growth
Operational runbooks and SLOs

Risk Forecasting Pipeline

Built an explainable risk‑scoring workflow to support credit decisions and reduce exposure through early warning signals.

Outcome: Better risk visibility; faster portfolio interventions.

Explainable scores for decisions
Earlier detection of high‑risk cases
Operational dashboards for stakeholders

Data Observability Modernization

Consolidated metrics, logs, and traces into actionable views with automated alerting and SLO‑aligned dashboards.

Outcome: Faster detection, clearer ownership, improved reliability.

Kustode — Production Multi-Tenant SaaS Platform

Co-Founded and architected from zero to production — a distributed, event-driven platform handling real-time transaction processing with enterprise-grade reliability.

Architecture & Engineering:

12+ Microservices with event-driven architecture, async messaging (SNS/SQS patterns), and clean API boundaries
Multi-Tenant Isolation: PostgreSQL row-level security, organization-scoped queries, zero cross-tenant data leakage
Real-Time Pipelines: Webhook ingestion, idempotent processing, state machines, automatic reconciliation
Resilience Patterns: Circuit breakers, retry with backoff, dead-letter queues, graceful degradation

Observability & SRE:

Distributed Tracing: OpenTelemetry instrumentation across HTTP, async tasks, and DB calls
Metrics & Dashboards: Prometheus + Grafana with latency histograms, error budgets, SLO tracking
Structured Logging: Correlation IDs, trace context, queryable in Jaeger

Results:

99.9%+ availability with automated alerting and incident response
60% MTTR reduction through end-to-end tracing and structured debugging
40% p95 latency improvement via query optimization, connection pooling, N+1 elimination
25% infra cost savings through right-sizing and efficient resource utilization

Stack: Python/FastAPI, React/Next.js, PostgreSQL, Redis, AWS (EC2, RDS, S3, SNS, CloudWatch), Docker, Nginx, Ansible, OpenTelemetry, Grafana, Prometheus, Jaeger

Retail Operations & Finance Suite

Delivered a cross‑platform solution for operations, reconciliation, and financial controls with built‑in resilience.

Cross‑platform access (desktop) with robust state management
Automated reconciliation and tax workflows to reduce manual effort
Hardened backup and export capabilities for continuity

Technologies: Electron, React, Redux, SQLite, Node.js, Material-UI

Unified Microservices Telemetry

Standardized tracing and metrics to create one source of truth across services, enabling faster diagnosis and capacity planning.

Outcome: Lower MTTR; better planning signals for scaling.

LLM & NLP Enablement

Operationalized model training/inference pipelines on GPUs with governance for reliability and cost.

Outcome: Faster iteration and stable deployment pathways.

Enterprise Cloud Migration

Led modernization to AWS with IaC and automated delivery, minimizing downtime and accelerating release cadence.

Outcome: Reduced risk, faster time‑to‑value, standardized operations.

HPC for Research Acceleration

Provisioned a scalable compute backbone with containerized workflows and batch scheduling to speed research cycles.

Outcome: Higher throughput on data‑intensive workloads.

Security & Compliance Hardening

Implemented policy‑driven security controls and monitoring for regulated environments with automated checks and audit trails.

Outcome: Reduced risk profile and improved compliance readiness.

Prompt-Engineering for Open-Source LLMs

Developed prompt-engineering techniques for enhancing the performance of open-source large language models (LLMs). Implemented and fine-tuned prompts to improve the accuracy and applicability of LLMs in various use cases, such as text summarization, sentiment analysis, and information extraction.

Tools Used: Python, TensorFlow, PyTorch, OpenAI GPT, Docker.

Data Science Platform Enablement

Standardized notebook environments and governed execution paths to improve reproducibility and collaboration.

Outcome: Faster onboarding and consistent results across teams.

Parallel Compute on AWS

Enabled managed HPC capabilities (Batch/Slurm) for elastic, policy‑driven compute aligned to workload needs.

Outcome: Elastic scale with governance and cost control.

Serverless Modernization

Adopted serverless building blocks (Lambda, API, identity, CDN) with observability to reduce ops burden and time‑to‑market.

Outcome: Lower run costs and simplified operations.

LLM Calibration & Reliability Observatory

Interactive simulation demonstrating isotonic calibration for LLM confidence scores, drift detection with automated recalibration, and building SLIs/SLOs for semantic reliability.

Try it live: Run the simulation, inject drift, and watch the system detect, alert, and self-heal.

Isotonic regression (PAV) for confidence calibration
ECE, Brier, MCE metrics with bootstrap CI
Automated drift detection & recalibration
SLI/SLO framework with burn-rate alerts

Say Hello

Let's Connect

Whether it's about infrastructure challenges, research ideas, or just a good tech conversation — I'd love to hear from you.

Topics I Love to Chat About:

Platform Engineering & Cloud Architecture
ML Infrastructure & GPU Clusters
Observability & SRE
Open Source & Research Collaboration

I'm always open to interesting conversations about infrastructure, distributed systems, or research ideas. If you're working on something cool or just want to chat about tech, reach out!

Email: masundeespira@gmail.com

GitHub: github.com/espirado

LinkedIn: linkedin.com/in/andrew-espira

Things I'm Currently Interested In:

GPU cluster optimization & scheduling
eBPF for low-overhead observability
ML infrastructure at scale
Cost-aware cloud architectures
Open source collaboration

Currently exploring GPU computing research, eBPF for observability, and sustainable cloud architectures. Always happy to exchange ideas!

Let's Chat

Andrew Espira

Introduce

Researching Safe, Agentic AI for Regulated & Unregulated Industries| From applied ML research to systems that ship safely at scale

8+

30+

About

Summary

What I Do

Technical DNA

Beyond the Code

Experience

Roles & Impact

Applied AI Researcher & Platform Engineer

Distributed Systems & Architecture

Observability & SRE

Security Engineering

Infrastructure & Platform

Research Software Engineer

ML & GPU Infrastructure

Cloud & Automation

Site Reliability Engineer

Reliability & Performance

Observability & Incident Response

DevOps Tech Lead

Platform Modernization

Observability

System Engineer / Data Engineer

HPC & Compute Infrastructure

CI/CD & Automation

Data Engineering

Education

MSc in Data Science

BSc in Computer Technology

Certifications

Expertise

Technical Capabilities

Cloud Strategy & Scale

Container Platforms

Observability & SLOs

Data Platforms

Delivery Automation

Distributed & HPC

ML & GPU Enablement

Infrastructure as Code

Capabilities

Technical Foundations

Research

Research & Open Source

Completed Work

Current Research Areas

Open Source

Speaking

Talks, Events & Community

Upcoming Events

Past Talks

Ambassadorships

Agentic AI Foundation

Portfolio

Selected Impact

Retail Operations & Finance Suite

Say Hello

Let's Connect

Whether it's about infrastructure challenges, research ideas, or just a good tech conversation — I'd love to hear from you.

Topics I Love to Chat About: