Introduce

Researching Safe, Agentic AI for Regulated & Unregulated Industries| From applied ML research to systems that ship safely at scale

Applied AI Researcher & Engineer • Agentic AI, ML & Safe Agents • Regulated & Unregulated Industries

I research how agentic AI and ML can operate safely and verifiably across both regulated and unregulated industries — and translate that research into production systems that are reliable, observable, and compliant by design.

Rounded Text

8+

Years of
Experience

30+

projects completed on
different technologies

About

Summary

I'm an applied AI researcher and engineer (MSc, Data Science) focused on agentic AI, ML, and safe agents for both regulated and unregulated industries. At Kustode, I turn that research into production: a multi-tenant platform that serves as my testbed for studying how autonomous agents can act safely, verifiably, and within strict compliance constraints. With 8+ years of hands-on production experience across ML platforms, GPU clusters, and distributed systems, I combine rigorous engineering with research into reliable, safe, and observable AI.

Currently open to opportunities. I'm looking for a role where I can fully apply and build — agentic AI & ML platforms for regulated and unregulated industries, safe-agent systems, and the reliability and observability foundations they run on. Let's talk →

What I Do

I architect cloud-native platforms that balance reliability, performance, and cost. From orchestrating Kubernetes at scale to building ML training infrastructure and observability stacks, I focus on systems that grow with your business—without ballooning your cloud bill or waking your on-call.

Technical DNA

  • Cloud: AWS, GCP, deep serverless expertise
  • Orchestration: Kubernetes, Docker Swarm at scale
  • IaC: Terraform, automated deployments that eliminate human error
  • Languages: Python, Golang for robust automation and tooling
  • DevOps: Jenkins, GitOps, reliable CI/CD pipelines
  • Data & ML Infra: Platforms for complex ML workflows

Beyond the Code

  • Proactive observability that prevents user impact
  • 30–50% cloud cost reductions via intelligent resource management
  • Self-service platforms that enable developer velocity
  • Active open-source contribution and community leadership

Experience

Roles & Impact

May 2025 – May 2026 · 1 yr 1 mo

Applied AI Researcher & Platform Engineer

Kustode

Researching and building agentic AI, ML, and safe-agent systems for regulated and unregulated industries — on a production-grade, multi-tenant platform.

Distributed Systems & Architecture

  • Designed event-driven microservices architecture with 12+ services, async messaging (SNS/SQS patterns), and clean API contracts.
  • Implemented multi-tenant isolation at database layer using PostgreSQL with row-level security and foreign key constraints — zero cross-tenant data leakage.
  • Built real-time data pipelines processing external EDI transactions with webhook ingestion, idempotent processing, and automatic state reconciliation.
  • Designed resilient integration patterns: retry with exponential backoff, circuit breakers, dead-letter queues, and graceful degradation.

Observability & SRE

  • Instrumented end-to-end distributed tracing with OpenTelemetry — trace context propagation across HTTP, async tasks, and database calls.
  • Built Grafana dashboards with Prometheus metrics: service health, latency histograms, error budgets, and business KPIs.
  • Established SLO framework (99.9% availability, p95 latency targets) with automated alerting and on-call runbooks.
  • Reduced MTTR by 60% through structured logging (structlog), correlation IDs, and queryable traces in Jaeger.

Security Engineering

  • Implemented zero-trust auth: JWT with refresh token rotation, RBAC with permission-based access control, and session management.
  • Built defense-in-depth: WAF rules, rate limiting, input validation, SQL injection prevention, and audit logging.
  • Automated security scanning in CI/CD: static analysis (Bandit), secret detection (Gitleaks), dependency audits.
  • Designed secrets management with Ansible Vault — no plaintext credentials in repos or containers.

Infrastructure & Platform

  • Provisioned AWS infrastructure (EC2, RDS, S3, CloudWatch, SNS) with Ansible playbooks for reproducible multi-environment deployments.
  • Built CI/CD pipelines: GitHub Actions with parallel testing, lint gates, security checks, and blue-green deployments.
  • Achieved 40% p95 latency reduction via query optimization, connection pooling (asyncpg), and N+1 elimination.
  • Delivered 25% infrastructure cost savings through right-sizing, reserved capacity, and efficient resource utilization.

Stack: Python/FastAPI, React/Next.js, PostgreSQL, Redis, AWS, Docker, Nginx, Ansible, OpenTelemetry, Grafana, Prometheus, Jaeger

May 2023 – Jun 2024 · 1 yr 2 mos

Research Software Engineer

EcoHealth Alliance

Built ML infrastructure and GPU compute platforms for research workloads.

ML & GPU Infrastructure

  • Deployed and optimized LLMs on GPU clusters — improved training throughput by 35% through batch tuning and mixed-precision training.
  • Built model serving infrastructure with auto-scaling, health checks, and A/B deployment capabilities.
  • Implemented GPU resource scheduling with fair-share policies and utilization monitoring.

Cloud & Automation

  • Architected AWS serverless platform (Lambda, Step Functions, S3) — reduced operational costs by 30%.
  • Automated infrastructure with Terraform + GitOps — cut provisioning time from days to minutes.
  • Built containerized HPC workflows on Kubernetes — improved reproducibility and reduced analysis bottlenecks by 50%.

Stack: AWS (Lambda, EC2, S3, SageMaker), Kubernetes, Docker, Terraform, Python, PyTorch, CUDA

May 2022 – Jun 2023 · 1 yr 2 mos

Site Reliability Engineer

Sportserve

Owned reliability for high-throughput streaming infrastructure processing millions of events/day on GCP.

Reliability & Performance

  • Maintained 99.9% uptime and 99.95% data integrity across distributed streaming services handling peak loads of 50K+ events/sec.
  • Optimized BigQuery pipelines and Pub/Sub consumers — reduced query costs by 30% and p99 latency by 40%.
  • Implemented capacity planning models using historical traffic patterns to prevent over-provisioning.

Observability & Incident Response

  • Built unified observability stack: GCP Monitoring, Prometheus, OpenTelemetry with custom dashboards for real-time anomaly detection.
  • Led incident response and blameless postmortems — reduced repeat incidents by 50% through systematic action items.
  • Implemented SLO-based alerting with error budgets — cut alert noise by 60% while improving signal quality.

Stack: GCP (BigQuery, Pub/Sub, GKE, Cloud Functions), Kubernetes, Prometheus, Grafana, OpenTelemetry, Python, Go

Aug 2021 – May 2022 · 10 mos

DevOps Tech Lead

Sarami

Led platform modernization from monolith to microservices.

Platform Modernization

  • Led monolith-to-microservices migration on AWS — reduced deployment time by 60% and improved fault isolation.
  • Built GitOps CI/CD pipeline (GitLab CI + Terraform + Ansible) — accelerated releases from weekly to daily deployments.
  • Implemented infrastructure as code with versioned, peer-reviewed changes and automated rollbacks.

Observability

  • Deployed Prometheus + Grafana stack with custom alerting rules — reduced MTTR by 40%.
  • Implemented centralized logging (ELK) with structured formats and correlation IDs.

Stack: AWS (ECS, RDS, ElastiCache), Docker, Terraform, Ansible, GitLab CI, Prometheus, Grafana, ELK

Aug 2019 – May 2022 · 2 yrs 10 mos

System Engineer / Data Engineer

ICIPE — International Centre of Insect Physiology and Ecology

Built and operated HPC infrastructure and data pipelines for compute-intensive scientific workloads.

HPC & Compute Infrastructure

  • Architected hybrid HPC platform (AWS + on-prem) with Slurm scheduler — scaled to 500+ concurrent jobs.
  • Optimized batch processing pipelines — reduced workflow execution time by 50% through parallelization and resource tuning.
  • Implemented container orchestration (Docker, Singularity, Kubernetes) for reproducible scientific workflows.

CI/CD & Automation

  • Built CI/CD pipelines (Jenkins, CircleCI) with automated testing, container builds, and deployment gates.
  • Implemented Prometheus/Grafana monitoring — achieved 90%+ uptime with proactive alerting.
  • Automated infrastructure provisioning with Terraform — reduced setup time from weeks to hours.

Data Engineering

  • Built ETL pipelines processing TB-scale datasets with SQL-based transformations and data quality checks.
  • Designed ML pipelines for real-time data analysis with model versioning and experiment tracking.

Stack: AWS, Slurm, Kubernetes, Docker, Singularity, Jenkins, Terraform, Prometheus, Grafana, PostgreSQL, Python


Education

2024 - 2026

MSc in Data Science

Saint Peter's University

2013 - 2018

BSc in Computer Technology

Jomo Kenyatta University of Agriculture and Technology

Expertise

Technical Capabilities

Cloud Strategy & Scale

Design and govern cloud foundations (AWS/GCP) aligned to Well‑Architected principles for reliability, performance, and cost control.

Multiple Projects

Container Platforms

Operate resilient Kubernetes/container platforms with policy, security, and golden paths for application teams.

Multiple Projects

Observability & SLOs

End‑to‑end telemetry (logs/metrics/traces) and SLO programs for clear ownership, faster diagnosis, and proactive reliability.

Multiple Projects

Data Platforms

Build data foundations for analytics/ML with governance, lineage, and scalable pipelines.

Multiple Projects

Delivery Automation

Standardize CI/CD with security and compliance built‑in to accelerate releases safely.

Multiple Projects

Distributed & HPC

Enable high‑throughput compute and storage for intensive and scientific workloads.

Multiple Projects

ML & GPU Enablement

Operationalize training/inference on GPUs with capacity planning and cost governance.

Multiple Projects

Infrastructure as Code

Codify infrastructure and policy for repeatability, auditability, and speed.

Multiple Projects

Capabilities

Technical Foundations

GPU Computing

GPU & AI Infrastructure

  • GPU Resource Management, CUDA, Performance Optimization
  • Kubernetes GPU Operator, Resource Scheduling
  • Machine Learning Infrastructure, Model Deployment
Programming Languages

Automation Languages

  • Python, Java, Golang, JavaScript, R, Rust
Containerization and Orchestration

Container & Orchestration Platforms

  • Docker, Singularity, Kubernetes, Docker Swarm
Infrastructure as Code

Infrastructure as Code

  • Terraform, Ansible, Helm
Cloud Infrastructure

Cloud Foundations

  • AWS, GCP, Kubernetes, Terraform
  • Infrastructure as Code, GitOps
  • High Availability Architecture
CI/CD Tools

Delivery Tooling

  • Jenkins, CircleCI, GitHub Actions
Monitoring and Logging

Observability Stack

  • Grafana, Prometheus, ELK Stack, Open Telemetry, Honeycomb
Distributed Systems

Distributed Systems

  • Distributed Computing, Scalable Architecture
  • Message Queues, Event-Driven Systems
  • Consensus Protocols, Eventual Consistency
Database Management

Data Stores

  • PostgreSQL, MongoDB, MySQL, Cassandra
Server and Version Control Management

Compute & Version Control

  • Linux, Windows, GitHub, GitLab
Project Management Tools

Delivery & Collaboration

  • Asana, Jira, Confluence, Opsgenie

Research

Research & Open Source

Completed Work

VGAC — Predictable GPU Scheduling

Production deployment on EKS, tracking 12,000+ job events

  • AUROC: 0.969, ECE: 0.005 — well-calibrated predictions
  • Predicts job start times for GPU clusters (Slurm/Kubernetes)
  • Live at vgac.cloud

LLM Observability for Log Analysis

Schema-strict log intelligence with statistical validation

  • Calibration metrics, McNemar's test, bootstrap CIs
  • Confidence-gated alerting to reduce noise
  • Cost-aware inference tracking

Current Research Areas

  • Agentic AI & Safe Agents: verifiable, policy‑bounded autonomous agents for regulated and unregulated industries
  • ML in Regulated & Unregulated Industries: compliant, auditable ML for high-stakes workflows
  • AI Governance & Observability: reasoning audits, calibration (ECE/Brier), confidence‑gated alerting
  • ML/GPU Cluster Efficiency: under‑utilization detection, wait‑time risk modeling, right‑sizing
  • eBPF & Systems Telemetry: low‑overhead kernel/network tracing for AI workloads

Open Source

Speaking

Talks, Events & Community

Upcoming Events

MCP Dev Summit North America

New York, NY  |  April 2–3, 2026

Accepted

From 60 Minutes To 60 Seconds: Production MCP Workflows for Healthcare Billing

How we integrated MCP into Kustode's multi-tenant RCM platform to automate denial management, prior authorization, and claim intervention — multi-tenant MCP architecture with PHI isolation and security patterns for regulated environments.

ISS 2026 — Improving Scientific Software Conference

NCAR, Boulder, CO  |  April 6–9, 2026

Accepted + Travel Support

VGAC: Building a GPU Cluster Observability Platform with Predictive Queue Intelligence

Open-source observability platform that transforms passive GPU cluster monitoring into proactive scheduling intelligence through calibrated queue delay predictions. Deployed on production EKS with AUC-ROC 0.756 and ECE 0.077.

PlatformCon 2026

Virtual  |  2026

Accepted

When AI Ships Faster Than Humans Can Review: AI-Native CI/CD Pipelines

The shift from AI-assisted to AI-native CI/CD — automated quality gates, AI code review with feedback loops, deploy safety, and metrics-driven governance across 12+ microservices at Kustode.

Past Talks

Learning Behavioral Signals for ML Cluster Efficiency

Google NY SRE Tech Talks (Dec 16, 2025). Using standard cluster logs to predict under‑utilization and queue risk — toward proactive, data‑driven reliability.

SRE for Streaming AI: Building Resilient Platforms to Combat Model Drift

Data Streaming Summit (Virtual) 2025 — Applying SRE principles to detect and respond to model drift in real time across streaming architectures.

From Chaos to Clarity: Platform Engineering & Observability (KCD NYC)

KCD New York roundtable recap on correlating telemetry, reducing alert fatigue, and using IDPs and LLMs to improve signal and standardization.

Portfolio

Selected Impact

SENTINEL — AI Reasoning Observatory for MCP Agent Governance

Pre-decision reasoning auditor for AI agents. Intercepts decisions before they execute, auditing evidence quality at the signal level — secured by Solo.io's agentgateway at the MCP protocol layer.

Hackathon submission: Secure & Govern MCP — aihackathon.dev

  • Caught 89%-confident agent making 23%-accurate decisions
  • Payer-specific drift detection (Aetna: 84% → 44%)
  • MCP tool-level RBAC via agentgateway + CEL policies
  • Sponsor integrations: Datadog, Braintrust, Cleric, ElevenLabs

AEGIS — Real-Time Security Proxy for Healthcare AI Agents

Three-layer defense proxy (ML classifier + hardened LLM auditor + output sanitizer) that protects healthcare AI agents from prompt injection and PHI exfiltration. Built in Go with ONNX Runtime for sub-100ms inference.

Being integrated into the Kustode production platform for protecting MCP-based healthcare billing agents.

  • 0.25% attack success rate (target: ≤ 10%)
  • 0.00% PHI leakage across 18 HIPAA identifier types
  • 45.4ms p50 latency — fits in the request path
  • 99.49% overall accuracy with ECE 0.0038 calibration
  • Trained on 21,643 samples (Tensor Trust, HackAPrompt, MedQA-USMLE)

VGAC — Predictable Scheduling for GPU Clusters

Product and research initiative that learns cluster behavior (Slurm/Kubernetes) to predict job start times, improve utilization, and enable data‑driven capacity planning.

Results at a glance:

  • Reliable start‑time predictions (e.g., ±15 min) reduce “when will it run?” uncertainty
  • Higher experiment velocity with clear best‑time submission guidance
  • Improved GPU utilization via pattern‑aware scheduling insights
  • Faster capacity decisions with visibility into queue dynamics and bottlenecks

Proactive Model Quality Monitoring

Introduced a drift‑aware monitoring capability that reduces undetected model decay and accelerates response time with adaptive thresholds and real‑time dashboards.

Outcome: Faster incident detection; improved ML service reliability.

  • MTTR reduced for model incidents
  • Early‑warning signals on data drift
  • Dashboards for leadership visibility

Executive Observability: Log Intelligence

Delivered structured, confidence‑scored incident intelligence from raw logs to reduce triage time and elevate operational decision‑making.

Outcome: Shorter MTTR, clearer executive signal, and cost‑aware analysis.

  • Triage time reduced with structured alerts
  • Confidence‑gated notifications to limit noise
  • Cost/performance tracked for governance

Resilient Data Platform Foundation

Established a fault‑tolerant data substrate leveraging consistent hashing and replication for predictable scale and uptime under node churn.

Outcome: Higher availability and predictable performance at scale.

  • Graceful degradation under failures
  • Consistent performance at growth
  • Operational runbooks and SLOs

Risk Forecasting Pipeline

Built an explainable risk‑scoring workflow to support credit decisions and reduce exposure through early warning signals.

Outcome: Better risk visibility; faster portfolio interventions.

  • Explainable scores for decisions
  • Earlier detection of high‑risk cases
  • Operational dashboards for stakeholders

Data Observability Modernization

Consolidated metrics, logs, and traces into actionable views with automated alerting and SLO‑aligned dashboards.

Outcome: Faster detection, clearer ownership, improved reliability.

Kustode — Production Multi-Tenant SaaS Platform

Co-Founded and architected from zero to production — a distributed, event-driven platform handling real-time transaction processing with enterprise-grade reliability.

Architecture & Engineering:

  • 12+ Microservices with event-driven architecture, async messaging (SNS/SQS patterns), and clean API boundaries
  • Multi-Tenant Isolation: PostgreSQL row-level security, organization-scoped queries, zero cross-tenant data leakage
  • Real-Time Pipelines: Webhook ingestion, idempotent processing, state machines, automatic reconciliation
  • Resilience Patterns: Circuit breakers, retry with backoff, dead-letter queues, graceful degradation

Observability & SRE:

  • Distributed Tracing: OpenTelemetry instrumentation across HTTP, async tasks, and DB calls
  • Metrics & Dashboards: Prometheus + Grafana with latency histograms, error budgets, SLO tracking
  • Structured Logging: Correlation IDs, trace context, queryable in Jaeger

Results:

  • 99.9%+ availability with automated alerting and incident response
  • 60% MTTR reduction through end-to-end tracing and structured debugging
  • 40% p95 latency improvement via query optimization, connection pooling, N+1 elimination
  • 25% infra cost savings through right-sizing and efficient resource utilization

Stack: Python/FastAPI, React/Next.js, PostgreSQL, Redis, AWS (EC2, RDS, S3, SNS, CloudWatch), Docker, Nginx, Ansible, OpenTelemetry, Grafana, Prometheus, Jaeger

Retail Operations & Finance Suite

Delivered a cross‑platform solution for operations, reconciliation, and financial controls with built‑in resilience.

  • Cross‑platform access (desktop) with robust state management
  • Automated reconciliation and tax workflows to reduce manual effort
  • Hardened backup and export capabilities for continuity

Technologies: Electron, React, Redux, SQLite, Node.js, Material-UI

Unified Microservices Telemetry

Standardized tracing and metrics to create one source of truth across services, enabling faster diagnosis and capacity planning.

Outcome: Lower MTTR; better planning signals for scaling.

LLM & NLP Enablement

Operationalized model training/inference pipelines on GPUs with governance for reliability and cost.

Outcome: Faster iteration and stable deployment pathways.

Enterprise Cloud Migration

Led modernization to AWS with IaC and automated delivery, minimizing downtime and accelerating release cadence.

Outcome: Reduced risk, faster time‑to‑value, standardized operations.

HPC for Research Acceleration

Provisioned a scalable compute backbone with containerized workflows and batch scheduling to speed research cycles.

Outcome: Higher throughput on data‑intensive workloads.

Security & Compliance Hardening

Implemented policy‑driven security controls and monitoring for regulated environments with automated checks and audit trails.

Outcome: Reduced risk profile and improved compliance readiness.

Prompt-Engineering for Open-Source LLMs

Developed prompt-engineering techniques for enhancing the performance of open-source large language models (LLMs). Implemented and fine-tuned prompts to improve the accuracy and applicability of LLMs in various use cases, such as text summarization, sentiment analysis, and information extraction.

Tools Used: Python, TensorFlow, PyTorch, OpenAI GPT, Docker.

Data Science Platform Enablement

Standardized notebook environments and governed execution paths to improve reproducibility and collaboration.

Outcome: Faster onboarding and consistent results across teams.

Parallel Compute on AWS

Enabled managed HPC capabilities (Batch/Slurm) for elastic, policy‑driven compute aligned to workload needs.

Outcome: Elastic scale with governance and cost control.

Serverless Modernization

Adopted serverless building blocks (Lambda, API, identity, CDN) with observability to reduce ops burden and time‑to‑market.

Outcome: Lower run costs and simplified operations.

LLM Calibration & Reliability Observatory

Interactive simulation demonstrating isotonic calibration for LLM confidence scores, drift detection with automated recalibration, and building SLIs/SLOs for semantic reliability.

Try it live: Run the simulation, inject drift, and watch the system detect, alert, and self-heal.

  • Isotonic regression (PAV) for confidence calibration
  • ECE, Brier, MCE metrics with bootstrap CI
  • Automated drift detection & recalibration
  • SLI/SLO framework with burn-rate alerts

Say Hello

Let's Connect

Whether it's about infrastructure challenges, research ideas, or just a good tech conversation — I'd love to hear from you.

Topics I Love to Chat About:

  • Platform Engineering & Cloud Architecture
  • ML Infrastructure & GPU Clusters
  • Observability & SRE
  • Open Source & Research Collaboration

I'm always open to interesting conversations about infrastructure, distributed systems, or research ideas. If you're working on something cool or just want to chat about tech, reach out!

Things I'm Currently Interested In:

  • GPU cluster optimization & scheduling
  • eBPF for low-overhead observability
  • ML infrastructure at scale
  • Cost-aware cloud architectures
  • Open source collaboration

Currently exploring GPU computing research, eBPF for observability, and sustainable cloud architectures. Always happy to exchange ideas!

Let's Chat