Introduction

Modern Infrastructure, Operational Clarity, and Confident Delivery | From cloud strategy to reliable platforms that ship at scale

Platform Engineering • SRE • Cloud Strategy • Observability • ML/GPU Enablement

I'm passionate about creating and optimizing the technological foundations that businesses rely on, turning complex infrastructure challenges into elegant, efficient solutions that scale seamlessly and perform reliably.

8+ years of experience

30+ projects completed across different technologies

About

Summary

I build and operate infrastructure for complex systems, from GPU clusters and ML platforms to distributed microservices and HPC environments. With 8+ years of hands-on production experience, I combine platform engineering with research in reliability and cluster optimization to deliver systems that scale predictably.

What I Do

I architect cloud-native platforms that balance reliability, performance, and cost. From orchestrating Kubernetes at scale to building ML training infrastructure and observability stacks, I focus on systems that grow with your business—without ballooning your cloud bill or waking your on-call.

Technical DNA

  • Cloud: AWS, GCP, deep serverless expertise
  • Orchestration: Kubernetes, Docker Swarm at scale
  • IaC: Terraform, automated deployments that eliminate human error
  • Languages: Python, Golang for robust automation and tooling
  • DevOps: Jenkins, GitOps, reliable CI/CD pipelines
  • Data & ML Infra: Platforms for complex ML workflows

Beyond the Code

  • Proactive observability that prevents user impact
  • 30–50% cloud cost reductions via intelligent resource management
  • Self-service platforms that enable developer velocity
  • Active open-source contribution and community leadership

Experience

Roles & Impact

May 2025 - Present

Co-Founder & Founding Engineer

Kustode

Built a production-grade, multi-tenant SaaS platform from zero — owning architecture, infrastructure, and delivery.

Distributed Systems & Architecture

  • Designed event-driven microservices architecture with 12+ services, async messaging (SNS/SQS patterns), and clean API contracts.
  • Implemented multi-tenant isolation at database layer using PostgreSQL with row-level security and foreign key constraints — zero cross-tenant data leakage.
  • Built real-time data pipelines processing external EDI transactions with webhook ingestion, idempotent processing, and automatic state reconciliation.
  • Designed resilient integration patterns: retry with exponential backoff, circuit breakers, dead-letter queues, and graceful degradation.
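
For readers less familiar with the pattern, here is a minimal sketch of retry with exponential backoff, one of the integration patterns listed above. It is illustrative only and not the Kustode code: the names `backoff_delays` and `call_with_retry`, the default delays, and the choice of "full jitter" are all assumptions made for the example.

```python
import random
import time


def backoff_delays(retries, base=0.5, cap=30.0):
    """Upper bound (seconds) for each retry wait: base * 2**n, capped at `cap`."""
    return [min(cap, base * (2 ** n)) for n in range(retries)]


def call_with_retry(func, attempts=5, base=0.5, cap=30.0, sleep=time.sleep):
    """Call func(); on failure wait a jittered, exponentially growing delay and retry.

    Hypothetical helper: a real client would catch only transient errors
    (timeouts, 5xx responses) instead of every Exception.
    """
    delays = backoff_delays(attempts - 1, base, cap)
    for attempt in range(attempts):
        try:
            return func()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of retries: surface the error, e.g. to a dead-letter queue
            # Full jitter: pick a random wait below the ceiling so that many
            # clients failing together do not retry in lockstep.
            sleep(random.uniform(0, delays[attempt]))
```

Passing `sleep` as a parameter keeps the helper easy to test without real waiting; a circuit breaker would sit one level above this, refusing calls entirely once failures pass a threshold.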

Observability & SRE

  • Instrumented end-to-end distributed tracing with OpenTelemetry — trace context propagation across HTTP, async tasks, and database calls.
  • Built Grafana dashboards with Prometheus metrics: service health, latency histograms, error budgets, and business KPIs.
  • Established SLO framework (99.9% availability, p95 latency targets) with automated alerting and on-call runbooks.
  • Reduced MTTR by 60% through structured logging (structlog), correlation IDs, and queryable traces in Jaeger.
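
To make the 99.9% availability target concrete, the arithmetic behind an error budget can be sketched as below. The 30-day window and the function names are assumptions for illustration; the SLO framework described above may use a different window or budget policy.

```python
def error_budget_minutes(slo, window_days=30):
    """Allowed downtime, in minutes, for an availability SLO over the window."""
    return (1.0 - slo) * window_days * 24 * 60


def budget_remaining(slo, downtime_minutes, window_days=30):
    """Fraction of the error budget still unspent; negative means the SLO is missed."""
    budget = error_budget_minutes(slo, window_days)
    return (budget - downtime_minutes) / budget


# A 99.9% target over 30 days leaves roughly 43.2 minutes of downtime to spend.
print(round(error_budget_minutes(0.999), 1))
```

Alerting on how fast this budget is being spent, instead of on every individual error, is the usual way an SLO turns into a paging policy.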

Security Engineering

  • Implemented zero-trust auth: JWT with refresh token rotation, RBAC with permission-based access control, and session management.
  • Built defense-in-depth: WAF rules, rate limiting, input validation, SQL injection prevention, and audit logging.
  • Automated security scanning in CI/CD: static analysis (Bandit), secret detection (Gitleaks), dependency audits.
  • Designed secrets management with Ansible Vault — no plaintext credentials in repos or containers.

Infrastructure & Platform

  • Provisioned AWS infrastructure (EC2, RDS, S3, CloudWatch, SNS) with Ansible playbooks for reproducible multi-environment deployments.
  • Built CI/CD pipelines: GitHub Actions with parallel testing, lint gates, security checks, and blue-green deployments.
  • Achieved 40% p95 latency reduction via query optimization, connection pooling (asyncpg), and N+1 elimination.
  • Delivered 25% infrastructure cost savings through right-sizing, reserved capacity, and efficient resource utilization.

Stack: Python/FastAPI, React/Next.js, PostgreSQL, Redis, AWS, Docker, Nginx, Ansible, OpenTelemetry, Grafana, Prometheus, Jaeger

2023 - 2025

Site Reliability Engineer

Sportserve — Real-Time Sports Data Platform

Owned reliability for high-throughput streaming infrastructure processing millions of events/day on GCP.

Reliability & Performance

  • Maintained 99.9% uptime and 99.95% data integrity across distributed streaming services handling peak loads of 50K+ events/sec.
  • Optimized BigQuery pipelines and Pub/Sub consumers — reduced query costs by 30% and p99 latency by 40%.
  • Implemented capacity planning models using historical traffic patterns to prevent over-provisioning.

Observability & Incident Response

  • Built unified observability stack: GCP Monitoring, Prometheus, OpenTelemetry with custom dashboards for real-time anomaly detection.
  • Led incident response and blameless postmortems — reduced repeat incidents by 50% through systematic action items.
  • Implemented SLO-based alerting with error budgets — cut alert noise by 60% while improving signal quality.

Stack: GCP (BigQuery, Pub/Sub, GKE, Cloud Functions), Kubernetes, Prometheus, Grafana, OpenTelemetry, Python, Go

2022 - 2023

Research Software Engineer

EcoHealth Alliance — Scientific Computing & ML Infrastructure

Built ML infrastructure and GPU compute platforms for research workloads.

ML & GPU Infrastructure

  • Deployed and optimized LLMs on GPU clusters — improved training throughput by 35% through batch tuning and mixed-precision training.
  • Built model serving infrastructure with auto-scaling, health checks, and A/B deployment capabilities.
  • Implemented GPU resource scheduling with fair-share policies and utilization monitoring.

Cloud & Automation

  • Architected AWS serverless platform (Lambda, Step Functions, S3) — reduced operational costs by 30%.
  • Automated infrastructure with Terraform + GitOps — cut provisioning time from days to minutes.
  • Built containerized HPC workflows on Kubernetes — improved reproducibility and reduced analysis bottlenecks by 50%.

Stack: AWS (Lambda, EC2, S3, SageMaker), Kubernetes, Docker, Terraform, Python, PyTorch, CUDA

2021 - 2022

DevOps Tech Lead

Sarami — Fintech Platform

Led platform modernization from monolith to microservices.

Platform Modernization

  • Led monolith-to-microservices migration on AWS — reduced deployment time by 60% and improved fault isolation.
  • Built GitOps CI/CD pipeline (GitLab CI + Terraform + Ansible) — accelerated releases from weekly to daily deployments.
  • Implemented infrastructure as code with versioned, peer-reviewed changes and automated rollbacks.

Observability

  • Deployed Prometheus + Grafana stack with custom alerting rules — reduced MTTR by 40%.
  • Implemented centralized logging (ELK) with structured formats and correlation IDs.

Stack: AWS (ECS, RDS, ElastiCache), Docker, Terraform, Ansible, GitLab CI, Prometheus, Grafana, ELK

2018 - 2021

Platform Engineer / DevOps Engineer

ICIPE — High-Performance Computing & Data Platforms

Built and operated HPC infrastructure for compute-intensive scientific workloads.

HPC & Compute Infrastructure

  • Architected hybrid HPC platform (AWS + on-prem) with Slurm scheduler — scaled to 500+ concurrent jobs.
  • Optimized batch processing pipelines — reduced workflow execution time by 50% through parallelization and resource tuning.
  • Implemented container orchestration (Docker, Singularity, Kubernetes) for reproducible scientific workflows.

CI/CD & Automation

  • Built CI/CD pipelines (Jenkins, CircleCI) with automated testing, container builds, and deployment gates.
  • Implemented Prometheus/Grafana monitoring — achieved 90%+ uptime with proactive alerting.
  • Automated infrastructure provisioning with Terraform — reduced setup time from weeks to hours.

Data Engineering

  • Built ETL pipelines processing TB-scale datasets with SQL-based transformations and data quality checks.
  • Designed ML pipelines for real-time data analysis with model versioning and experiment tracking.

Stack: AWS, Slurm, Kubernetes, Docker, Singularity, Jenkins, Terraform, Prometheus, Grafana, PostgreSQL, Python


Education

2024 - 2026

MSc in Data Science

Saint Peter's University

2013 - 2018

BSc in Computer Technology

Jomo Kenyatta University of Agriculture and Technology

Expertise

Technical Capabilities

Cloud Strategy & Scale

Design and govern cloud foundations (AWS/GCP) aligned to Well‑Architected principles for reliability, performance, and cost control.

Container Platforms

Operate resilient Kubernetes/container platforms with policy, security, and golden paths for application teams.

Observability & SLOs

End‑to‑end telemetry (logs/metrics/traces) and SLO programs for clear ownership, faster diagnosis, and proactive reliability.

Data Platforms

Build data foundations for analytics/ML with governance, lineage, and scalable pipelines.

Delivery Automation

Standardize CI/CD with security and compliance built‑in to accelerate releases safely.

Distributed & HPC

Enable high‑throughput compute and storage for intensive and scientific workloads.

ML & GPU Enablement

Operationalize training/inference on GPUs with capacity planning and cost governance.

Infrastructure as Code

Codify infrastructure and policy for repeatability, auditability, and speed.

Capabilities

Technical Foundations

GPU Computing

GPU & AI Infrastructure

  • GPU Resource Management, CUDA, Performance Optimization
  • Kubernetes GPU Operator, Resource Scheduling
  • Machine Learning Infrastructure, Model Deployment

Programming Languages

Automation Languages

  • Python, Java, Golang, JavaScript, R, Rust

Containerization and Orchestration

Container & Orchestration Platforms

  • Docker, Singularity, Kubernetes, Docker Swarm

Infrastructure as Code

  • Terraform, Ansible, Helm

Cloud Infrastructure

Cloud Foundations

  • AWS, GCP, Kubernetes, Terraform
  • Infrastructure as Code, GitOps
  • High Availability Architecture

CI/CD Tools

Delivery Tooling

  • Jenkins, CircleCI, GitHub Actions

Monitoring and Logging

Observability Stack

  • Grafana, Prometheus, ELK Stack, OpenTelemetry, Honeycomb

Distributed Systems

  • Distributed Computing, Scalable Architecture
  • Message Queues, Event-Driven Systems
  • Consensus Protocols, Eventual Consistency

Database Management

Data Stores

  • PostgreSQL, MongoDB, MySQL, Cassandra

Server and Version Control Management

Compute & Version Control

  • Linux, Windows, GitHub, GitLab

Project Management Tools

Delivery & Collaboration

  • Asana, Jira, Confluence, Opsgenie

Research

Research & Open Source

Completed Work

VGAC — Predictable GPU Scheduling

Production deployment on EKS, tracking 12,000+ job events

  • AUROC: 0.969, ECE: 0.005 — well-calibrated predictions
  • Predicts job start times for GPU clusters (Slurm/Kubernetes)
  • Live at vgac.cloud
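
For context on the ECE figure above: expected calibration error measures how far predicted probabilities sit from observed outcome rates. The sketch below shows one common binary formulation with equal-width bins. It is an assumption for illustration; the binning scheme and bin count VGAC actually uses are not stated here, and `expected_calibration_error` is a hypothetical name.

```python
def expected_calibration_error(probs, labels, n_bins=10):
    """ECE for binary predictions.

    Bin samples by predicted probability, then take the sample-weighted mean of
    |observed positive rate - mean predicted probability| across non-empty bins.
    """
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        index = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls in the last bin
        bins[index].append((p, y))

    total = len(probs)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        mean_prob = sum(p for p, _ in bucket) / len(bucket)
        positive_rate = sum(y for _, y in bucket) / len(bucket)
        ece += (len(bucket) / total) * abs(positive_rate - mean_prob)
    return ece
```

A value near zero means that, for example, jobs predicted to start on time with probability 0.9 do so about 90% of the time.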

LLM Observability for Log Analysis

Schema-strict log intelligence with statistical validation

  • Calibration metrics, McNemar's test, bootstrap CIs
  • Confidence-gated alerting to reduce noise
  • Cost-aware inference tracking

Current Research Areas

  • ML/GPU Cluster Efficiency: under‑utilization detection, wait‑time risk modeling, right‑sizing
  • Observability & Reliability: calibration (ECE/Brier), confidence‑gated alerting
  • eBPF & Systems Telemetry: low‑overhead kernel/network tracing
  • Cost‑Aware Cloud: Pareto‑efficient infra, sustainable architectures

Talks

Talks & Community

Learning Behavioral Signals for ML Cluster Efficiency

Google NY SRE Tech Talks (Dec 16, 2025). Using standard cluster logs to predict under‑utilization and queue risk—toward proactive, data‑driven reliability.

SRE for Streaming AI: Building Resilient Platforms to Combat Model Drift

Data Streaming Summit (Virtual) 2025 — Applying SRE principles to detect and respond to model drift in real time across streaming architectures.

From Chaos to Clarity: Platform Engineering & Observability (KCD NYC)

KCD New York roundtable recap on correlating telemetry, reducing alert fatigue, and using IDPs and LLMs to improve signal and standardization.

Portfolio

Selected Impact

VGAC — Predictable Scheduling for GPU Clusters

Product and research initiative that learns cluster behavior (Slurm/Kubernetes) to predict job start times, improve utilization, and enable data‑driven capacity planning.

Results at a glance:

  • Reliable start‑time predictions (e.g., ±15 min) reduce “when will it run?” uncertainty
  • Higher experiment velocity with clear best‑time submission guidance
  • Improved GPU utilization via pattern‑aware scheduling insights
  • Faster capacity decisions with visibility into queue dynamics and bottlenecks

Proactive Model Quality Monitoring

Introduced a drift‑aware monitoring capability that reduces undetected model decay and accelerates response time with adaptive thresholds and real‑time dashboards.

Outcome: Faster incident detection; improved ML service reliability.

  • MTTR reduced for model incidents
  • Early‑warning signals on data drift
  • Dashboards for leadership visibility

Executive Observability: Log Intelligence

Delivered structured, confidence‑scored incident intelligence from raw logs to reduce triage time and elevate operational decision‑making.

Outcome: Shorter MTTR, clearer executive signal, and cost‑aware analysis.

  • Triage time reduced with structured alerts
  • Confidence‑gated notifications to limit noise
  • Cost/performance tracked for governance

Resilient Data Platform Foundation

Established a fault‑tolerant data substrate leveraging consistent hashing and replication for predictable scale and uptime under node churn.

Outcome: Higher availability and predictable performance at scale.

  • Graceful degradation under failures
  • Consistent performance as load grows
  • Operational runbooks and SLOs
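
As a rough illustration of the consistent hashing and replication idea behind this platform, the sketch below places nodes on a hash ring and picks replicas by walking clockwise from a key. It is not the production design: the `HashRing` class, the SHA-256 based hash, and the virtual-node count are assumptions chosen for a compact example.

```python
import bisect
import hashlib


class HashRing:
    """Toy consistent-hash ring with virtual nodes (illustrative only)."""

    def __init__(self, nodes, vnodes=64):
        # Each node appears `vnodes` times on the ring to even out the key share.
        self._ring = sorted(
            (self._hash(f"{node}#{i}"), node) for node in nodes for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    @staticmethod
    def _hash(value):
        # First 8 bytes of SHA-256 as an integer position on the ring.
        return int.from_bytes(hashlib.sha256(value.encode()).digest()[:8], "big")

    def replicas(self, key, count=2):
        """Walk clockwise from the key's position and collect `count` distinct nodes."""
        start = bisect.bisect(self._points, self._hash(key))
        found = []
        for offset in range(len(self._ring)):
            node = self._ring[(start + offset) % len(self._ring)][1]
            if node not in found:
                found.append(node)
                if len(found) == count:
                    break
        return found
```

The useful property under node churn is that removing a node only moves the keys that node owned; every other key keeps its primary, which is what keeps performance predictable while the cluster changes.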

Risk Forecasting Pipeline

Built an explainable risk‑scoring workflow to support credit decisions and reduce exposure through early warning signals.

Outcome: Better risk visibility; faster portfolio interventions.

  • Explainable scores for decisions
  • Earlier detection of high‑risk cases
  • Operational dashboards for stakeholders

Data Observability Modernization

Consolidated metrics, logs, and traces into actionable views with automated alerting and SLO‑aligned dashboards.

Outcome: Faster detection, clearer ownership, improved reliability.

Kustode — Production Multi-Tenant SaaS Platform

Co-founded and architected from zero to production: a distributed, event-driven platform handling real-time transaction processing with enterprise-grade reliability.

Architecture & Engineering:

  • 12+ Microservices with event-driven architecture, async messaging (SNS/SQS patterns), and clean API boundaries
  • Multi-Tenant Isolation: PostgreSQL row-level security, organization-scoped queries, zero cross-tenant data leakage
  • Real-Time Pipelines: Webhook ingestion, idempotent processing, state machines, automatic reconciliation
  • Resilience Patterns: Circuit breakers, retry with backoff, dead-letter queues, graceful degradation

Observability & SRE:

  • Distributed Tracing: OpenTelemetry instrumentation across HTTP, async tasks, and DB calls
  • Metrics & Dashboards: Prometheus + Grafana with latency histograms, error budgets, SLO tracking
  • Structured Logging: Correlation IDs, trace context, queryable in Jaeger

Results:

  • 99.9%+ availability with automated alerting and incident response
  • 60% MTTR reduction through end-to-end tracing and structured debugging
  • 40% p95 latency improvement via query optimization, connection pooling, N+1 elimination
  • 25% infra cost savings through right-sizing and efficient resource utilization

Stack: Python/FastAPI, React/Next.js, PostgreSQL, Redis, AWS (EC2, RDS, S3, SNS, CloudWatch), Docker, Nginx, Ansible, OpenTelemetry, Grafana, Prometheus, Jaeger

Retail Operations & Finance Suite

Delivered a cross‑platform solution for operations, reconciliation, and financial controls with built‑in resilience.

  • Cross‑platform access (desktop) with robust state management
  • Automated reconciliation and tax workflows to reduce manual effort
  • Hardened backup and export capabilities for continuity

Technologies: Electron, React, Redux, SQLite, Node.js, Material-UI

Unified Microservices Telemetry

Standardized tracing and metrics to create one source of truth across services, enabling faster diagnosis and capacity planning.

Outcome: Lower MTTR; better planning signals for scaling.

LLM & NLP Enablement

Operationalized model training/inference pipelines on GPUs with governance for reliability and cost.

Outcome: Faster iteration and stable deployment pathways.

Enterprise Cloud Migration

Led modernization to AWS with IaC and automated delivery, minimizing downtime and accelerating release cadence.

Outcome: Reduced risk, faster time‑to‑value, standardized operations.

HPC for Research Acceleration

Provisioned a scalable compute backbone with containerized workflows and batch scheduling to speed research cycles.

Outcome: Higher throughput on data‑intensive workloads.

Security & Compliance Hardening

Implemented policy‑driven security controls and monitoring for regulated environments with automated checks and audit trails.

Outcome: Reduced risk profile and improved compliance readiness.

Prompt-Engineering for Open-Source LLMs

Developed prompt-engineering techniques to improve the performance of open-source large language models (LLMs). Designed and iteratively refined prompts to improve accuracy on tasks such as text summarization, sentiment analysis, and information extraction.

Tools Used: Python, TensorFlow, PyTorch, OpenAI GPT, Docker.

Data Science Platform Enablement

Standardized notebook environments and governed execution paths to improve reproducibility and collaboration.

Outcome: Faster onboarding and consistent results across teams.

Parallel Compute on AWS

Enabled managed HPC capabilities (Batch/Slurm) for elastic, policy‑driven compute aligned to workload needs.

Outcome: Elastic scale with governance and cost control.

Serverless Modernization

Adopted serverless building blocks (Lambda, API, identity, CDN) with observability to reduce ops burden and time‑to‑market.

Outcome: Lower run costs and simplified operations.

Say Hello

Let's Connect

Whether it's about infrastructure challenges, research ideas, or just a good tech conversation — I'd love to hear from you.

Topics I Love to Chat About:

  • Platform Engineering & Cloud Architecture
  • ML Infrastructure & GPU Clusters
  • Observability & SRE
  • Open Source & Research Collaboration

I'm always open to interesting conversations about infrastructure, distributed systems, or research ideas. If you're working on something cool or just want to chat about tech, reach out!

Things I'm Currently Interested In:

  • GPU cluster optimization & scheduling
  • eBPF for low-overhead observability
  • ML infrastructure at scale
  • Cost-aware cloud architectures
  • Open source collaboration

Currently exploring GPU computing research, eBPF for observability, and sustainable cloud architectures. Always happy to exchange ideas!
