Introduction

Modern Infrastructure, Operational Clarity, and Confident Delivery | From cloud strategy to reliable platforms that ship at scale

Platform Engineering • SRE • Cloud Strategy • Observability • ML/GPU Enablement

I'm passionate about creating and optimizing the technological foundations that businesses rely on, turning complex infrastructure challenges into elegant, efficient solutions that scale seamlessly and perform reliably.

8+

Years of
Experience

30+

Projects completed across
different technologies

About

Summary

I transform infrastructure chaos into elegant, scalable solutions that work. With over seven years of experience architecting systems that power critical applications, I've learned that great infrastructure isn't about complexity—it's about building resilient foundations that scale effortlessly while your team sleeps soundly.

What I Do

I design and implement cloud-native infrastructures that balance reliability, scalability, and cost-efficiency. From orchestrating thousands of containers to building ML pipelines and predictive monitoring, I focus on systems that grow with your business—without ballooning your cloud bill.

Technical DNA

  • Cloud: AWS, GCP, deep serverless expertise
  • Orchestration: Kubernetes, Docker Swarm at scale
  • IaC: Terraform, automated deployments that eliminate human error
  • Languages: Python, Golang for robust automation and tooling
  • DevOps: Jenkins, GitOps, reliable CI/CD pipelines
  • Data & ML Infra: Platforms for complex ML workflows

Beyond the Code

  • Proactive observability that prevents user impact
  • 30–50% cloud cost reductions via intelligent resource management
  • Self-service platforms that enable developer velocity
  • Active open-source contribution and community leadership

Experience

Roles & Impact

May 2025 - Present

Founding Engineer (Platform & Infrastructure)

kustode

  • Co-founded the company and set the platform vision and execution roadmap for a multi-tenant healthcare billing product.
  • Owned technical strategy: defined microservices topology, tenancy model, security posture, and operational SLOs (99.9%+).
  • Built and led cross-functional execution (product, eng, ops), instituting delivery rituals, incident management, and runbooks.
  • Drove performance and cost governance: p95 latency down 40–70%; infra spend optimized 15–35% via right-sizing and policies.
  • Established observability and reliability programs (OpenTelemetry, Grafana, Jaeger) cutting MTTR by 50–70%.
  • Directed compliance and data protection (HIPAA-aligned), tenant isolation, WAF, audit trails, and auth hardening.

2013 - 2018

BSc in Computer Technology

Jomo Kenyatta University of Agriculture and Technology

Professional Certifications

View my verified achievements on

Credly Badges

This link leads to the Credly platform where I have listed all my professional certifications and badges that showcase my specialized skills.


Site Reliability Engineer

Sportserve

  • Ensured 99.9% uptime and 99.95% data integrity across high-performance sports data services.
  • Optimized BigQuery and streaming pipelines; reduced latency and cut infrastructure costs by ~30%.
  • Led incident response and postmortems; improved detection and response time via enriched alerting.
  • Drove observability strategy (GCP Monitoring, Prometheus, OpenTelemetry) across services.

Research Software Engineer

EcoHealth Alliance

  • Architected AWS serverless platforms; reduced operational costs by ~30% while improving scalability.
  • Deployed/optimized LLMs on localized GPU clusters; improved training efficiency and model stability.
  • Automated deployments with Terraform and CI/CD; cut manual work by ~40% and improved reproducibility.
  • Enhanced HPC workflows on Docker/Kubernetes; reduced analysis bottlenecks for scientific workloads.

Site Reliability Engineer

Sport Server

  • Managed large-scale, high-availability sports server infrastructure on Google Cloud Platform, maintaining 99.9% uptime and 99.95% data integrity.
  • Oversaw the full service lifecycle, from design to continuous improvement, ensuring robust and scalable architecture.
  • Integrated Google Cloud tools to enhance data observability and enable real-time quality checks.

DevOps Tech Lead

Sarami · Nairobi, Kenya

  • Migrated to containerized microservices on AWS; cut deployment time by ~60% and improved uptime.
  • Built GitLab CI/CD + Terraform + Ansible; moved releases from weekly to daily.
  • Implemented Prometheus/Grafana with alerting; improved MTTR by ~40%.

System Engineer

ICIPE - International Centre of Insect Physiology and Ecology

  • Built scalable AWS and on-prem HPC platforms; integrated Slurm for parallel computing.
  • Implemented monitoring (Prometheus/Grafana); achieved ~90% uptime and reduced failures.
  • Streamlined pipelines with Docker and Terraform; cut workflow execution times by ~50%.

Data Systems Consultant

ICIPE - International Centre of Insect Physiology and Ecology

  • Optimized data processes for scalability, increasing the efficiency of data operations for research teams.
  • Developed analytical tools to generate actionable insights, supporting strategic decision-making.
  • Integrated and built data analytics platforms tailored for entomological research.
  • Designed machine learning models for real-time monitoring of insect populations, improving research accuracy.

DevOps Engineer

ICIPE - International Centre of Insect Physiology and Ecology

  • Architected and deployed comprehensive CI/CD pipelines using Jenkins and CircleCI, enabling real-time monitoring and cloud integration.
  • Developed and managed Docker and Singularity containers, orchestrating deployments across Kubernetes clusters for scalable infrastructure.
  • Administered high-performance computing (HPC) environments, supporting complex data processing tasks.
  • Optimized batch queuing systems in a massively parallel production setting, increasing processing efficiency and throughput.
  • Conducted system utilization analysis to ensure over 90% uptime and robust system health.
  • Led infrastructure redesign initiatives, automating core processes and enhancing scalability, reducing operational overhead.
  • Constructed advanced SQL-based infrastructure for data extraction and loading, developing analytical tools to transform data into actionable intelligence.
  • Collaborated with executive, product, data, and design teams to address infrastructure needs and technical challenges.
  • Implemented and continually refined data management procedures, improving data system functionality for ongoing research and operations.

Capabilities

Platform Enablement

Cloud Strategy & Scale

Design and govern cloud foundations (AWS/GCP) aligned to Well‑Architected principles for reliability, performance, and cost control.

Multiple Projects

Container Platforms

Operate resilient Kubernetes/container platforms with policy, security, and golden paths for application teams.

Multiple Projects

Observability & SLOs

End‑to‑end telemetry (logs/metrics/traces) and SLO programs for clear ownership, faster diagnosis, and proactive reliability.

Multiple Projects

Data Platforms

Build data foundations for analytics/ML with governance, lineage, and scalable pipelines.

Multiple Projects

Delivery Automation

Standardize CI/CD with security and compliance built‑in to accelerate releases safely.

Multiple Projects

Distributed & HPC

Enable high‑throughput compute and storage for intensive and scientific workloads.

Multiple Projects

ML & GPU Enablement

Operationalize training/inference on GPUs with capacity planning and cost governance.

Multiple Projects

Infrastructure as Code

Codify infrastructure and policy for repeatability, auditability, and speed.

Multiple Projects

Capabilities

Technical Foundations

GPU Computing

GPU & AI Infrastructure

  • GPU Resource Management, CUDA, Performance Optimization
  • Kubernetes GPU Operator, Resource Scheduling
  • Machine Learning Infrastructure, Model Deployment

Programming Languages

Automation Languages

  • Python, Java, Golang, JavaScript, R, Rust

Containerization and Orchestration

Container & Orchestration Platforms

  • Docker, Singularity, Kubernetes, Docker Swarm

Infrastructure as Code

  • Terraform, Ansible, Helm

Cloud Infrastructure

Cloud Foundations

  • AWS, GCP, Kubernetes, Terraform
  • Infrastructure as Code, GitOps
  • High Availability Architecture

CI/CD Tools

Delivery Tooling

  • Jenkins, CircleCI, GitHub Actions

Monitoring and Logging

Observability Stack

  • Grafana, Prometheus, ELK Stack, OpenTelemetry, Honeycomb

Distributed Systems

  • Distributed Computing, Scalable Architecture
  • Message Queues, Event-Driven Systems
  • Consensus Protocols, Eventual Consistency

Database Management

Data Stores

  • PostgreSQL, MongoDB, MySQL, Cassandra

Server and Version Control Management

Compute & Version Control

  • Linux, Windows, GitHub, GitLab

Project Management Tools

Delivery & Collaboration

  • Asana, Jira, Confluence, Opsgenie

Research

Current Research Areas

  • Observability & Reliability: calibration (ECE/Brier), confidence‑gated alerting, risk–coverage.
  • ML/GPUs & Cluster Efficiency: under‑utilization detection, wait‑time risk modeling, right‑sizing.
  • eBPF & Systems Telemetry: low‑overhead kernel/network tracing for performance insights.
  • LLMs for Ops: schema‑strict log intelligence, structured outputs, cost‑aware models.
  • Cost‑Aware Cloud: Pareto‑efficient infra, spend governance, sustainable architectures.
  • Reproducibility & Evaluation: CIs, paired tests (McNemar), calibration dashboards.
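
As a rough illustration of the calibration metrics named in the first bullet (ECE and Brier score) and of confidence-gated alerting, here is a minimal Python sketch for binary predictions. The function names, the ten equal-width bins, and the 0.8 alert threshold are assumptions chosen for illustration; this is not the evaluation code behind the research itself.

```python
from typing import Sequence


def brier_score(probs: Sequence[float], labels: Sequence[int]) -> float:
    """Mean squared error between predicted probabilities and 0/1 outcomes.

    Assumes non-empty inputs of equal length.
    """
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(probs)


def expected_calibration_error(
    probs: Sequence[float], labels: Sequence[int], n_bins: int = 10
) -> float:
    """Binary ECE: size-weighted gap between the mean predicted probability
    and the observed positive rate within each equal-width bin."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[idx].append((p, y))

    total = len(probs)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_p = sum(p for p, _ in members) / len(members)
        pos_rate = sum(y for _, y in members) / len(members)
        ece += (len(members) / total) * abs(mean_p - pos_rate)
    return ece


def should_alert(prob: float, threshold: float = 0.8) -> bool:
    """Confidence gate (hypothetical threshold): alert only on confident scores."""
    return prob >= threshold
```

A gate like `should_alert` is only as trustworthy as the calibration of the scores feeding it, which is why the two ideas are usually evaluated together.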

Work in progress — public artifacts and write‑ups will be shared here as they mature.

Portfolio

Selected Impact

Proactive Model Quality Monitoring

Introduced a drift‑aware monitoring capability that reduces undetected model decay and accelerates response time with adaptive thresholds and real‑time dashboards.

Outcome: Faster incident detection; improved ML service reliability.

  • MTTR reduced for model incidents
  • Early‑warning signals on data drift
  • Dashboards for leadership visibility

Executive Observability: Log Intelligence

Delivered structured, confidence‑scored incident intelligence from raw logs to reduce triage time and elevate operational decision‑making.

Outcome: Shorter MTTR, clearer executive signal, and cost‑aware analysis.

  • Triage time reduced with structured alerts
  • Confidence‑gated notifications to limit noise
  • Cost/performance tracked for governance

Resilient Data Platform Foundation

Established a fault‑tolerant data substrate leveraging consistent hashing and replication for predictable scale and uptime under node churn.

Outcome: Higher availability and predictable performance at scale.

  • Graceful degradation under failures
  • Consistent performance as load grows
  • Operational runbooks and SLOs
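
For readers unfamiliar with the technique, consistent hashing with replication can be sketched roughly as below. This is a minimal illustration, not the platform's actual implementation: the `HashRing` class, the `vnodes` parameter, the use of SHA-256, and the node names are all assumptions made for the example.

```python
import bisect
import hashlib


class HashRing:
    """Toy consistent-hash ring with virtual nodes (hypothetical sketch)."""

    def __init__(self, nodes=(), vnodes: int = 64):
        self.vnodes = vnodes
        self._ring = []  # sorted list of (hash, node) pairs
        for node in nodes:
            self.add(node)

    @staticmethod
    def _hash(key: str) -> int:
        # First 8 bytes of SHA-256 give a stable 64-bit position on the ring.
        return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")

    def add(self, node: str) -> None:
        for i in range(self.vnodes):
            bisect.insort(self._ring, (self._hash(f"{node}#{i}"), node))

    def remove(self, node: str) -> None:
        self._ring = [(h, n) for h, n in self._ring if n != node]

    def replicas(self, key: str, count: int = 2) -> list:
        """Walk clockwise from the key's position, collecting distinct nodes."""
        if not self._ring:
            return []
        distinct = {n for _, n in self._ring}
        wanted = min(count, len(distinct))
        i = bisect.bisect_left(self._ring, (self._hash(key), ""))
        owners = []
        while len(owners) < wanted:
            node = self._ring[i % len(self._ring)][1]
            if node not in owners:
                owners.append(node)
            i += 1
        return owners


ring = HashRing(["node-a", "node-b", "node-c"])
primary_before = ring.replicas("tenant-42")[0]
ring.remove("node-c")  # simulate node churn
primary_after = ring.replicas("tenant-42")[0]
# Unless node-c was the primary, the key keeps the same primary owner.
```

The property that matters under node churn is that removing a node only remaps the keys that node owned; every other key keeps its primary, which is what keeps performance predictable while the cluster changes shape.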

Risk Forecasting Pipeline

Built an explainable risk‑scoring workflow to support credit decisions and reduce exposure through early warning signals.

Outcome: Better risk visibility; faster portfolio interventions.

  • Explainable scores for decisions
  • Earlier detection of high‑risk cases
  • Operational dashboards for stakeholders

Data Observability Modernization

Consolidated metrics, logs, and traces into actionable views with automated alerting and SLO‑aligned dashboards.

Outcome: Faster detection, clearer ownership, improved reliability.

Healthcare Operations Platform

Orchestrated an end‑to‑end, compliant clinical workflow from referral to payment with embedded controls and auditability.

  • Automated referral, verification, and authorization to reduce cycle time
  • Compliance by design with real‑time checkpoints and audit trails
  • Proactive alerts to prevent SLA breaches and missed deadlines
  • Integrated billing and claims for straight‑through processing

Technologies: React, Node.js, PostgreSQL, Redis, Docker, AWS Services

Retail Operations & Finance Suite

Delivered a cross‑platform solution for operations, reconciliation, and financial controls with built‑in resilience.

  • Cross‑platform access (desktop) with robust state management
  • Automated reconciliation and tax workflows to reduce manual effort
  • Hardened backup and export capabilities for continuity

Technologies: Electron, React, Redux, SQLite, Node.js, Material-UI

Unified Microservices Telemetry

Standardized tracing and metrics to create one source of truth across services, enabling faster diagnosis and capacity planning.

Outcome: Lower MTTR; better planning signals for scaling.

LLM & NLP Enablement

Operationalized model training/inference pipelines on GPUs with governance for reliability and cost.

Outcome: Faster iteration and stable deployment pathways.

Enterprise Cloud Migration

Led modernization to AWS with IaC and automated delivery, minimizing downtime and accelerating release cadence.

Outcome: Reduced risk, faster time‑to‑value, standardized operations.

HPC for Research Acceleration

Provisioned a scalable compute backbone with containerized workflows and batch scheduling to speed research cycles.

Outcome: Higher throughput on data‑intensive workloads.

Security & Compliance Hardening

Implemented policy‑driven security controls and monitoring for regulated environments with automated checks and audit trails.

Outcome: Reduced risk profile and improved compliance readiness.

Prompt-Engineering for Open-Source LLMs

Developed prompt-engineering techniques for enhancing the performance of open-source large language models (LLMs). Implemented and fine-tuned prompts to improve the accuracy and applicability of LLMs in various use cases, such as text summarization, sentiment analysis, and information extraction.

Tools Used: Python, TensorFlow, PyTorch, OpenAI GPT, Docker.

Data Science Platform Enablement

Standardized notebook environments and governed execution paths to improve reproducibility and collaboration.

Outcome: Faster onboarding and consistent results across teams.

Parallel Compute on AWS

Enabled managed HPC capabilities (Batch/Slurm) for elastic, policy‑driven compute aligned to workload needs.

Outcome: Elastic scale with governance and cost control.

Serverless Modernization

Adopted serverless building blocks (Lambda, API, identity, CDN) with observability to reduce ops burden and time‑to‑market.

Outcome: Lower run costs and simplified operations.

Let's Connect

Ready to Build Scalable Solutions?

From Infrastructure to Innovation: Let's Transform Your Vision

Specialized In:

  • Cloud Architecture & Infrastructure
  • AI/ML Systems & GPU Computing
  • Distributed Systems & Scalability
  • DevOps & Site Reliability Engineering

Whether you're scaling your infrastructure, optimizing performance, or implementing cutting-edge AI solutions, I bring the expertise to make it happen. Let's discuss how we can achieve your technical goals while ensuring reliability, scalability, and innovation.

Let's Collaborate On:

  • Building Resilient Cloud Infrastructure
  • Scaling Distributed Systems
  • Optimizing AI/ML Platforms
  • Enhancing System Performance
  • Implementing Site Reliability Best Practices
  • Designing High-Performance Architecture

Let's connect on scaling, migrations, ML infra, or observability—currently exploring GPU computing, eBPF for observability, and sustainable cloud architectures.

Schedule a Consultation