Introduce
Modern Infrastructure, Operational Clarity, and Confident Delivery | From cloud strategy to reliable platforms that ship at scale
Platform Engineering • SRE • Cloud Strategy • Observability • ML/GPU Enablement
I'm passionate about creating and optimizing the technological foundations that businesses rely on, turning complex infrastructure challenges into elegant, efficient solutions that scale seamlessly and perform reliably.
8+ Years of Experience
30+ projects completed on different technologies
About
Summary
I build and operate infrastructure for complex systems—from GPU clusters and ML platforms to distributed microservices and HPC environments. With 8+ years of hands-on production experience, I combine platform engineering with research in reliability and cluster optimization to build systems that scale effortlessly.
What I Do
I architect cloud-native platforms that balance reliability, performance, and cost. From orchestrating Kubernetes at scale to building ML training infrastructure and observability stacks, I focus on systems that grow with your business—without ballooning your cloud bill or waking your on-call.
Technical DNA
- Cloud: AWS, GCP, deep serverless expertise
- Orchestration: Kubernetes, Docker Swarm at scale
- IaC: Terraform, automated deployments that eliminate human error
- Languages: Python, Golang for robust automation and tooling
- DevOps: Jenkins, GitOps, reliable CI/CD pipelines
- Data & ML Infra: Platforms for complex ML workflows
Beyond the Code
- Proactive observability that prevents user impact
- 30–50% cloud cost reductions via intelligent resource management
- Self-service platforms that enable developer velocity
- Active open-source contribution and community leadership
Experience
Roles & Impact
Co-Founder & Founding Engineer
Built a production-grade, multi-tenant SaaS platform from zero — owning architecture, infrastructure, and delivery.
Distributed Systems & Architecture
- Designed event-driven microservices architecture with 12+ services, async messaging (SNS/SQS patterns), and clean API contracts.
- Implemented multi-tenant isolation at the database layer using PostgreSQL row-level security and foreign key constraints, with zero cross-tenant data leakage.
- Built real-time data pipelines processing external EDI transactions with webhook ingestion, idempotent processing, and automatic state reconciliation.
- Designed resilient integration patterns: retry with exponential backoff, circuit breakers, dead-letter queues, and graceful degradation.
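As a minimal sketch of the retry-with-exponential-backoff pattern named above (illustrative only, not the production code; `retry_with_backoff` and all of its parameters are hypothetical names chosen for this example):

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    operation: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 0.5,
    max_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `operation`, retrying on any exception with capped backoff and full jitter."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts:
                # Out of attempts: re-raise so the caller can route the
                # message to a dead-letter queue or degrade gracefully.
                raise
            # Delay ceiling doubles each attempt: base, 2*base, 4*base, ...
            ceiling = min(max_delay, base_delay * 2 ** (attempt - 1))
            # Full jitter: sleep a random time in [0, ceiling] so many
            # clients do not retry in lockstep.
            sleep(random.uniform(0, ceiling))
    raise AssertionError("unreachable")
```

The `sleep` argument is injectable so the delay schedule can be checked in tests without actually waiting.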
Observability & SRE
- Instrumented end-to-end distributed tracing with OpenTelemetry — trace context propagation across HTTP, async tasks, and database calls.
- Built Grafana dashboards with Prometheus metrics: service health, latency histograms, error budgets, and business KPIs.
- Established SLO framework (99.9% availability, p95 latency targets) with automated alerting and on-call runbooks.
- Reduced MTTR by 60% through structured logging (structlog), correlation IDs, and queryable traces in Jaeger.
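For context on the SLO figures above, a 99.9% availability target over a 30-day window leaves roughly 43.2 minutes of error budget. The arithmetic can be sketched as follows (function names are hypothetical, and request-based budgeting is an assumption for illustration, not a description of the actual alerting setup):

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of allowed unavailability in the window for a given SLO."""
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be strictly between 0 and 1")
    return (1.0 - slo) * window_days * 24 * 60


def budget_remaining(slo: float, good_events: int, total_events: int) -> float:
    """Fraction of the request-based error budget still unspent.

    1.0 means no budget used, 0.0 means fully spent, negative means overspent.
    """
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be strictly between 0 and 1")
    if total_events <= 0:
        raise ValueError("total_events must be positive")
    allowed_failures = (1.0 - slo) * total_events
    actual_failures = total_events - good_events
    return 1.0 - actual_failures / allowed_failures
```

With a 99.9% SLO, 500 failed requests out of 1,000,000 would consume about half of the budget.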
Security Engineering
- Implemented zero-trust auth: JWT with refresh token rotation, RBAC with permission-based access control, and session management.
- Built defense-in-depth: WAF rules, rate limiting, input validation, SQL injection prevention, and audit logging.
- Automated security scanning in CI/CD: static analysis (Bandit), secret detection (Gitleaks), dependency audits.
- Designed secrets management with Ansible Vault — no plaintext credentials in repos or containers.
Infrastructure & Platform
- Provisioned AWS infrastructure (EC2, RDS, S3, CloudWatch, SNS) with Ansible playbooks for reproducible multi-environment deployments.
- Built CI/CD pipelines: GitHub Actions with parallel testing, lint gates, security checks, and blue-green deployments.
- Achieved 40% p95 latency reduction via query optimization, connection pooling (asyncpg), and N+1 elimination.
- Delivered 25% infrastructure cost savings through right-sizing, reserved capacity, and efficient resource utilization.
Stack: Python/FastAPI, React/Next.js, PostgreSQL, Redis, AWS, Docker, Nginx, Ansible, OpenTelemetry, Grafana, Prometheus, Jaeger
Site Reliability Engineer
Sportserve — Real-Time Sports Data Platform
Owned reliability for high-throughput streaming infrastructure processing millions of events/day on GCP.
Reliability & Performance
- Maintained 99.9% uptime and 99.95% data integrity across distributed streaming services handling peak loads of 50K+ events/sec.
- Optimized BigQuery pipelines and Pub/Sub consumers — reduced query costs by 30% and p99 latency by 40%.
- Implemented capacity planning models using historical traffic patterns to prevent over-provisioning.
Observability & Incident Response
- Built unified observability stack: GCP Monitoring, Prometheus, OpenTelemetry with custom dashboards for real-time anomaly detection.
- Led incident response and blameless postmortems — reduced repeat incidents by 50% through systematic action items.
- Implemented SLO-based alerting with error budgets — cut alert noise by 60% while improving signal quality.
Stack: GCP (BigQuery, Pub/Sub, GKE, Cloud Functions), Kubernetes, Prometheus, Grafana, OpenTelemetry, Python, Go
Research Software Engineer
EcoHealth Alliance — Scientific Computing & ML Infrastructure
Built ML infrastructure and GPU compute platforms for research workloads.
ML & GPU Infrastructure
- Deployed and optimized LLMs on GPU clusters — improved training throughput by 35% through batch tuning and mixed-precision training.
- Built model serving infrastructure with auto-scaling, health checks, and A/B deployment capabilities.
- Implemented GPU resource scheduling with fair-share policies and utilization monitoring.
Cloud & Automation
- Architected AWS serverless platform (Lambda, Step Functions, S3) — reduced operational costs by 30%.
- Automated infrastructure with Terraform + GitOps — cut provisioning time from days to minutes.
- Built containerized HPC workflows on Kubernetes — improved reproducibility and reduced analysis bottlenecks by 50%.
Stack: AWS (Lambda, EC2, S3, SageMaker), Kubernetes, Docker, Terraform, Python, PyTorch, CUDA
DevOps Tech Lead
Sarami — Fintech Platform
Led platform modernization from monolith to microservices.
Platform Modernization
- Led monolith-to-microservices migration on AWS — reduced deployment time by 60% and improved fault isolation.
- Built GitOps CI/CD pipeline (GitLab CI + Terraform + Ansible) — accelerated releases from weekly to daily deployments.
- Implemented infrastructure as code with versioned, peer-reviewed changes and automated rollbacks.
Observability
- Deployed Prometheus + Grafana stack with custom alerting rules — reduced MTTR by 40%.
- Implemented centralized logging (ELK) with structured formats and correlation IDs.
Stack: AWS (ECS, RDS, ElastiCache), Docker, Terraform, Ansible, GitLab CI, Prometheus, Grafana, ELK
Platform Engineer / DevOps Engineer
ICIPE — High-Performance Computing & Data Platforms
Built and operated HPC infrastructure for compute-intensive scientific workloads.
HPC & Compute Infrastructure
- Architected hybrid HPC platform (AWS + on-prem) with Slurm scheduler — scaled to 500+ concurrent jobs.
- Optimized batch processing pipelines — reduced workflow execution time by 50% through parallelization and resource tuning.
- Implemented container orchestration (Docker, Singularity, Kubernetes) for reproducible scientific workflows.
CI/CD & Automation
- Built CI/CD pipelines (Jenkins, CircleCI) with automated testing, container builds, and deployment gates.
- Implemented Prometheus/Grafana monitoring — achieved 90%+ uptime with proactive alerting.
- Automated infrastructure provisioning with Terraform — reduced setup time from weeks to hours.
Data Engineering
- Built ETL pipelines processing TB-scale datasets with SQL-based transformations and data quality checks.
- Designed ML pipelines for real-time data analysis with model versioning and experiment tracking.
Stack: AWS, Slurm, Kubernetes, Docker, Singularity, Jenkins, Terraform, Prometheus, Grafana, PostgreSQL, Python
Education
BSc in Computer Technology
Jomo Kenyatta University of Agriculture and Technology
Certifications
Expertise
Technical Capabilities
Cloud Strategy & Scale
Design and govern cloud foundations (AWS/GCP) aligned to Well‑Architected principles for reliability, performance, and cost control.
Multiple Projects
Container Platforms
Operate resilient Kubernetes/container platforms with policy, security, and golden paths for application teams.
Multiple Projects
Observability & SLOs
End‑to‑end telemetry (logs/metrics/traces) and SLO programs for clear ownership, faster diagnosis, and proactive reliability.
Multiple Projects
Data Platforms
Build data foundations for analytics/ML with governance, lineage, and scalable pipelines.
Multiple Projects
Delivery Automation
Standardize CI/CD with security and compliance built‑in to accelerate releases safely.
Multiple Projects
Distributed & HPC
Enable high‑throughput compute and storage for intensive and scientific workloads.
Multiple Projects
ML & GPU Enablement
Operationalize training/inference on GPUs with capacity planning and cost governance.
Multiple Projects
Infrastructure as Code
Codify infrastructure and policy for repeatability, auditability, and speed.
Multiple Projects
Capabilities
Technical Foundations
GPU & AI Infrastructure
- GPU Resource Management, CUDA, Performance Optimization
- Kubernetes GPU Operator, Resource Scheduling
- Machine Learning Infrastructure, Model Deployment
Automation Languages
- Python, Java, Golang, JavaScript, R, Rust
Container & Orchestration Platforms
- Docker, Singularity, Kubernetes, Docker Swarm
Infrastructure as Code
- Terraform, Ansible, Helm
Cloud Foundations
- AWS, GCP, Kubernetes, Terraform
- Infrastructure as Code, GitOps
- High Availability Architecture
Delivery Tooling
- Jenkins, CircleCI, GitHub Actions
Observability Stack
- Grafana, Prometheus, ELK Stack, OpenTelemetry, Honeycomb
Distributed Systems
- Distributed Computing, Scalable Architecture
- Message Queues, Event-Driven Systems
- Consensus Protocols, Eventual Consistency
Data Stores
- PostgreSQL, MongoDB, MySQL, Cassandra
Compute & Version Control
- Linux, Windows, GitHub, GitLab
Delivery & Collaboration
- Asana, Jira, Confluence, Opsgenie
Research
Research & Open Source
Completed Work
VGAC — Predictable GPU Scheduling
Production deployment on EKS, tracking 12,000+ job events
- AUROC: 0.969, ECE: 0.005 — well-calibrated predictions
- Predicts job start times for GPU clusters (Slurm/Kubernetes)
- Live at vgac.cloud
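For readers unfamiliar with the ECE figure quoted above, here is a minimal sketch of one common way to compute it: the binary-classification form with equal-width bins. This is an illustration under those assumptions, not necessarily the exact estimator used in VGAC; the function name and the 10-bin default are hypothetical.

```python
from typing import List, Sequence, Tuple


def expected_calibration_error(
    probs: Sequence[float], labels: Sequence[int], n_bins: int = 10
) -> float:
    """ECE for binary predictions: the weighted average, over equal-width
    probability bins, of |mean predicted probability - observed positive rate|."""
    if len(probs) != len(labels) or len(probs) == 0:
        raise ValueError("probs and labels must be non-empty and the same length")
    bins: List[List[Tuple[float, int]]] = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        # A probability of exactly 1.0 is placed in the last bin.
        idx = min(int(p * n_bins), n_bins - 1)
        bins[idx].append((p, y))
    total = len(probs)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        mean_prob = sum(p for p, _ in bucket) / len(bucket)
        positive_rate = sum(y for _, y in bucket) / len(bucket)
        # Each bin contributes in proportion to the share of samples it holds.
        ece += (len(bucket) / total) * abs(mean_prob - positive_rate)
    return ece
```

A value near zero means predicted probabilities track observed frequencies closely; for example, predictions of 0.9 that come true 9 times out of 10 contribute almost nothing.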
LLM Observability for Log Analysis
Schema-strict log intelligence with statistical validation
- Calibration metrics, McNemar's test, bootstrap CIs
- Confidence-gated alerting to reduce noise
- Cost-aware inference tracking
Current Research Areas
- ML/GPU Cluster Efficiency: under‑utilization detection, wait‑time risk modeling, right‑sizing
- Observability & Reliability: calibration (ECE/Brier), confidence‑gated alerting
- eBPF & Systems Telemetry: low‑overhead kernel/network tracing
- Cost‑Aware Cloud: Pareto‑efficient infra, sustainable architectures
Open Source
- llm-observability — LLM-powered log analysis with calibration
- Calibrated-Queue-Delay-Prediction — GPU queue wait-time forecasting
- k8s-debug-tui — Terminal UI for Kubernetes debugging
- gpu-ml-framework — GPU cluster ML infrastructure
Talks
Talks & Community
Learning Behavioral Signals for ML Cluster Efficiency
Google NY SRE Tech Talks (Dec 16, 2025). Using standard cluster logs to predict under‑utilization and queue risk—toward proactive, data‑driven reliability.
SRE for Streaming AI: Building Resilient Platforms to Combat Model Drift
Data Streaming Summit (Virtual) 2025 — Applying SRE principles to detect and respond to model drift in real time across streaming architectures.
From Chaos to Clarity: Platform Engineering & Observability (KCD NYC)
KCD New York roundtable recap on correlating telemetry, reducing alert fatigue, and using IDPs and LLMs to improve signal and standardization.
Portfolio
Selected Impact
VGAC — Predictable Scheduling for GPU Clusters
Product and research initiative that learns cluster behavior (Slurm/Kubernetes) to predict job start times, improve utilization, and enable data‑driven capacity planning.
Results at a glance:
- Reliable start‑time predictions (e.g., ±15 min) reduce “when will it run?” uncertainty
- Higher experiment velocity with clear best‑time submission guidance
- Improved GPU utilization via pattern‑aware scheduling insights
- Faster capacity decisions with visibility into queue dynamics and bottlenecks
Proactive Model Quality Monitoring
Introduced a drift‑aware monitoring capability that reduces undetected model decay and accelerates response time with adaptive thresholds and real‑time dashboards.
Outcome: Faster incident detection; improved ML service reliability.
- MTTR reduced for model incidents
- Early‑warning signals on data drift
- Dashboards for leadership visibility
Executive Observability: Log Intelligence
Delivered structured, confidence‑scored incident intelligence from raw logs to reduce triage time and elevate operational decision‑making.
Outcome: Shorter MTTR, clearer executive signal, and cost‑aware analysis.
- Triage time reduced with structured alerts
- Confidence‑gated notifications to limit noise
- Cost/performance tracked for governance
Resilient Data Platform Foundation
Established a fault‑tolerant data substrate leveraging consistent hashing and replication for predictable scale and uptime under node churn.
Outcome: Higher availability and predictable performance at scale.
- Graceful degradation under failures
- Consistent performance at growth
- Operational runbooks and SLOs
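The consistent hashing and replication idea mentioned above can be sketched as a small hash ring with virtual nodes. This is a teaching example only; the `HashRing` class, its method names, and the choice of 64 virtual nodes are assumptions for illustration, not the platform's actual implementation.

```python
import bisect
import hashlib
from typing import Iterable, List, Tuple


class HashRing:
    """Consistent-hash ring with virtual nodes and replica lookup."""

    def __init__(self, nodes: Iterable[str] = (), vnodes: int = 64) -> None:
        self._vnodes = vnodes
        self._ring: List[Tuple[int, str]] = []  # kept sorted by hash position
        for node in nodes:
            self.add(node)

    @staticmethod
    def _hash(key: str) -> int:
        # First 8 bytes of SHA-256 give a stable 64-bit ring position.
        return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")

    def add(self, node: str) -> None:
        # Each physical node owns many points on the ring to even out load.
        for i in range(self._vnodes):
            bisect.insort(self._ring, (self._hash(f"{node}#{i}"), node))

    def remove(self, node: str) -> None:
        self._ring = [(h, n) for h, n in self._ring if n != node]

    def replicas(self, key: str, count: int = 2) -> List[str]:
        """Return up to `count` distinct nodes, walking clockwise from the key."""
        if not self._ring:
            return []
        start = bisect.bisect_right(self._ring, (self._hash(key), ""))
        found: List[str] = []
        for offset in range(len(self._ring)):
            node = self._ring[(start + offset) % len(self._ring)][1]
            if node not in found:
                found.append(node)
                if len(found) == count:
                    break
        return found
```

When a node leaves the ring, only keys whose primary was that node move, and they move to what was previously their next replica, which is what keeps behavior predictable under node churn.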
Risk Forecasting Pipeline
Built an explainable risk‑scoring workflow to support credit decisions and reduce exposure through early warning signals.
Outcome: Better risk visibility; faster portfolio interventions.
- Explainable scores for decisions
- Earlier detection of high‑risk cases
- Operational dashboards for stakeholders
Data Observability Modernization
Consolidated metrics, logs, and traces into actionable views with automated alerting and SLO‑aligned dashboards.
Outcome: Faster detection, clearer ownership, improved reliability.
Kustode — Production Multi-Tenant SaaS Platform
Co-founded and architected, from zero to production, a distributed, event-driven platform handling real-time transaction processing with enterprise-grade reliability.
Architecture & Engineering:
- 12+ Microservices with event-driven architecture, async messaging (SNS/SQS patterns), and clean API boundaries
- Multi-Tenant Isolation: PostgreSQL row-level security, organization-scoped queries, zero cross-tenant data leakage
- Real-Time Pipelines: Webhook ingestion, idempotent processing, state machines, automatic reconciliation
- Resilience Patterns: Circuit breakers, retry with backoff, dead-letter queues, graceful degradation
Observability & SRE:
- Distributed Tracing: OpenTelemetry instrumentation across HTTP, async tasks, and DB calls
- Metrics & Dashboards: Prometheus + Grafana with latency histograms, error budgets, SLO tracking
- Structured Logging: Correlation IDs, trace context, queryable in Jaeger
Results:
- 99.9%+ availability with automated alerting and incident response
- 60% MTTR reduction through end-to-end tracing and structured debugging
- 40% p95 latency improvement via query optimization, connection pooling, N+1 elimination
- 25% infra cost savings through right-sizing and efficient resource utilization
Stack: Python/FastAPI, React/Next.js, PostgreSQL, Redis, AWS (EC2, RDS, S3, SNS, CloudWatch), Docker, Nginx, Ansible, OpenTelemetry, Grafana, Prometheus, Jaeger
Retail Operations & Finance Suite
Delivered a cross‑platform solution for operations, reconciliation, and financial controls with built‑in resilience.
- Cross‑platform access (desktop) with robust state management
- Automated reconciliation and tax workflows to reduce manual effort
- Hardened backup and export capabilities for continuity
Technologies: Electron, React, Redux, SQLite, Node.js, Material-UI
Unified Microservices Telemetry
Standardized tracing and metrics to create one source of truth across services, enabling faster diagnosis and capacity planning.
Outcome: Lower MTTR; better planning signals for scaling.
LLM & NLP Enablement
Operationalized model training/inference pipelines on GPUs with governance for reliability and cost.
Outcome: Faster iteration and stable deployment pathways.
Enterprise Cloud Migration
Led modernization to AWS with IaC and automated delivery, minimizing downtime and accelerating release cadence.
Outcome: Reduced risk, faster time‑to‑value, standardized operations.
HPC for Research Acceleration
Provisioned a scalable compute backbone with containerized workflows and batch scheduling to speed research cycles.
Outcome: Higher throughput on data‑intensive workloads.
Security & Compliance Hardening
Implemented policy‑driven security controls and monitoring for regulated environments with automated checks and audit trails.
Outcome: Reduced risk profile and improved compliance readiness.
Prompt-Engineering for Open-Source LLMs
Developed prompt-engineering techniques to improve the performance of open-source large language models (LLMs). Designed and iteratively refined prompts to improve accuracy and applicability across use cases such as text summarization, sentiment analysis, and information extraction.
Tools Used: Python, TensorFlow, PyTorch, OpenAI GPT, Docker.
Parallel Compute on AWS
Enabled managed HPC capabilities (Batch/Slurm) for elastic, policy‑driven compute aligned to workload needs.
Outcome: Elastic scale with governance and cost control.
Serverless Modernization
Adopted serverless building blocks (Lambda, API, identity, CDN) with observability to reduce ops burden and time‑to‑market.
Outcome: Lower run costs and simplified operations.
Say Hello
Let's Connect
Whether it's about infrastructure challenges, research ideas, or just a good tech conversation — I'd love to hear from you.
Topics I Love to Chat About:
- Platform Engineering & Cloud Architecture
- ML Infrastructure & GPU Clusters
- Observability & SRE
- Open Source & Research Collaboration
I'm always open to interesting conversations about infrastructure, distributed systems, or research ideas. If you're working on something cool or just want to chat about tech, reach out!
Things I'm Currently Interested In:
- GPU cluster optimization & scheduling
- eBPF for low-overhead observability
- ML infrastructure at scale
- Cost-aware cloud architectures
- Open source collaboration
Currently exploring GPU computing research, eBPF for observability, and sustainable cloud architectures. Always happy to exchange ideas!
Let's Chat