About
- Nine years building and operating production Kubernetes clusters serving 50K+ rps across three cloud providers.
- Built the platform that lets 200 engineers deploy to production 80 times per day with zero-downtime rollouts.
- Maintain four CNCF-adjacent open-source tools with a combined 8K GitHub stars.
- Deeply allergic to tickets that say "the deploy broke"; I build the systems that prevent them.
Experience
Senior Platform Engineer · Datadog
Remote (Portland, OR)
2022 – Present
Internal platform team building deployment, observability, and developer tools for 1,200 engineers.
- Designed the multi-cluster GitOps deployment system handling 800+ microservices across 4 regions; zero-downtime canary rollouts reduced incident rate by 62%.
- Built the self-service namespace provisioning system that cut team onboarding from 3 weeks to 2 hours.
- Led the migration from Helm to Kustomize + Argo CD across 120 services; reduced config drift incidents to near zero.
- On-call rotation lead; drove post-incident reviews that reduced MTTR from 45 to 12 minutes over 18 months.
Kubernetes, Argo CD, Terraform, Go, Prometheus
Infrastructure Engineer · Shopify
Ottawa, ON (remote)
2019 – 2022
Core infrastructure team supporting Shopify's multi-region Kubernetes platform.
- Co-architected the Black Friday/Cyber Monday capacity planning system; handled 1.3M rps peak without manual intervention.
- Built the cost attribution pipeline that tagged $42M/year in cloud spend to individual teams; drove a 28% reduction in waste.
- Implemented pod security policies and network policies across 6,000 namespaces; passed the SOC 2 Type II audit with zero findings.
- Mentored 5 junior engineers; 3 promoted to mid-level within 18 months.
Kubernetes, GCP, Terraform, Ruby, Prometheus
Site Reliability Engineer · New Relic
Portland, OR
2017 – 2019
SRE for the core ingest pipeline processing 1 TB/hour of telemetry data.
- Reduced Kafka consumer lag from 45 minutes to under 30 seconds through partition rebalancing and consumer tuning.
- Built the automated runbook system that resolved 40% of pages without human intervention.
- Authored the incident response playbook adopted company-wide.
Kafka, AWS, Ansible, Python, Grafana
Certifications
2023
Certified Kubernetes Administrator (CKA)
CNCF / Linux Foundation
2024
Certified Kubernetes Security Specialist (CKS)
CNCF / Linux Foundation
2022
AWS Certified Solutions Architect - Professional
Amazon Web Services
2021
HashiCorp Certified: Terraform Associate
HashiCorp
2020
Google Professional Cloud Architect
Google Cloud
Skills
- Orchestration: Kubernetes, Argo CD, Flux, Helm, Kustomize, Istio
- Infrastructure: Terraform, Pulumi, Crossplane, AWS, GCP, Azure
- Observability: Prometheus, Grafana, Datadog, OpenTelemetry, Loki, Tempo
- Languages: Go, Python, Rust, Bash, HCL, Rego
- Data / Messaging: Kafka, NATS, Redis, Postgres, etcd
- Practices: GitOps, SRE, Chaos engineering, Incident response, Capacity planning
Open Source
- A Kubernetes controller that automatically cleans up stale preview environments and expired resources based on TTL annotations. 3.2K stars.
Go, Kubernetes, controller-runtime
- A Terraform plan analyzer that estimates cost impact and blocks PRs exceeding budget thresholds. Integrates with GitHub Actions and GitLab CI.
Go, Terraform, GitHub Actions
- A Prometheus federation proxy that pre-aggregates high-cardinality metrics before they hit Thanos. Reduced Thanos query latency by 70% at Datadog.
Rust, Prometheus
Education
BSc, Computer Science · Oregon State University
2017
Languages
English (Native), Japanese (Fluent)