Kai Tanaka

Senior Platform Engineer

I make infrastructure disappear. Kubernetes, GitOps, and the kind of observability that pages you before customers notice.

About

Nine years building and operating production Kubernetes clusters serving 50K+ rps across three cloud providers.
Built the platform that lets 200 engineers deploy to production 80 times per day with zero-downtime rollouts.
Maintain four CNCF-adjacent open-source tools with a combined 8K GitHub stars.
Deeply allergic to tickets that say "the deploy broke"; I build the systems that prevent them.

Experience

Senior Platform Engineer | Datadog | Remote (Portland, OR) | 2022 – Present

Internal platform team building deployment, observability, and developer tools for 1,200 engineers.
- Designed the multi-cluster GitOps deployment system handling 800+ microservices across 4 regions; zero-downtime canary rollouts reduced incident rate by 62%.
- Built the self-service namespace provisioning system that cut team onboarding from 3 weeks to 2 hours.
- Led the migration from Helm to Kustomize + Argo CD across 120 services; reduced config drift incidents to near zero.
- On-call rotation lead; drove post-incident reviews that reduced MTTR from 45 to 12 minutes over 18 months.
Kubernetes, Argo CD, Terraform, Go, Prometheus

Infrastructure Engineer | Shopify | Ottawa, ON (remote) | 2019 – 2022

Core infrastructure team supporting Shopify's multi-region Kubernetes platform.
- Co-architected the Black Friday/Cyber Monday capacity planning system; handled 1.3M rps peak without manual intervention.
- Built the cost attribution pipeline that tagged $42M/year in cloud spend to individual teams; drove a 28% reduction in waste.
- Implemented pod security policies and network policies across 6,000 namespaces; passed SOC 2 Type II audit with zero findings.
- Mentored 5 junior engineers; 3 promoted to mid-level within 18 months.
Kubernetes, GCP, Terraform, Ruby, Prometheus

Site Reliability Engineer | New Relic | Portland, OR | 2017 – 2019

SRE for the core ingest pipeline processing 1 TB/hour of telemetry data.
- Reduced Kafka consumer lag from 45 minutes to under 30 seconds through partition rebalancing and consumer tuning.
- Built the automated runbook system that resolved 40% of pages without human intervention.
- Authored the incident response playbook adopted company-wide.
Kafka, AWS, Ansible, Python, Grafana

Certifications

2023 | Certified Kubernetes Administrator (CKA) | CNCF / Linux Foundation
2024 | Certified Kubernetes Security Specialist (CKS) | CNCF / Linux Foundation
2022 | AWS Solutions Architect - Professional | Amazon Web Services
2021 | HashiCorp Certified: Terraform Associate | HashiCorp
2020 | Google Professional Cloud Architect | Google Cloud

Skills

Orchestration: Kubernetes, Argo CD, Flux, Helm, Kustomize, Istio
Infrastructure: Terraform, Pulumi, Crossplane, AWS, GCP, Azure
Observability: Prometheus, Grafana, Datadog, OpenTelemetry, Loki, Tempo
Languages: Go, Python, Rust, Bash, HCL, Rego
Data / Messaging: Kafka, NATS, Redis, Postgres, etcd
Practices: GitOps, SRE, Chaos engineering, Incident response, Capacity planning

Open Source

kube-janitor | 2021–

A Kubernetes controller that automatically cleans up stale preview environments and expired resources based on TTL annotations. 3.2K stars.

Go, Kubernetes, controller-runtime

tf-cost-guard | 2022

A Terraform plan analyzer that estimates cost impact and blocks PRs exceeding budget thresholds. Integrates with GitHub Actions and GitLab CI.

Go, Terraform, GitHub Actions

prom-aggregator | 2023

A Prometheus federation proxy that pre-aggregates high-cardinality metrics before they hit Thanos. Reduced Thanos query latency by 70% at Datadog.

Rust, Prometheus

Education

BSc, Computer Science | Oregon State University | 2017

Languages

English (Native), Japanese (Fluent)