Joshua Gilman
Lead Site Reliability Engineer
Veteran SRE with 15+ years of experience building and operating large-scale systems across big tech, Web3, data center, and manufacturing industries. I design resilient cloud platforms, cut waste through automation, and lead cross-functional teams that ship reliable services to 1K-400K+ users.
At a glance
Lead SRE
IOG / Project Catalyst
Built and ran the Cardano governance platform for 400K+ users.
SRE, Core Infra
Modernized delivery pipelines for 1K+ engineers; reduced toil across DC ops.
Controls & SCADA
Google DCs
Automated security/monitoring, preventing incidents and saving 400K+ hours.
Technical Skills
Cloud & Kubernetes
AWS, Google Cloud, EKS, Kubernetes, Argo CD, Helm, Timoni
IaC & Automation
Terraform, Ansible, Nix, Jsonnet, GitLab CI, Docker
Languages & Scripting
Go, Python, Rust, JavaScript, SQL
Reliability & Observability
SLO/SLI design, Observability (Prometheus, Grafana), Incident response, Runbooks & playbooks, Chaos testing, Cost optimization, Security automation
Experience
Input Output HK
Portland, OR
Lead Site Reliability Engineer
Oct 2022 – Present
- Develop, iterate on, and maintain infrastructure (Kubernetes, Terraform, Earthly) for Project Catalyst, Cardano's decentralized innovation system, delivering projects 4-6 months faster than the prior average, supporting 4 funds annually ($50M+ distributed per year), and improving community engagement with the platform (200K-400K+ members)
- Architect a complete software stack (CUE, Timoni, Argo CD) capable of running blockchain workloads, and implement the safety and reliability practices those workloads require, improving public confidence in an on-chain governance system
- Design and implement an automated software testing and delivery pipeline, cutting testing time by 92% and saving 100+ hours of testing per week
- Create documentation and training that enable 20+ software engineers to take ownership of their systems, and promote a strong DevOps culture focused on platform engineering, saving $250K annually in infrastructure costs
Google
Portland, OR
Site Reliability Engineer (Core Infrastructure Team)
Mar 2022 – Oct 2022
- Built multiple software delivery pipelines in Python, improving delivery reliability by 10% for 1K+ software engineers
- Migrated legacy applications used by 10K+ employees to newer orchestration systems, increasing developer velocity and reducing cognitive load
- Spearheaded Google's first Data Center Modernization Initiative, integrating SRE principles into data center operations to improve visibility into critical services and reduce toil
Data Center Technician I/II
Jul 2014 – Mar 2022
- Led 2-3-year data center build projects to commission 3 data centers (400MW+ total), achieving top rankings for controls engineering and compliance and earning Google's first-ever promotion from Data Center Technician to SRE
- Collaborated with the Controls Engineer to maintain and improve complex SCADA software and network deployments, including a medium-sized Cisco-based network and 4 FactoryTalk deployments
- Designed, led, and implemented software improvements to SCADA systems (automated security systems, monitoring tools, disaster recovery software), resulting in zero major cybersecurity incidents and saving data center operations 400K+ manual labor hours annually
- Earned 15+ Google awards and recognitions, including Top Awards for the Control Systems Automated Anti-Virus Solution
Johnson Controls
San Francisco, CA
HVAC Controls Technician
May 2013 – Jul 2014
- Supported 20+ sites across the Greater San Francisco Bay Area, performing preventive maintenance on building management systems (BMS) and ranking in the top 5% of technicians with a 100% on-time project delivery rate
- Responded to and resolved 15-20 BMS service requests weekly, completing 90% of requests within 2-3 hours
- Designed and commissioned 9 HVAC process plants ($100K-$3M+) for technology companies, data centers, skyscrapers, manufacturing plants, and laboratory facilities