Operations

Site Reliability Engineering

How Google Runs Production Systems

by Niall Richard Murphy, Betsy Beyer, Chris Jones & Jennifer Petoff

Published April 16, 2016
Pages 552
ISBN 978-1491929124
Rating 4/5 stars

Summary

This is the first book to describe Google's approach to service management, covering both the theory and the practice of Site Reliability Engineering (SRE). It explains how Google's SRE teams work to improve the design and operation of systems so that they become more scalable, reliable, and efficient.

Key Takeaways

  • Core SRE responsibilities: availability, latency, performance, efficiency, change management, monitoring, emergency response, and capacity planning
  • Error budgets as a tool to balance reliability with feature velocity
  • The importance of automation in reducing operational toil
  • Blameless postmortem culture and learning from failures
  • Service Level Objectives (SLOs) and Service Level Indicators (SLIs)
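
The error-budget arithmetic behind the takeaways above can be sketched in a few lines of Python. This is a minimal illustration, not code from the book: the function names, the request-based availability SLI, and the sample numbers (a 99.9% SLO over one million requests) are all assumptions chosen for the example.

```python
def availability_sli(good_requests: int, total_requests: int) -> float:
    """SLI: fraction of requests served successfully in the window."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    return good_requests / total_requests


def error_budget(slo_target: float, total_requests: int) -> int:
    """Failed requests the SLO tolerates in the window (rounded)."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    return round((1.0 - slo_target) * total_requests)


def budget_remaining(slo_target: float, good_requests: int,
                     total_requests: int) -> float:
    """Fraction of the error budget still unspent (negative if overspent)."""
    allowed = (1.0 - slo_target) * total_requests
    failed = total_requests - good_requests
    return 1.0 - failed / allowed


if __name__ == "__main__":
    # Hypothetical window: 1,000,000 requests, 250 of which failed.
    slo = 0.999
    total, good = 1_000_000, 999_750
    print("SLI:", availability_sli(good, total))
    print("Budget (failed requests allowed):", error_budget(slo, total))
    print("Budget remaining:", budget_remaining(slo, good, total))
```

In this sketch the SLO leaves room for about 1,000 failed requests, and 250 failures consume a quarter of that budget. A team following the book's approach would typically slow or pause feature launches as the remaining budget approaches zero.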

Why We Recommend It

This book established many of the operational practices that modern platform teams adopt. The SRE principles and practices described here are fundamental to running reliable platforms at scale.

Notable Quotes

"Hope is not a strategy."

— Google SRE Team

"The most important feature of any system is that it continues to work."

— Google SRE Team

Best For

  • Platform engineers responsible for reliability
  • SRE teams and operations professionals
  • Anyone interested in Google's approach to running systems
  • Teams implementing observability and monitoring
