If you were looking for tips on running your own cloud infrastructure, a few places would come immediately to mind, and Google would be one of them.
This is what makes the release of Site Reliability Engineering: How Google Runs Production Systems a bit of an occasion. The 500+ page book features articles and essays edited by Jennifer Petoff, Niall Richard Murphy, Betsy Beyer, and Chris Jones – all members of Google’s Site Reliability Team – discussing the principles, practices, and management of one of the world’s biggest software systems.
Discussing the Site Reliability Engineering (SRE) model in a conversation on the Google Cloud Platform blog, Chris Jones explained:
“SRE is an engineering approach to operating large-scale distributed computing services. Making systems highly standardized is critical. This means all systems work in similar ways to each other, which means fewer people are needed to operate them since there are fewer complexities to understand and deal with.”
Jones added that factors such as automation, standardization across products, and putting software and systems in close proximity to each other are also important. “The combination of software engineering and systems engineering knowledge in SRE often leads to solutions that synthesize the best of both backgrounds,” Jones said. He highlighted Google’s software network load balancer, Maglev, as an example of this approach at work.
The book’s 34 chapters (and five appendices) include topics such as “Embracing Risk,” “Eliminating Toil,” “Monitoring Distributed Systems,” the “Role of a Release Engineer,” and “Managing Critical States,” as well as a discussion of the “Production Environment at Google, from the Viewpoint of an SRE.” Site Reliability Engineering is accompanied by a website with greater detail about the book. A number of members of Google technical teams also spoke at SRE16 in Santa Clara, California, this week, including Kripa Krishnan, Technical Program Director for Google, who gave a Friday morning keynote address, “Putting Together Great SRE Teams.”
Google Cloud Platform participated in our inaugural FinDEVr New York developers conference last month. Sales Engineer Rene-Paul LaFrie discussed “TensorFlow Machine Learning with Financial Data on Google Cloud Platform,” highlighting how machine learning techniques can support time series analysis.