Site Reliability Engineer (SRE) is a title that’s becoming more and more mainstream these days. Search for SRE job postings and you will see companies like Google, Facebook, Twitter, LinkedIn, Airbnb, Adobe, Stripe, Yelp, Dropbox, and GoDaddy all trying to hire people for this role.
From the SRE title itself, we can conclude at a high level that the job is about making systems (think software, data, servers or other computing/networking resources) available and reliable for their users. Just like its close cousin, DevOps Engineer, SRE can be a challenging role to recruit for. Let’s dive deeper and look beneath the surface.
The Origin of SRE
Google coined the term “Site Reliability Engineering” in the early 2000s. The key figure in Site Reliability Engineering is Ben Treynor, who joined Google in 2003 as Site Reliability Tsar and founded Google’s Site Reliability team. He has grown the team from 7 people to over 1,500, responsible for Google’s worldwide internal and external network, data centers, hardware operations, as well as Google Cloud Platform.
Traditionally, there has been constant tension between development teams and operations teams. The software development team focuses on creating new features and fixes. They want to deploy their updates to the production systems and make them available to users as soon as possible. The operations team, on the other hand, is a fierce protector of the availability and stability of the production system. When the system breaks, the operations team is the one who gets calls at night.
In other words, the development team makes frequent changes, wants these changes in the production system, and, inevitably, introduces instability and unreliability. The operations team, prioritizing stability, would push back when the development team attempts frequent and fast changes to something that has been working.
To close the gap between the development team and the operations team so they can better collaborate, Ben Treynor came up with a new approach with his SRE team, to reliably deliver Google’s large-scale applications and massively distributed systems to their hundreds of millions of users.
What is SRE?
Google describes their SREs this way: “We build our own creative engineering solutions to operations problems. Much of our software development focuses on optimizing existing systems, building infrastructure and eliminating work through automation.”
During an interview, Ben Treynor defined SRE this way: “Fundamentally, it’s what happens when you ask a software engineer to design an operations function.” Applying a software engineering mindset, instead of the traditional systems administration mindset, to the operations function better bridges operations and development efforts.
Ben Treynor also defines the day-to-day responsibilities of SREs as being “responsible for availability, latency, performance, efficiency, change management, monitoring, emergency response, and capacity planning.” SREs are responsible for the stability of the production environment, but, at the same time, are committed to new software features and operational improvement.
According to Google, “SRE is also a mindset and a set of engineering approaches to running better production systems.” When software engineers are directly exposed to operations challenges, they become more mindful of the importance of code quality and automated testing in achieving a reliable production system.
The People and Skills of an SRE Team
Typically, SREs spend half of their time on operations-related tasks and the other half on development-related tasks. This is because the software systems they are responsible for are often distributed, large-scale, and mission-critical, and need to be highly available, elastic (scaling fast and smoothly both up and down), cost-efficient, and fault-tolerant. To design and manage systems like these, SREs rely heavily on software engineering, automation, and self-healing mechanisms to accomplish their goals.
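To make the idea of a self-healing mechanism concrete, here is a minimal Python sketch. It is an illustration only, not how Google or any company named in this article implements it: the function `ensure_healthy`, the `FakeService` stand-in, and the `max_restarts` limit are all hypothetical names and choices made for this example. The sketch checks a service and, if it is unhealthy, restarts it automatically a limited number of times before giving up.

```python
from typing import Callable


def ensure_healthy(
    is_healthy: Callable[[], bool],
    restart: Callable[[], None],
    max_restarts: int = 3,
) -> bool:
    """Check a service and restart it until healthy or out of attempts.

    Hypothetical helper for illustration; returns the final health status.
    """
    for _ in range(max_restarts):
        if is_healthy():
            return True
        restart()  # automated remediation attempt
    return is_healthy()


class FakeService:
    """Stand-in for a real service: recovers after a set number of restarts."""

    def __init__(self, restarts_needed: int) -> None:
        self.restarts_needed = restarts_needed
        self.restart_count = 0

    def is_healthy(self) -> bool:
        return self.restart_count >= self.restarts_needed

    def restart(self) -> None:
        self.restart_count += 1


if __name__ == "__main__":
    service = FakeService(restarts_needed=2)
    recovered = ensure_healthy(service.is_healthy, service.restart)
    print(f"recovered={recovered} after {service.restart_count} restart(s)")
```

In a real system, the two callables would presumably wrap an actual health check and restart command, and a `False` result would be the point at which a human gets paged.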
Given this, what kinds of people, with what skills, are companies looking for when building their SRE teams?
Ben Treynor said in his interview that, on his SRE team, “Typically we hire about a 50-50 mix of people who have more of a software background and people who have more of a systems background. It seems to be a really good mix.”
This means either software engineers with networking and systems (such as Unix/Linux) knowledge, or people from a systems administration or network engineering background who can code as well as software engineers do. The dual requirements of programming skills and network/systems skills make recruiting and hiring SREs difficult, even for companies like Google.
Here are some qualifications we would see in typical SRE job descriptions (from different companies):
Systems and networking related skills
• Experience with web server configuration, monitoring, trending, network design, high availability (Yelp)
• Demonstrable knowledge of TCP/IP, Linux operating system internals, filesystems, disk/storage technologies and storage protocols (Twitter)
• Experience working with fault-tolerant, highly available, high traffic/throughput, distributed, scalable systems (Oracle)
• You think at scale… big scale… with a focus on ensuring stability and maximizing the performance of services you own (Box)
• Knowledge of AWS and container orchestrators such as Kubernetes, Mesos, or Docker Swarm etc. (ExtraHop Networks)
• Experience with metrics, monitoring and alerting tools such as Splunk, Wavefront, AppDynamics (Intuit)
• You’ve attained a deep understanding of networking protocols (e.g. TCP/IP, HTTP, DNS, etc) (Evernote)
Software development related skills
• Practical knowledge of shell scripting and at least one scripting language (Python, Ruby, Perl) (Twitter)
• Proficient in coding complex systems using Python, Ruby, Java, C/C++ (Oracle)
• Command of your favorite modern programming language: Python, Ruby, Java, C++, etc (Yelp)
• Software development experience with Java, Python or Golang (Atlassian)
• Experience in data structures, algorithms, and software design (Indeed)
• You’ve built extensible and maintainable automation (Shell, Python, or Go preferred) (Evernote)
• Proclivity towards efficient programming emphasizing improvement via complexity analysis (Apple)
• Experience in defensive coding practices and patterns for high-availability (Home Depot)
SRE vs. DevOps
At this point, you might be thinking, “All the SRE stuff looks like DevOps. How are SRE and DevOps related? Are they the same or different?”
DevOps emerged a little later than SRE, independently and outside of Google, with the Agile movement as a main contributing factor. It aims to solve the same problem: unifying software development and operations. The skills and tools used by SRE and DevOps are the same for the most part.
Some people view them as the same. Others differentiate them, and for those who do, the differences are likely to lie in the mindset, emphasis, and priorities of SRE and DevOps. These may seem subtle and nuanced. Getting into the details would deserve a whole separate article.