Site reliability engineer is a technical position that requires experience in both software development and IT operations. Understanding both disciplines enables SRE teams to fulfill their role in supporting the software development lifecycle. SRE is based on a strategy of resiliency through the consistent automation of processes.
Traditionally, site reliability engineering practices focused on performing IT operations and system administration tasks. These tasks include analyzing logs, tuning performance, applying patches, testing production environments, managing incidents and conducting postmortems. They were initially done manually, which was time-consuming and prone to human error. The modernization of site reliability engineering involves automating these manual tasks.
Monitoring and logging play a key role in SRE. SRE teams use monitoring tools to track what is happening in software systems in real time. Monitoring makes it possible to fix immediate technical issues and helps teams anticipate future problems and resolve them before they occur.
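As a rough illustration of how a monitoring check can flag an issue as it develops, the following sketch tracks the outcomes of the most recent requests and signals when the error rate passes a threshold. The class name, the window size and the 5% threshold are assumptions made for this example and do not come from any particular monitoring tool.

```python
from collections import deque


class ErrorRateMonitor:
    """Hypothetical monitor: tracks recent request outcomes and flags a high error rate."""

    def __init__(self, window_size=100, threshold=0.05):
        # Only the newest `window_size` outcomes are kept; True means the request failed.
        self.outcomes = deque(maxlen=window_size)
        self.threshold = threshold

    def record(self, failed):
        """Record the outcome of one request."""
        self.outcomes.append(bool(failed))

    def error_rate(self):
        """Fraction of failed requests within the current window."""
        if not self.outcomes:
            return 0.0
        return sum(self.outcomes) / len(self.outcomes)

    def should_alert(self):
        """True when the error rate is strictly above the threshold."""
        return self.error_rate() > self.threshold
```

Because the window holds only recent outcomes, an old burst of failures ages out on its own, while a new burst raises the rate quickly enough for a team to react.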
Logs serve as archives that can be analyzed to gain insights into how systems are functioning and to improve system observability. Logging creates a record that helps SRE teams understand the series of events that caused an unanticipated error. Engineers can then automate the remediation of the error and prevent it from recurring. Both monitoring and logging help engineers identify points of failure and programmatically solve issues through automation so that they do not need to be fixed manually.
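To make the idea of tracing the events behind an error more concrete, here is a minimal sketch that returns the first error entry in a log together with the entries just before it. It assumes a simple, hypothetical line format of "timestamp LEVEL message"; real log formats vary, and the function name is illustrative only.

```python
def events_leading_to_error(log_lines, context=3):
    """Return the first ERROR line preceded by up to `context` earlier lines.

    Assumes each line looks like "<timestamp> <LEVEL> <message>" (a hypothetical format).
    Returns an empty list when no ERROR line is found.
    """
    for index, line in enumerate(log_lines):
        parts = line.split(" ", 2)
        # The second whitespace-separated field is treated as the log level.
        if len(parts) >= 2 and parts[1] == "ERROR":
            start = max(0, index - context)
            return log_lines[start:index + 1]
    return []
```

For example, given entries for a database connection, a slow-query warning and then a timeout error, the function returns those lines in order, which is the sequence an engineer would review before deciding what remediation to automate.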
SRE teams also look for system deficiencies through a process called chaos engineering. Chaos engineering is a strategy that site reliability engineers implement to intentionally cause failures in production and pre-production environments. The purpose of chaos engineering is to understand the impact production failures have on software systems and to develop stronger plans to mitigate failures in the future.
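A very small version of deliberate failure injection can be sketched as a wrapper that makes a function fail at a chosen rate, so that the surrounding system's reaction can be observed. All names here (InjectedFailure, with_fault_injection, run_experiment) are hypothetical, and a seeded random generator is assumed so that an experiment can be repeated; dedicated chaos engineering tools work at the infrastructure level and are considerably more involved.

```python
import random


class InjectedFailure(Exception):
    """Raised deliberately by the experiment, not by a real fault."""


def with_fault_injection(func, failure_rate, rng):
    """Wrap `func` so that each call fails with probability `failure_rate`."""
    def wrapper(*args, **kwargs):
        # rng.random() returns a float in [0.0, 1.0).
        if rng.random() < failure_rate:
            raise InjectedFailure("injected failure")
        return func(*args, **kwargs)
    return wrapper


def run_experiment(func, calls, failure_rate, seed=0):
    """Call the wrapped function `calls` times and count the injected failures."""
    rng = random.Random(seed)  # fixed seed keeps the experiment repeatable
    flaky = with_fault_injection(func, failure_rate, rng)
    failures = 0
    for _ in range(calls):
        try:
            flaky()
        except InjectedFailure:
            failures += 1
    return failures
```

In practice the interesting part is what the caller does when the failure is raised, such as retrying or falling back, since that behavior is what the experiment is meant to reveal.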
SRE also focuses on capacity planning, a process that determines the resources that are needed to run essential business functions, scale those business functions and develop new applications and features. In addition, SRE teams establish metrics that are used to evaluate the delivery of updates and the implementation of new features.
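One simple form of capacity planning arithmetic is estimating how many instances are needed to serve peak traffic with some spare room. The sketch below assumes load is measured in requests per second and that each instance has a known sustainable throughput; the function name, parameters and the 30% default headroom are illustrative assumptions, not a standard.

```python
def instances_needed(peak_rps, per_instance_rps, headroom_percent=30):
    """Estimate the instance count for a peak load plus headroom.

    peak_rps:          expected peak requests per second (integer)
    per_instance_rps:  requests per second one instance can sustain (integer)
    headroom_percent:  spare capacity to keep on top of the peak (integer)
    """
    if per_instance_rps <= 0:
        raise ValueError("per_instance_rps must be positive")
    # Integer arithmetic avoids floating-point rounding surprises.
    required = peak_rps * (100 + headroom_percent)
    capacity = 100 * per_instance_rps
    # Ceiling division: round up, since a partial instance cannot be provisioned.
    return -(-required // capacity)
```

With a peak of 1,000 requests per second, 150 requests per second per instance and 30% headroom, the estimate is 9 instances, because 1,300 divided by 150 is about 8.67 and is rounded up.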