The IBM software site reliability engineering (SRE) team plays a crucial role in maintaining the dependability and security of IBM’s SaaS offerings and managed services infrastructure. Operating across IBM Cloud®, AWS, Microsoft Azure and Google Cloud Platform, the SRE team delivers nearly 70 SaaS solutions globally, collecting vast amounts of data down to the microservices level.

Creating a comprehensive resilience evaluation was a significant challenge for this team. Kevin Yu, Principal Site Reliability Engineer, explains, “Our previous methods involved workshops and extensive use of spreadsheets for assessment against our playbook which could literally take months to complete and were also a challenge to update. These methods lacked the ability to provide a holistic view of our system’s resilience posture.”

The SRE team also needed a way to accurately measure and track key resilience metrics, such as availability, recoverability and observability, over time, so that it could identify vulnerabilities and implement improvements effectively.

Enhancing monthly operational reviews (MORs) was another key challenge. Inefficiencies in the SRE team’s existing MOR process hindered its ability to swiftly identify and resolve issues. Organizational silos further complicated the process, making it difficult to align different teams around a common resilience strategy.