A global software SRE organization becomes more resilient with automation
The IBM software site reliability engineering (SRE) team plays a crucial role in maintaining the dependability and security of IBM’s SaaS offerings and managed services infrastructure. Operating across IBM Cloud®, AWS, Microsoft Azure and Google Cloud Platform, the SRE team delivers nearly 70 SaaS solutions globally, collecting vast amounts of data down to the microservices level.
Creating a comprehensive resilience evaluation was a significant challenge for this team. Kevin Yu, Principal Site Reliability Engineer, explains, “Our previous methods involved workshops and extensive use of spreadsheets for assessment against our playbook which could literally take months to complete and were also a challenge to update. These methods lacked the ability to provide a holistic view of our system’s resilience posture.”
The SRE team also needed a solution to accurately measure and track key resilience metrics, such as availability, recoverability and observability, over time to identify vulnerabilities and implement improvements effectively.
Enhancing monthly operational reviews (MORs) was another key challenge. The inefficiencies of the SRE team’s existing MOR process hindered their ability to swiftly identify and resolve issues. Organizational silos further complicated the process, making it difficult to align different teams to a common resilience strategy.
The SRE team deployed the IBM Concert® platform to help them reduce silos, drive continuous improvement and unlock a repeatable approach to resilience.
The solution combines automation and AI-powered insights into a standardized, scalable framework to assess, enhance and sustain resilience.
Before the team implemented IBM Concert, resilience evaluations were a manual, labor-intensive task that could take months to complete. The solution’s resilience framework has automated this process, providing a comprehensive view of key resilience metrics such as availability, recoverability and observability. Automation has significantly reduced the time and effort required, enabling the SRE team to focus on enhancing application robustness and reliability.
The previous MOR process was inefficient and time-consuming, often requiring hundreds of hours to extract and collate data. With IBM Concert, the SRE team now summarizes and reports data more efficiently and gives stakeholders more accurate information for compliance assessment and strategic planning. The solution also consolidates data from various sources to create a unified view that enhances the SRE team’s ability to solve problems.
As Yu explains, “Concert helped us break down silos and be more productive. We now have a scalable framework to measure, improve and sustain application resilience across IBM.”
Deploying the resilience framework in IBM Concert brought transformative results to IBM’s SRE team.
“The solution has transformed our approach to application resilience,” says Yu. “By automating key resilience data collection, we addressed silos and operationalized resilience. As a result, IBM Concert resilience posture reduced person-days in an IBM enterprise-wide resilience posture evaluation per application by 62%, compared to manual evaluation.”1
The SRE team says the transformation has also improved their productivity and fostered better collaboration with other teams. Using the solution’s standardized framework, the SRE team can align different parts of the organization to a common resilience strategy, which has improved overall coordination and communication. Additionally, comprehensive and consistent reporting capabilities have enhanced transparency and accountability within IBM. Internal stakeholders have indicated they now have a better understanding of resilience metrics and issue management, leading to more informed decision-making.
By leveraging the resilience posture of IBM Concert, the SRE team has achieved a more streamlined and effective approach to resilience evaluation and MORs, helping ensure IBM’s SaaS and managed services infrastructure remains dependable and secure. “IBM Concert resilience posture reduced the IBM SRE team’s person-hours spent in MOR by 72% compared to manually producing the report,” says Yu.1
The IBM Software SRE organization is a global team focused on delivering highly available and scalable production SaaS for IBM software products. The Software SRE team provisions, deploys, monitors and maintains these services, and manages incidents, by standardizing tooling, processes, automation, runbooks and practices. They work closely with IBM software development teams to design and implement changes, providing a highly resilient service throughout the software lifecycle.
1: Based on results from an internal test. Individual results may vary.
© Copyright IBM Corporation 2025. IBM, the IBM logo, Concert, IBM Cloud, and IBM Concert are trademarks or registered trademarks of IBM Corp., in the U.S. and/or other countries.
Microsoft is a trademark of Microsoft Corporation in the United States, other countries, or both.
Examples presented as illustrative only. Actual results will vary based on client configurations and conditions and, therefore, generally expected results cannot be provided.