October 24, 2023 By Sunil Joshi 5 min read

As enterprises invest time and money in digitally transforming their business operations and move more of their workloads to cloud platforms, their overall systems organically become largely hybrid by design. A hybrid cloud architecture also means many moving parts and multiple service providers, which makes it much harder to keep hybrid cloud systems highly resilient.

The business impact of system outages

Let’s look at some data points regarding system resiliency over the last few years. Several studies and client conversations reveal that major system outages over the last 4-5 years have either remained flat or have increased slightly, year over year. Over the same timeframe, the revenue impact of the same outages has gone up significantly.

There are several factors contributing to this increase in business impact from outages.

Increased rate of change

One of the main reasons to invest in digital transformation is to gain the ability to make frequent changes to the system to meet business demand. At the same time, 60-80% of all outages are usually attributed to a system change, whether functional, configuration or both. While accelerated change is a must-have for business agility, it has also made outages a lot more impactful to revenue.

New ways of working

The human element is mostly underrated when it comes to digital transformation. The skills needed for Site Reliability Engineering (SRE) and hybrid cloud management are quite different from those of traditional system administration. Most enterprises have invested heavily in technology transformation but not so much in talent transformation. As a result, there is a glaring lack of the skills needed to keep systems highly resilient in a hybrid cloud ecosystem.

Over-loaded network and other infrastructure components

With highly distributed architecture come the challenges of capacity management, especially for the network. A large portion of hybrid cloud architecture usually includes multiple public cloud providers, which means payloads traversing back and forth between on-premises systems and public clouds. This can add a disproportionate burden on network capacity, especially if it is not properly designed, leading to either a complete breakdown or unhealthy responses for transactions.

The impact of unreliable systems can be felt at all levels. For end users, downtime could mean anything from slight irritation to significant inconvenience (for banking, medical services, etc.). For IT Operations teams, downtime is a nightmare when it comes to annual metrics (SLA/SLO/MTTR/RPO/RTO, etc.). Poor Key Performance Indicators (KPIs) for IT operations mean lower morale and higher degrees of stress, which can lead to human errors during resolution.

Recent studies have described the average cost of IT outages to be in the range of $6,000 to $15,000 per minute. The cost of outages is usually proportionate to the number of people depending on the IT systems, meaning large organizations will have a much higher cost per outage than medium or small businesses.
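To make the per-minute range above concrete, here is a minimal arithmetic sketch. The two constants come from the figures quoted above; the function name and the 45-minute duration are hypothetical, chosen only for illustration, and a simple linear cost model is assumed.

```python
# Per-minute cost range quoted in the article (USD).
COST_PER_MINUTE_LOW = 6_000
COST_PER_MINUTE_HIGH = 15_000


def outage_cost_range(duration_minutes: float) -> tuple[float, float]:
    """Return the (low, high) estimated cost in USD for an outage.

    Assumes cost scales linearly with duration, which is a simplification.
    """
    if duration_minutes < 0:
        raise ValueError("duration_minutes must be non-negative")
    return (
        duration_minutes * COST_PER_MINUTE_LOW,
        duration_minutes * COST_PER_MINUTE_HIGH,
    )


# Hypothetical 45-minute outage.
low, high = outage_cost_range(45)
print(f"Estimated cost: ${low:,.0f} to ${high:,.0f}")
# prints "Estimated cost: $270,000 to $675,000"
```

Even under this simplified model, an outage of less than an hour lands in the hundreds of thousands of dollars, which is why shaving minutes off resolution time matters.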

AI solutions for hybrid cloud system resiliency

Now let’s look at some potential mitigating solutions for outages in hybrid cloud systems. Generative AI, when combined with traditional AI and other automation techniques, can be very effective not only in containing some outages, but also in mitigating the overall impact of outages when they do occur.

Release management

As stated earlier, rapid releases are a must-have these days. One of the challenges with rapid releases is tracking the specific changes, who made them, and what impact they have on other sub-systems. Especially in large teams of 25+ developers, getting a good handle on changes through change logs is a herculean task, mostly manual and prone to error. Generative AI can help here by looking at bulk change logs and summarizing specifically what changed and who made the change, as well as connecting each change to the work items or user stories associated with it. This capability is even more relevant when there is a need to roll back a subset of changes because the release has negatively impacted something.
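As a rough illustration of the preparation step behind such a summary, the sketch below groups change-log entries by work item and author, then assembles a prompt that could be handed to a generative AI model. Everything here is an assumption made for the example: the `commit | author | message` log format, the `PAY-101`-style work item keys, and the function names. No model is actually called.

```python
import re
from collections import defaultdict

# Hypothetical change-log format: "<commit id> | <author> | <message>".
SAMPLE_LOG = """\
a1f3 | priya | PAY-101 add retry to payment client
b7c2 | marco | PAY-101 raise payment timeout to 30s
c9d4 | priya | NET-7 update egress firewall rule
e2a8 | lena | fix typo in README
"""

# Assumed work item key pattern, e.g. "PAY-101".
WORK_ITEM = re.compile(r"\b[A-Z]+-\d+\b")


def group_changes(log_text):
    """Group change-log entries by the work item named in each message."""
    grouped = defaultdict(list)
    for line in log_text.splitlines():
        if not line.strip():
            continue
        commit_id, author, message = (p.strip() for p in line.split("|", 2))
        match = WORK_ITEM.search(message)
        key = match.group(0) if match else "UNLINKED"
        grouped[key].append(
            {"commit": commit_id, "author": author, "message": message}
        )
    return dict(grouped)


def build_prompt(grouped):
    """Assemble a plain-text summarization prompt from the grouped changes."""
    lines = ["Summarize what changed, who changed it, and for which work item:"]
    for item in sorted(grouped):
        lines.append(f"Work item {item}:")
        for change in grouped[item]:
            lines.append(
                f"  - {change['commit']} by {change['author']}: {change['message']}"
            )
    return "\n".join(lines)


print(build_prompt(group_changes(SAMPLE_LOG)))
```

Grouping by work item first also makes it easier to identify which commits belong to a subset of changes that needs to be rolled back.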

Toil elimination

In many enterprises, the process of taking workloads from lower environments to production is very cumbersome and usually involves several manual interventions. During outages, while there are “emergency” protocols and processes for rapid deployment of fixes, there are still several hoops to go through. Generative AI, along with other automation, can greatly speed up phase gate decision-making (e.g., reviews, approvals, deployment artifacts, etc.), so deployments can go through faster while still maintaining the quality and integrity of the deployment process.

Virtual agent assist

IT Operations personnel, SREs and other roles can benefit greatly from engaging with a virtual agent assist, usually powered by generative AI, to get answers for commonly occurring incidents, historical issue resolutions and summaries of knowledge management systems. This often means issues can be resolved faster. Empirical evidence suggests a 30-40% productivity gain from using generative AI-powered virtual agent assist for operations-related tasks.


As an extension of the virtual agent assist concept, generative AI-infused AIOps can help improve MTTR by creating executable runbooks for faster issue resolution. By leveraging historical incidents and resolutions and looking at the current health of infrastructure and applications (apps), generative AI can also help prescriptively inform SREs of any potential issues that may be brewing. In essence, generative AI can take operations from being reactive to predictive and get ahead of incidents.

Challenges with generative AI implementation

While there are strong use cases for implementing generative AI to improve IT Operations, it would be remiss not to discuss some of the challenges. It is not always easy to figure out which large language model (LLM) would be the most appropriate for the specific use case being solved. This area is still evolving rapidly, with newer LLMs becoming available almost daily.

Data lineage is another issue with LLMs. There needs to be total transparency on how models were trained so there can be enough trust in the decisions the model will recommend.

Finally, there are additional skill requirements for using generative AI for operations. SREs and other automation engineers will need to be trained in prompt engineering, parameter tuning and other generative AI concepts to be successful.

Next steps for generative AI and hybrid cloud systems

In conclusion, generative AI can bring in significant productivity gains when augmented with traditional AI and automation for many of the IT Operations tasks. This will help hybrid cloud systems to be more resilient and, in due course, help mitigate outages that are impacting business operations.
