When would 10 seconds become a big deal? When it comes to application performance.
For cloud-native microservices applications, 10 seconds is a really long time.
The list of things that can happen to your applications in 10 seconds is endless, and most of them are not good.
But, before we dive into the details about what could happen to your applications, let’s look at some real-world events that show what can happen in 10 seconds:
We particularly like the Usain Bolt example because that is a long way to run in less than 10 seconds.
For cloud-native application performance and availability, 10 seconds is an eternity. Transactions are zipping around across the internet, keeping the wheels of commerce well-lubricated.
What can happen in 10 seconds if something goes wrong? Well, thousands of transactions can be delayed, or fail and never complete at all.
With this type of problem, revenue can drop due to lost sales. Customers will abandon shopping carts and your site and find another place to buy what they want. And your brand image can suffer.
Why, then, would it be acceptable for observability tools to capture metrics slowly or, worse, to sample and aggregate metrics and traces? How can a platform like that be viewed as equivalent to an observability platform such as the IBM® Instana® platform, which gathers and contextualizes information at the speed of modern microservices? Slower tools allow the problems described above to linger until the information you need to remediate them becomes available.
For PRISA Tecnologia, performance is key. When they encounter a performance problem, it has an immediate and detrimental impact on the business performance and the consumer’s perception of their brand.
“A one-second time difference in displaying content makes a huge difference to our audience’s experience.” – Jorge Tomé Hernando, Director of IT Architecture, Operations, Security and Workplace, PRISA Tecnologia
The major observability competitors of the Instana platform either sample metrics at 10-second intervals or aggregate metrics over intervals of one minute or more, compared with the Instana platform’s ultraprecise one-second metric interval. The Instana platform also delivers notification of an issue within three seconds. This response is illustrated in the observability detection gap diagram shown here.
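To make the detection gap concrete, here is a minimal sketch of how a short latency spike looks at each granularity. The data, the 1,000 ms alert threshold and the helper names are all hypothetical, and the code does not represent how the Instana platform or any competing product collects metrics. It only shows the arithmetic of point sampling and averaging.

```python
# Hypothetical per-second latency (ms) for one minute: a steady 100 ms
# baseline with a 5-second spike to 2000 ms at seconds 23 through 27.
latency_ms = [2000 if 23 <= t <= 27 else 100 for t in range(60)]

THRESHOLD_MS = 1000  # assumed alert threshold, for illustration only


def sample(series, interval):
    """Keep one reading every `interval` seconds (simple point sampling)."""
    return series[::interval]


def mean(series):
    """Arithmetic mean, standing in for a one-minute aggregate."""
    return sum(series) / len(series)


one_second = sample(latency_ms, 1)    # every reading
ten_second = sample(latency_ms, 10)   # readings at t = 0, 10, 20, ...
minute_avg = mean(latency_ms)         # a single value for the whole minute

spike_seen_1s = any(v > THRESHOLD_MS for v in one_second)
spike_seen_10s = any(v > THRESHOLD_MS for v in ten_second)
spike_seen_avg = minute_avg > THRESHOLD_MS

print(f"1-second metrics see the spike:  {spike_seen_1s}")
print(f"10-second samples see the spike: {spike_seen_10s}")
print(f"1-minute average sees the spike: {spike_seen_avg} ({minute_avg:.0f} ms)")
```

In this particular example the spike falls between two 10-second samples and is diluted in the one-minute average, so only the one-second series crosses the threshold. A spike that happened to line up with a sample point would be caught, but you would be relying on luck.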
Can you really afford to wait 10 seconds or up to a minute for your observability platform to tell you there’s an issue? With manual triage, maybe. But with automated or even semiautomated remediation, you can’t.
For all applications, speed and reliability are the goals. To achieve better application performance and reliability, the go-to strategy that “a human always needs to fix a problem (MTTR)” has to change. Relying on human intervention for every fix will overburden staff and restrict the pace of change. It will also degrade service-level indicators (SLIs).
“With Instana, our day-to-day goal is to be able to guarantee a latency expectation. Our goal for service calls is to complete within less than 250 milliseconds. So, it’s not just for fire drills. In the day-to-day, we’re able to improve performance, and that drives us toward that 250 ms goal. Instana makes this possible.” – Bryce Hendrix, Lead Platform Architect, Dealerware
For improved performance with higher availability, automated AIOps is the way forward. It combines AIOps with additional automation to reach higher levels of both performance and availability. How? By letting automated AIOps resolve the issues that the machine can flawlessly correct much faster than a human. Many issues, such as those involving infrastructure resource allocation, can be remediated or prevented by the machine before a human can even intervene.
Do these benefits mean that all application issues can be resolved with automated AIOps? Of course not.
There are many complex logic issues that only human triage can resolve, such as code issues and the like. But there are also many issues where automated AIOps is faster, more efficient and should be preferred for issue remediation.
In my previous post, I wrote about mean time to prevention (MTTP), which is defined as the amount of time that observability plus AIOps takes to prevent an issue from negatively impacting hybrid cloud applications and infrastructure.
Automated AIOps adds a new option to the application issue remediation continuum. The previous diagram illustrates that the continuum runs from fully automated issue remediation down to the human MTTR staple.
In the continuum, observability is the starting point for every type of remediation. The longer it takes for an issue to be detected by the observability platform, the longer it takes to begin the remediation process. Once automated AIOps is added, this lag means that the difference between one-second detection and detection that takes 10 seconds or more becomes huge. If your application can afford to wait more than 10 seconds for an issue to be detected, why use automated AIOps at all?
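A rough back-of-the-envelope sketch shows why the detection delay dominates once the fix itself is automated. The traffic rate, the two-second automated fix time and the function name are assumptions chosen for illustration, not measurements from any real system.

```python
def transactions_exposed(tx_per_second, detection_delay_s, remediation_s):
    """Transactions arriving between issue onset and completed remediation."""
    return tx_per_second * (detection_delay_s + remediation_s)


TX_PER_SECOND = 500   # hypothetical traffic rate
AUTOMATED_FIX_S = 2   # hypothetical time for an automated fix to take effect

# Compare detection delays of 1 second, 10 seconds and one minute.
exposure = {
    delay: transactions_exposed(TX_PER_SECOND, delay, AUTOMATED_FIX_S)
    for delay in (1, 10, 60)
}

for delay, count in exposure.items():
    print(f"{delay:>2}s detection -> {count} transactions exposed")
```

Under these assumptions the automated fix takes the same two seconds in every case, so nearly all of the difference in exposed transactions comes from how long the issue went undetected.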
Automated AIOps remediation is the wave of the future. It’s the next logical step in improving application performance and resiliency. Infrastructure performance issues often outweigh microservices code issues and will continue to do so into the future.
The new gold standard for application issue detection and remediation will become automated observability plus AIOps. They’ll be used in tandem to help ensure that issues don’t devolve into major problems.
If you want to achieve the full benefits of automated AIOps remediation, you need high-frequency, ultraprecise metrics and traces to feed the AIOps engine. And you can get them for a fraction of the cost of the slower observability technologies.
Indeed, a lot can happen in 10 seconds. With real-time metrics and automated AIOps, you can ensure that the bad issues don’t happen to your applications.
Get started with IBM Instana