We recently introduced the IBM Instana™ Crash Detector, which automatically detects and reports abnormal process terminations on all Linux® machines running Linux kernel 4.8 and above. The IBM Instana platform utilizes the Extended Berkeley Packet Filter (eBPF) functionalities of the Linux kernel to hook into the kernel itself and start listening for process terminations. Any abnormal terminations are signaled to the host agent, which screens them against the processes it monitors to avoid noise about processes that aren’t relevant and then sends the information upstream to the IBM Instana backend. This functionality has been shown to be a game changer for our clients as they work on troubleshooting incidents.

With Crash Detector, the IBM Instana software provides the critical pieces of data for many of the issues that are affecting the performance of our clients’ applications. We’re now enhancing this functionality by adding out-of-memory killer (OOM killer) events to Crash Detector, and it’s an incredibly valuable addition due to its relevance for containerized applications.

What is out-of-memory killer?

The cloud may make it seem like, if you have enough budget, you have infinite computing power at your disposal. However, that computing power comes in slices. Hosts (physical and virtual alike), containers and functions—they all come with limitations on how much memory you can allocate.

On Linux, the out-of-memory (OOM) killer is the kernel mechanism in charge of preventing processes from collectively exhausting the host's memory. When a memory allocation can no longer be satisfied, the process with the overall highest badness score (based, for example, on how much memory it has allocated beyond what's allowed) receives an OOM signal. This fundamentally means: "You are way out of line. Quit or get some of your child processes to quit, or it is lights out."
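You can inspect this badness score directly: the kernel exposes it for every process under procfs. A minimal read-only sketch on any recent Linux:

```shell
# The kernel exports each process's current badness score via procfs.
# oom_score is the value compared across processes when the OOM killer
# must pick a victim; higher means more likely to be killed.
cat /proc/$$/oom_score
# oom_score_adj (range -1000..1000) is a user/admin-tunable bias that
# feeds into the score above.
cat /proc/$$/oom_score_adj
```

Here `$$` is the current shell; substitute any PID to see how the kernel currently ranks that process.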

Notice that the process that triggers the OOM may not be the process that receives the OOM signal. An application that hasn’t recently increased its memory usage may all of a sudden be issued an OOM signal because too many other applications have started on the same host.
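Because victim selection is score-based, you can bias it. Writing to `/proc/<pid>/oom_score_adj` shifts a process up or down the kill list; a sketch that makes the current shell a preferred victim (raising the value needs no privileges):

```shell
# Make this shell a *preferred* OOM victim by raising its adjustment.
# Raising oom_score_adj requires no privileges; lowering it (e.g., to
# -1000, which exempts a process from OOM kills entirely) requires
# root or CAP_SYS_RESOURCE.
echo 500 > /proc/$$/oom_score_adj
cat /proc/$$/oom_score_adj
```

Operators commonly use the negative direction to shield critical daemons (a database, sshd) from being picked when an unrelated process triggers memory pressure.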

The mechanics of an OOM signal sound harsh, but it's actually a very effective mechanism to prevent memory exhaustion on hosts, especially when applications aren't sized correctly or too many applications run in parallel (i.e., the hosts aren't sized correctly for the workload).

For containerized platforms like Kubernetes, Cloud Foundry and Nomad, the use of memory (both sizing applications appropriately and deciding how many applications to run at any one time on a host) is even more important. Generally, you don't plan out in detail which applications run on any one node; in many setups, the orchestrator allocates containers according to its own logic. Enforcing maximum memory consumption is therefore critical for containers. Control groups (cgroups), the foundation of virtually every container technology on Linux, also use the OOM killer to ensure that processes running in the same group (i.e., a container) don't allocate more memory than they're allowed to. When the processes in your containers try to exceed that limit, some will be terminated, often bringing their containers down with them.
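You can see the limit the kernel enforces for your own container or slice. A read-only sketch, assuming a cgroup v2 hierarchy mounted at /sys/fs/cgroup (the default on most modern distributions):

```shell
# Resolve the cgroup this shell belongs to and print its memory limit.
# "max" means unlimited; a byte count means the OOM killer will act
# within this group once usage crosses it (memory.max, cgroup v2).
cgpath=$(awk -F: '$1 == "0" {print $3}' /proc/self/cgroup)
limit_file="/sys/fs/cgroup${cgpath}/memory.max"
if [ -r "$limit_file" ]; then
  cat "$limit_file"
else
  echo "no cgroup v2 memory.max found (cgroup v1 host?)"
fi
```

In Kubernetes, this is the file your container's memory limit ultimately lands in, which is why exceeding a pod's limit produces an OOM kill rather than a failed allocation.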

At scale, everything is harder, including sizing. The more containers you run in your environments, the harder it is to understand when, how and why some of them go down. OOM killer can create unhealthy situations for your applications in which something is always crashing somewhere and getting restarted, producing a continuous stream of errors for your end users that skews your service-level objectives (SLOs) and is really hard to troubleshoot.

Where monitoring has let OOM killer slip through the cracks

Finding out why any single process has been disposed of by OOM killer depends a lot on the technology you use. Some software packages will log it in their own logs. Or you may end up running some command like the following on your hosts—on each of them:

     # RHEL / CentOS
     grep -i "out of memory" /var/log/messages
     # Debian / Ubuntu
     grep -i "out of memory" /var/log/kern.log
     # Or, on systemd-based distributions:
     journalctl -k | grep -i "out of memory"

Looks tame, but it’s definitely not the kind of task you want to run across your production fleet to try to understand why MySQL has kicked the bucket again at 3 AM. Especially when it’s on a hunch, since nothing else seems to explain why the database process is no longer there.

In other words, OOM killer is a system of undeniable importance and efficacy for reliability, but it provides very little observability. The IBM Instana platform is here to fix that for you.

How IBM Instana software detects OOM killer events with eBPF

Further building upon the eBPF foundation that brought you Crash Detector, IBM Instana software now comes with an out-of-the-box OOM killer detector. When a process monitored by IBM Instana software receives an OOM signal, you learn about it in real time: not only that it happened, but also how the situation was resolved (i.e., which process got killed).

[Screenshot caption: This process decided to fall on its sword, which was very honorable.]

As with most IBM Instana features, all you need to do is install the IBM Instana host agent and watch OOM killer go about its grim business. The platform also shows you how much memory the killed process had allocated at the time of the event, so you can understand why OOM killer marked it as "bad."

Problems you can solve with OOM killer detection

Determining how and why a process was terminated, or why the OOM killer singled it out, can take hours (if not days) to uncover without the proper tools. With the IBM Instana Crash Detector, users now immediately have the root cause for every abnormal process termination and every OOM killer event.

Need to understand why a container died? No problem. With IBM Instana Crash Detector's OOM killer detection, you'll know that perhaps your Java Virtual Machine (JVM), running a very important batch job, allocated more resources than it was allowed. Or maybe you need to determine why you're seeing so many Hypertext Preprocessor (PHP) request failures, or why your database disappeared. Again, with IBM Instana Crash Detector's OOM killer detection, you'll have immediate access to the root cause of these issues.

Save time on troubleshooting application performance issues with OOM killer detection

To get started saving yourself and your DevOps teams time troubleshooting OOM killer events, simply install the IBM Instana agent on your Linux OS today. If you don’t already have an IBM Instana instance, you can see how the IBM Instana Crash Detector with OOM killer detection works with a free trial.

Sign up for your free two-week trial

