We recently introduced the IBM Instana™ Crash Detector, which automatically detects and reports abnormal process terminations on all Linux® machines running Linux kernel 4.8 and above. The IBM Instana platform utilizes the Extended Berkeley Packet Filter (eBPF) functionalities of the Linux kernel to hook into the kernel itself and start listening for process terminations. Any abnormal terminations are signaled to the host agent, which screens them against the processes it monitors to avoid noise about processes that aren’t relevant and then sends the information upstream to the IBM Instana backend. This functionality has been shown to be a game changer for our clients as they work on troubleshooting incidents.

With Crash Detector, the IBM Instana software provides the critical pieces of data for many of the issues that are affecting the performance of our clients’ applications. We’re now enhancing this functionality by adding out-of-memory killer (OOM killer) events to Crash Detector, and it’s an incredibly valuable addition due to its relevance for containerized applications.

What is out-of-memory killer?

The cloud may make it seem like, if you have enough budget, you have infinite computing power at your disposal. However, that computing power comes in slices. Hosts (physical and virtual alike), containers and functions—they all come with limitations on how much memory you can allocate.

On Linux, the out-of-memory (OOM) killer is the kernel mechanism in charge of preventing processes from collectively exhausting the host's memory. When a memory allocation can no longer be satisfied, the process with the overall highest badness score (based, for example, on how much memory it has allocated beyond what's allowed) receives an OOM signal. This fundamentally means: "You are way out of line. Quit or get some of your child processes to quit, or it is lights out."
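You can inspect this badness score directly: the kernel exposes it for every process under procfs. A minimal read-only sketch on any recent Linux:

```shell
# The kernel exports each process's current badness score via procfs.
# oom_score is the value compared across processes when the OOM killer
# must pick a victim; higher means more likely to be killed.
cat /proc/$$/oom_score
# oom_score_adj (range -1000..1000) is a user/admin-tunable bias that
# feeds into the score above.
cat /proc/$$/oom_score_adj
```

Here `$$` is the current shell; substitute any PID to see how the kernel currently ranks that process.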

Notice that the process that triggers the OOM may not be the process that receives the OOM signal. An application that hasn’t recently increased its memory usage may all of a sudden be issued an OOM signal because too many other applications have started on the same host.
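Because victim selection is score-based, you can bias it. Writing to `/proc/<pid>/oom_score_adj` shifts a process up or down the kill list; a sketch that makes the current shell a preferred victim (raising the value needs no privileges):

```shell
# Make this shell a *preferred* OOM victim by raising its adjustment.
# Raising oom_score_adj requires no privileges; lowering it (e.g., to
# -1000, which exempts a process from OOM kills entirely) requires
# root or CAP_SYS_RESOURCE.
echo 500 > /proc/$$/oom_score_adj
cat /proc/$$/oom_score_adj
```

Operators commonly use the negative direction to shield critical daemons (a database, sshd) from being picked when an unrelated process triggers memory pressure.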

The mechanics of an OOM signal sound harsh, but it's actually a very effective mechanism to prevent memory exhaustion on hosts, especially when applications aren't sized correctly or too many applications run in parallel (i.e., the hosts aren't sized correctly for the workload).

For containerized platforms like Kubernetes, Cloud Foundry and Nomad, the use of memory (both sizing applications appropriately and deciding how many applications to run at any one time on a host) is even more important. Generally, you don't plan out in detail which applications run on any one node; in many setups, the orchestrator allocates containers according to its own logic. Enforcing maximum memory consumption is therefore critical for containers. Control groups (cgroups), the foundation of virtually every container technology on Linux, also use the OOM killer to ensure that processes running in the same group (i.e., a container) don't allocate more memory than they're allowed to. When the processes in your containers try to exceed that limit, some will be terminated, often bringing their containers down with them.
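You can see the limit the kernel enforces for your own container or slice. A read-only sketch, assuming a cgroup v2 hierarchy mounted at /sys/fs/cgroup (the default on most modern distributions):

```shell
# Resolve the cgroup this shell belongs to and print its memory limit.
# "max" means unlimited; a byte count means the OOM killer will act
# within this group once usage crosses it (memory.max, cgroup v2).
cgpath=$(awk -F: '$1 == "0" {print $3}' /proc/self/cgroup)
limit_file="/sys/fs/cgroup${cgpath}/memory.max"
if [ -r "$limit_file" ]; then
  cat "$limit_file"
else
  echo "no cgroup v2 memory.max found (cgroup v1 host?)"
fi
```

In Kubernetes, this is the file your container's memory limit ultimately lands in, which is why exceeding a pod's limit produces an OOM kill rather than a failed allocation.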

At scale, everything is harder, including sizing. The more containers you run in your environments, the harder it is to understand when, how and why some of them go down. OOM killer can create unhealthy situations for your applications in which something is always crashing somewhere and getting restarted, producing a continuous stream of errors for your end users that skews your service-level objectives (SLOs) and is really hard to troubleshoot.

Where monitoring has let OOM killer slip through the cracks

Finding out why any single process has been disposed of by OOM killer depends a lot on the technology you use. Some software packages will log it in their own logs. Or you may end up running some command like the following on your hosts—on each of them:

     # RHEL / CentOS
     grep -i "out of memory" /var/log/messages
     # Debian / Ubuntu
     grep -i "out of memory" /var/log/kern.log
     # Or, on systemd-based distributions:
     journalctl -k | grep -i "out of memory"

Looks tame, but it’s definitely not the kind of task you want to run across your production fleet to try to understand why MySQL has kicked the bucket again at 3 AM. Especially when it’s on a hunch, since nothing else seems to explain why the database process is no longer there.

In other words, OOM killer is a system of undeniable importance and efficacy for reliability, but it provides very little observability. The IBM Instana platform is here to fix that for you.

How IBM Instana software detects OOM killer events with eBPF

Further building upon the eBPF foundation that brought you Crash Detector, IBM Instana software now comes with an out-of-the-box OOM killer detector. When a process monitored by IBM Instana software receives an OOM signal, you learn about it in real time: not only that it happened, but also how the situation was resolved (i.e., which process got killed).

[Screenshot caption: This process decided to fall on its sword, which was very honorable.]

As with most IBM Instana features, all you need to do is install the IBM Instana host agent and watch OOM killer go about its grim business. The platform also shows you how much memory the killed process had allocated at the time of the event, so you can understand why OOM killer marked it as "bad."

Problems you can solve with OOM killer detection

Determining how and why a process was terminated, or why the OOM killer singled it out, can take hours (if not days) to uncover without the proper tools. With the IBM Instana Crash Detector, users now immediately have the root cause for every abnormal process termination and every OOM killer event.

Need to understand why a container died? No problem. With IBM Instana Crash Detector's OOM killer detection, you'll know that perhaps your Java Virtual Machine (JVM), running a very important batch job, allocated more resources than it was allowed. Or maybe you need to determine why you're seeing so many Hypertext Preprocessor (PHP) request failures, or why your database disappeared. Again, with IBM Instana Crash Detector's OOM killer detection, you'll have immediate access to the root cause of these issues.

Save time on troubleshooting application performance issues with OOM killer detection

To get started saving yourself and your DevOps teams time troubleshooting OOM killer events, simply install the IBM Instana agent on your Linux OS today. If you don’t already have an IBM Instana instance, you can see how the IBM Instana Crash Detector with OOM killer detection works with a free trial.

Sign up for your free two-week trial

