The cloud may make it seem like, if you have enough budget, you have infinite computing power at your disposal. However, that computing power comes in slices. Hosts (physical and virtual alike), containers and functions—they all come with limitations on how much memory you can allocate.

On Linux, the out-of-memory (OOM) killer is a kernel mechanism in charge of preventing processes from collectively exhausting the host’s memory. When a process tries to allocate more memory than is available, the process with the highest overall badness score (based, for example, on how much memory it uses and on any per-process adjustment that has been configured) receives what is often called an OOM signal. In practice this is a SIGKILL, which the process cannot catch or ignore. This fundamentally means: “You are way out of line, and it is lights out.”
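To make the badness score less abstract, here is a minimal sketch that lists the processes the kernel currently rates as the likeliest victims. It relies on the per-process files `/proc/<pid>/oom_score`, `/proc/<pid>/oom_score_adj` and `/proc/<pid>/comm` that Linux exposes; the function names and the `proc_root` parameter are illustrative assumptions, not part of any standard tool.

```python
from pathlib import Path


def read_oom_scores(proc_root="/proc"):
    """Return {pid: (oom_score, oom_score_adj, command)} for readable processes."""
    scores = {}
    for entry in Path(proc_root).iterdir():
        if not entry.name.isdigit():
            continue  # only numeric directories represent processes
        try:
            score = int((entry / "oom_score").read_text())
            adj = int((entry / "oom_score_adj").read_text())
            name = (entry / "comm").read_text().strip()
        except (OSError, ValueError):
            continue  # the process exited or the files are not readable
        scores[int(entry.name)] = (score, adj, name)
    return scores


def likeliest_victims(scores, top=5):
    """Return the `top` entries with the highest oom_score, highest first."""
    ranked = sorted(scores.items(), key=lambda item: item[1][0], reverse=True)
    return ranked[:top]
```

Calling `likeliest_victims(read_oom_scores())` on a Linux host returns the processes the kernel would currently consider first. The scores change as memory usage changes, so this is a snapshot rather than a prediction.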

Notice that the process that triggers the OOM may not be the process that receives the OOM signal. An application that hasn’t recently increased its memory usage may all of a sudden be issued an OOM signal because too many other applications have started on the same host.
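The following toy model illustrates why the requester and the victim can differ. It is a deliberate simplification and not the kernel’s actual code: the score here is just resident memory plus an `oom_score_adj` adjustment scaled to the host size, and the `Proc` class, `badness`, and `allocate` are hypothetical names invented for this sketch.

```python
from dataclasses import dataclass


@dataclass
class Proc:
    name: str
    rss_pages: int          # resident memory, in pages
    oom_score_adj: int = 0  # -1000 (exempt) to 1000 (kill first)


def badness(proc, total_pages):
    """Simplified score: memory in use plus an adjustment scaled to host size."""
    if proc.oom_score_adj == -1000:
        return 0  # treated as exempt in this toy model
    return proc.rss_pages + (proc.oom_score_adj * total_pages) // 1000


def allocate(procs, requester, pages, total_pages):
    """Grow `requester`; return the chosen victim if the host is exhausted, else None."""
    requester.rss_pages += pages
    if sum(p.rss_pages for p in procs) <= total_pages:
        return None
    return max(procs, key=lambda p: badness(p, total_pages))


# A host with 1000 pages: the worker asks for more, the database pays for it.
database = Proc("database", rss_pages=700)
worker = Proc("worker", rss_pages=250)
victim = allocate([database, worker], worker, pages=100, total_pages=1000)
print(victim.name)  # prints "database"
```

In this model, the database is selected although it did not allocate anything new, simply because it holds the most memory at the moment the host runs out.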

The mechanics of an OOM signal sound harsh, but they are actually a very effective way to prevent memory exhaustion on hosts, especially when applications are not sized correctly or too many applications run in parallel (i.e., the hosts are not sized correctly for the workload).

For containerized platforms like Kubernetes, Cloud Foundry and Nomad, the use of memory (both in terms of sizing applications appropriately and of how many applications to run at any one time on a host) is even more important. Generally, you don’t plan out in detail which applications run on any one node. In many setups, the orchestrator places containers on hosts according to its own scheduling logic. Enforcing maximum memory consumption is therefore critical for containers. Control groups (cgroups), the foundation of virtually every container technology on Linux, also use the OOM killer to ensure that processes running in the same group (i.e., a container) don’t allocate more memory than they’re allowed to. When the processes in your containers try to allocate more memory than they’re allowed to, some will be terminated, often bringing their containers down with them.
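As a rough illustration of how a cgroup limit looks from the inside, the sketch below reads the cgroup v2 files `memory.max`, `memory.current` and `memory.events`. It assumes cgroup v2 is in use and that the container’s own cgroup is mounted at `/sys/fs/cgroup`, which is common but not guaranteed; the function names are hypothetical, and the root cgroup of a host does not have a `memory.max` file.

```python
from pathlib import Path


def parse_memory_max(text):
    """Return the limit in bytes, or None when the cgroup is unlimited ("max")."""
    text = text.strip()
    return None if text == "max" else int(text)


def parse_memory_events(text):
    """Parse "key value" lines from memory.events into a dict of counters."""
    events = {}
    for line in text.splitlines():
        key, _, value = line.partition(" ")
        if value:
            events[key] = int(value)
    return events


def cgroup_memory_report(cgroup_dir="/sys/fs/cgroup"):
    """Summarize limit, usage, headroom and OOM kills for one cgroup."""
    base = Path(cgroup_dir)
    limit = parse_memory_max((base / "memory.max").read_text())
    current = int((base / "memory.current").read_text())
    events = parse_memory_events((base / "memory.events").read_text())
    return {
        "limit_bytes": limit,
        "current_bytes": current,
        "headroom_bytes": None if limit is None else limit - current,
        "oom_kills": events.get("oom_kill", 0),
    }
```

A non-zero `oom_kills` value means that at least one process in that group has already been killed for exceeding the group’s limit, even if the host as a whole still had free memory.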

At scale, everything is harder, including sizing. The more containers you run in your environments, the harder it is to understand when, how and why some of them go down. The OOM killer can create unhealthy situations for your applications in which something is always crashing somewhere and then getting restarted, creating a steady stream of errors for your end users that eats into your service-level objectives (SLOs) and is really hard to troubleshoot.
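One way to get a first overview on Kubernetes is to look for containers whose last termination reason was `OOMKilled`, which Kubernetes records in each pod’s `status.containerStatuses[].lastState.terminated.reason`. The sketch below assumes the pod list has been exported as JSON, for example with `kubectl get pods --all-namespaces -o json`; the function name `find_oom_killed` is hypothetical.

```python
import json


def find_oom_killed(pod_list):
    """Return containers whose most recent termination reason was OOMKilled."""
    hits = []
    for pod in pod_list.get("items", []):
        meta = pod.get("metadata", {})
        statuses = pod.get("status", {}).get("containerStatuses") or []
        for status in statuses:
            terminated = (status.get("lastState") or {}).get("terminated") or {}
            if terminated.get("reason") == "OOMKilled":
                hits.append({
                    "namespace": meta.get("namespace"),
                    "pod": meta.get("name"),
                    "container": status.get("name"),
                    # restartCount covers all restarts, not only OOM kills
                    "restarts": status.get("restartCount", 0),
                })
    return hits


def find_oom_killed_in_json(text):
    """Convenience wrapper for JSON text, e.g. read from a file or a pipe."""
    return find_oom_killed(json.loads(text))
```

This only surfaces the most recent termination of each container, so it is a starting point for troubleshooting rather than a full history of OOM events.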