IBM Support

Cloud Pak for Security: Pods reporting OOMKilled status

Troubleshooting


Problem

In Cloud Pak for Security (CP4S), a pod reports OOMKilled status and restarts repeatedly until it is marked with CrashLoopBackOff status.

Symptom

The pod keeps cycling between the "Running" and "CrashLoopBackOff" states. The pod enters an OOMKilled (out-of-memory killed) state before it enters the CrashLoopBackOff state.

A review of the failed pod's details and logs might show OOMKilled in one of the following ways:

oc get pods -w | grep -i OOMKilled

storage-ingestion-pipeline-AA-BBBBBBBB-CCCCC                      0/1     OOMKilled     2 (45s ago)     2m23s

oc logs -f storage-ingestion-pipeline-AA-BBBBBBBB-CCCCC

{"label":"c.i.s.q.c.s.r.InMemoryCache","level":"error","ibm_datetime":"<DATE>","thread_name":"vert.x-eventloop-thread-0","message":"Schema id not found for schema","error":{"stack":""}}
{"label":"o.a.k.c.c.ConsumerConfig","level":"warn","ibm_datetime":"<DATE>","thread_name":"vert.x-eventloop-thread-5","message":"The configuration 'apicurio.registry.as-confluent' was supplied but isn't a known config."}
{"label":"o.a.k.c.u.AppInfoParser","level":"warn","ibm_datetime":"<DATE>","thread_name":"vert.x-eventloop-thread-5","message":"Error registering AppInfo mbean"}

oc describe pod tis-AAA-BBBBBBBBBB-CCCCC

    State:          Running
      Started:      <DATE>
    Last State:     Terminated
      Reason:       OOMKilled
      Exit Code:    137
      Started:      <DATE>
      Finished:     <DATE>

cp4s-01                                            cp4s-telemetry-operator-AAAAAAAAAA-BBBBB                             0/1     CreateContainerError              621        4d20h

    State:          Waiting
      Reason:       CreateContainerError
    Last State:     Terminated
      Reason:       OOMKilled 

Warning  FailedCreatePodSandBox  5m47s (x2018 over 5h19m)  kubelet  Failed to create pod sandbox: rpc error: code = Unknown desc = signal: killed 
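A pod that has already moved on to CrashLoopBackOff might no longer show OOMKilled in the STATUS column, so it can help to query each container's last termination reason directly. The following is a minimal sketch and not part of the original procedure: the helper name list_last_oom is hypothetical, and the commented oc command assumes a logged-in session with the affected project selected.

```shell
#!/bin/sh
# list_last_oom (hypothetical helper): reads lines of the form
#   pod-name<TAB>last-termination-reasons
# on stdin and prints the names of pods whose reasons include OOMKilled.
list_last_oom() {
  awk -F '\t' '$2 ~ /OOMKilled/ { print $1 }'
}

# Against a live cluster (assumes a logged-in oc session and the current project):
# oc get pods -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[*].lastState.terminated.reason}{"\n"}{end}' | list_last_oom
```

The jsonpath expression emits one line per pod, with the last termination reason of each container after the tab, so pods that never terminated produce an empty second field and are skipped.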

Cause

This OOMKilled error can have several causes, such as:
  • The pod configuration is incorrect
  • The pod is missing dependencies, such as secrets
  • The pod is consuming more memory than its resource limits allow
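The "Exit Code: 137" value in the oc describe output is consistent with the last cause. By convention, a process that is terminated by a signal is reported with an exit code of 128 plus the signal number, and 137 - 128 = 9 is SIGKILL, the signal that the Linux out-of-memory killer sends. The helper name below is hypothetical and the sketch only illustrates the arithmetic.

```shell
#!/bin/sh
# signal_from_exit_code (hypothetical helper): prints the signal number
# encoded in an exit code above 128, and prints nothing otherwise.
signal_from_exit_code() {
  code=$1
  if [ "$code" -gt 128 ]; then
    echo $(( code - 128 ))
  fi
}

# Exit code taken from the oc describe output in the Symptom section.
signal_from_exit_code 137
```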

Resolving The Problem

Note: The following instructions are for Cloud Pak for Security (CP4S) apps and not SOAR Cases, core services, third-party apps, or custom apps.
  1. Verify that the pod is configured with enough resources.
    1. Open the Red Hat OpenShift web user interface (UI).
    2. From the Developer view, select Project.
    3. Under Inventory, select Pods.
    4. Select the pod that reports OOMKilled status.
    5. Select the Metrics tab.
    6. Check the graphs to determine whether the pod's memory usage is at or near its configured limit.
  2. If the pod is overcommitted, select the Details tab to increase the pod memory setting.
    1. Under the Owner section, select the replica set.
    2. On the replica set page, under the Owner section, select the deployment config.
    3. Select the YAML tab.
    4. Search for the resource that you need to increase.
      For example, increase the memory request to approximately 1 gigabyte (1000Mi) and the memory limit to approximately 2 gigabytes (2000Mi):
      containers:
        - resources:
            limits:
              cpu: '10'
              memory: 2000Mi
            requests:
              cpu: 250m
              memory: 1000Mi
    5. Select Save.
  3. If the pod is not overcommitted on its configured resources, check that the node is not overcommitted.
    1. Select the Details tab.
    2. Under the Node section, select the node.
    3. Verify that neither of the following resource warnings is displayed:
      This node’s CPU resources are overcommitted. The total CPU resource limit of all pods exceeds the node’s total capacity. Pod performance will be throttled under high load.
      This node’s memory resources are overcommitted. The total memory resource limit of all pods exceeds the node’s total capacity. Pods will be terminated under high load.
    4. If the node is overcommitted, reschedule the pod to a node with more available resources. If you need assistance with scheduling, work with your Red Hat OpenShift administrator.
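The checks in steps 1 to 3 come down to two comparisons: pod memory usage against the pod's limit, and the sum of pod memory limits against node capacity. The following is a minimal sketch of that arithmetic, not a replacement for the console procedure. The helper names, the 90% threshold, and the sample quantities are all assumptions made for illustration, and only the Mi and Gi suffixes are handled.

```shell
#!/bin/sh
# Hypothetical helpers for illustration; none of these names come from the product.

# to_mib QUANTITY: convert a memory quantity with a Mi or Gi suffix to MiB.
to_mib() {
  case "$1" in
    *Gi) echo $(( ${1%Gi} * 1024 )) ;;
    *Mi) echo "${1%Mi}" ;;
    *)   echo "unsupported quantity: $1" >&2; return 1 ;;
  esac
}

# near_limit USAGE LIMIT: succeeds when usage is at or above 90% of the limit.
# The 90% threshold is an arbitrary choice for this example.
near_limit() {
  usage=$(to_mib "$1") || return 2
  limit=$(to_mib "$2") || return 2
  [ $(( usage * 100 )) -ge $(( limit * 90 )) ]
}

# memory_overcommitted CAPACITY LIMIT...: succeeds when the pod memory limits
# add up to more than the node capacity, as in the console warning.
memory_overcommitted() {
  capacity=$(to_mib "$1") || return 2
  shift
  total=0
  for quantity in "$@"; do
    mib=$(to_mib "$quantity") || return 2
    total=$(( total + mib ))
  done
  [ "$total" -gt "$capacity" ]
}

# Example: the 2000Mi limit from the YAML above and a made-up usage reading.
if near_limit 1900Mi 2000Mi; then
  echo "memory usage is close to the limit"
fi

# Command-line counterparts on a live cluster (assume a logged-in oc session;
# names in angle brackets are placeholders):
# oc adm top pod <pod-name> --containers
# oc set resources deployment/<deployment-name> --limits=memory=2000Mi --requests=memory=1000Mi
#   (use dc/<name> instead if the owner is a deployment config)
# oc describe node <node-name>
```

The usage figures that oc adm top pod reports depend on cluster metrics being available, and the Allocated resources section of oc describe node lists the total requests and limits that the node check compares against capacity.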
If pods continue to report OOMKilled status for supported out-of-the-box functionality or supported apps after you complete these steps, contact IBM Support.

Document Location

Worldwide

Document Information

Modified date:
11 January 2023

UID

ibm16854285