IBM Support

API Connect OVA: Kubernetes Fails to Start Due to ResourceExhausted Error

Troubleshooting


Problem

In API Connect OVA deployments, Kubernetes may fail to start properly because of a ResourceExhausted error from the container runtime. The error occurs when the container runtime (containerd) attempts to list an excessively large number of containers, producing a gRPC response that exceeds the maximum allowed message size.

The error message observed is:
rpc error: code = ResourceExhausted desc = grpc: trying to send message larger than max (16835417 vs. 16777216)
This condition typically arises when thousands of paused containers accumulate on the portal VM, overwhelming the container runtime.
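
The limit in the message, 16777216 bytes, corresponds to the 16 MiB maximum gRPC message size used on the CRI connection, so listing calls such as ListPodSandbox fail as soon as the serialized container list grows beyond that size. As a quick first check, the paused containers can be counted with ctr, which talks to containerd directly and is not affected by the failing CRI calls (a minimal sketch; the k8s.io namespace and pause image tag are taken from the examples later in this document):

# Count containers created from the pause image; ctr still responds
# when crictl fails with ResourceExhausted
ctr -n k8s.io containers ls | grep -c "registry.k8s.io/pause:3.6"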

Diagnosing The Problem

  1. Access the Portal VM
    ssh <portal-node>
    sudo -i
  2. Run crictl pods
    crictl pods
    
    Example:
    root@subinvm1:~# crictl pods
    E0908 17:58:44.529175  22655 remote_runtime.go:277] "ListPodSandbox with filter from runtime service failed" err="rpc error: code = ResourceExhausted desc = grpc: trying to send message larger than max (16835417 vs. 16777216)" filter="&PodSandboxFilter{Id:,State:nil,LabelSelector:map[string]string{},}"
    FATA[0000] listing pod sandboxes: rpc error: code = ResourceExhausted desc = grpc: trying to send message larger than max (16835417 vs. 16777216)

    If the command fails with a ResourceExhausted error, proceed to the next step.

  3. Check kubelet logs
    journalctl -u kubelet.service --since today
    
    Example:
    root@subinvm1:~# journalctl -u kubelet.service --since today
    E0904 14:47:48.127461    3664 remote_runtime.go:277] "ListPodSandbox with filter from runtime service failed" err="rpc error: code = ResourceExhausted desc = grpc: trying to send message larger than max (16835417 vs. 16777216)" filter="nil"
  4. List containers sorted by image
    ctr -n k8s.io containers ls | awk 'NR==1{print; next} {print $0 | "sort -k2"}'
    
    Example output :
    CONTAINER                                                           IMAGE                          RUNTIME                  
    0005065400712b61b6aba3f1827f535b643dc3ce85a540e98329a9c4acda7c1d    registry.k8s.io/pause:3.6      io.containerd.runc.v2    
    

    You may observe tens of thousands of containers using the image registry.k8s.io/pause:3.6 (see the per-image count sketch after this list).
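
To see which image accounts for the bulk of the containers, the same listing can be grouped and counted per image (a minimal sketch; the awk expression simply skips the header row and prints the IMAGE column):

# Count containers per image, largest group first
ctr -n k8s.io containers ls | awk 'NR>1 {print $2}' | sort | uniq -c | sort -rn | head

On an affected portal VM, the registry.k8s.io/pause:3.6 line typically dominates this output by a wide margin.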

Resolving The Problem

Ensure that you have current backups before proceeding. For more information, see the backup and restore instructions in the IBM API Connect documentation.

Step 1: Clean Up Excess Paused Containers

Create a cleanup script to remove the paused containers (the steps that follow assume it is saved as delete_pause_containers_with_logging.sh):

#!/bin/bash

# List all containers and filter those with image registry.k8s.io/pause:3.6
ctr -n k8s.io containers ls | grep "registry.k8s.io/pause:3.6" | awk '{print $1}' | while read -r CONTAINER_ID; do
  echo "
Processing container: $CONTAINER_ID" | tee -a deletion.log

  # Try to kill the task (if running)
  if ctr -n k8s.io tasks kill "$CONTAINER_ID" 2>>deletion.log; then
    echo "Task killed for container $CONTAINER_ID" | tee -a deletion.log
  else
    echo "No running task to kill for container $CONTAINER_ID" | tee -a deletion.log
  fi

  # Try to delete the task (if exists)
  if ctr -n k8s.io tasks delete "$CONTAINER_ID" 2>>deletion.log; then
    echo "Task deleted for container $CONTAINER_ID" | tee -a deletion.log
  else
    echo "No task to delete for container $CONTAINER_ID" | tee -a deletion.log
  fi

  # Delete the container
  if ctr -n k8s.io containers delete "$CONTAINER_ID" 2>>deletion.log; then
    echo "Container $CONTAINER_ID deleted successfully" | tee -a deletion.log
  else
    echo "Failed to delete container $CONTAINER_ID" | tee -a deletion.log
  fi

done
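
Before running the script, it can be useful to preview how many containers it will target and spot-check a few of the matching IDs; the preview below uses the same grep filter as the script (a minimal sketch):

# Total number of pause containers the script will process
ctr -n k8s.io containers ls | grep -c "registry.k8s.io/pause:3.6"

# First few matching container IDs
ctr -n k8s.io containers ls | grep "registry.k8s.io/pause:3.6" | awk '{print $1}' | head -n 5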

Step 2: Execute the Script

  1. Copy the script to the portal VM.
  2. Set executable permission:
    chmod +x delete_pause_containers_with_logging.sh
  3. Run the script:
    ./delete_pause_containers_with_logging.sh
  4. Save the output to a text file for reference (for example, by piping it through tee as shown below).
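
If preferred, the run and the log capture can be combined by piping the script output through tee (a sketch; run_output.txt is an arbitrary file name, and the script additionally appends its own messages to deletion.log in the working directory):

./delete_pause_containers_with_logging.sh 2>&1 | tee run_output.txt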

Step 3: Reboot the System

reboot

Step 4: Verify System Health

sudo -i
apic status
kubectl get pods -A

Confirm that Kubernetes components are running and pods are in a healthy state.
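
To confirm that the original symptom is resolved, the listing commands from the diagnosis section can be re-run; both should now complete without a ResourceExhausted error, and the pause container count should be back to roughly one per running pod (a minimal sketch):

# Should list pod sandboxes without a ResourceExhausted error
crictl pods

# Should report roughly one pause container per pod rather than thousands
ctr -n k8s.io containers ls | grep -c "registry.k8s.io/pause:3.6"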

Document Location

Worldwide

[{"Type":"MASTER","Line of Business":{"code":"LOB77","label":"Automation Platform"},"Business Unit":{"code":"BU048","label":"IBM Software"},"Product":{"code":"SSMNED","label":"IBM API Connect"},"ARM Category":[{"code":"a8mKe000000CaZXIA0","label":"API Connect-\u003EAPIC Platform - Other"}],"ARM Case Number":"TS020270900","Platform":[{"code":"PF025","label":"Platform Independent"}],"Version":"10.0.5"}]

Document Information

Modified date:
11 September 2025

UID

ibm17244712