IBM Cloud Private node problem detector and Draino

Issues that arise on a node can affect the pods that run on it. IBM Cloud Private uses the node problem detector and Draino to identify problem nodes, and then to unschedule and drain them so that the issues can be resolved and the pods rescheduled.

Important: This content is a technical preview and should not be relied on in a production environment.

The node problem detector and Draino collect node problems from various daemons and make the issues visible to the upstream layers in the cluster management stack. If issues occur, IBM Cloud Private unschedules (cordons) the problematic nodes immediately and drains them after a configurable amount of time, which is 10 minutes by default. You can reschedule (uncordon) the nodes after the problems are resolved.
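
The cordon and uncordon cycle described above can be followed from the command line. The following is a minimal sketch, assuming that kubectl is already configured for the cluster; the helper names list_cordoned_nodes and uncordon_node are hypothetical and are not part of IBM Cloud Private.

```shell
# Hypothetical helpers, shown only as a sketch.

# List nodes that are currently cordoned. In "kubectl get nodes" output,
# the STATUS column of a cordoned node contains "SchedulingDisabled".
list_cordoned_nodes() {
  kubectl get nodes --no-headers | awk '$2 ~ /SchedulingDisabled/ { print $1 }'
}

# Make a node schedulable again after its problem has been fixed.
uncordon_node() {
  if [ -z "$1" ]; then
    echo "usage: uncordon_node <node-name>" >&2
    return 1
  fi
  kubectl uncordon "$1"
}
```

For example, run list_cordoned_nodes to see which nodes were cordoned, fix the underlying problem on a node, and then run uncordon_node with that node's name.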

To learn more, see the Draino project and the Kubernetes node-problem-detector project.

Prerequisites

Ensure that each node in the IBM Cloud Private cluster has the directory /var/log/journal. If the directory does not exist, create it.
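
One way to perform this check is sketched below. The function name ensure_journal_dir is hypothetical. The sketch assumes that you run it on each node as a user who is allowed to create directories under /var/log, typically root.

```shell
# Hypothetical helper: create the journal directory only if it is missing.
# The optional first argument overrides the default path.
ensure_journal_dir() {
  journal_dir="${1:-/var/log/journal}"
  if [ -d "$journal_dir" ]; then
    echo "exists: $journal_dir"
  else
    mkdir -p "$journal_dir" && echo "created: $journal_dir"
  fi
}
```

Calling ensure_journal_dir with no argument checks /var/log/journal, and the function is safe to run repeatedly on the same node.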

Installation

You can enable the node problem detector and Draino either during installation through the config.yaml file or after installation in the management console by using a Helm chart.

Enabling the node-problem-detector-draino parameter during cluster installation

During Step 3 of the installation procedure (customize your cluster), open the /<installation_directory>/cluster/config.yaml file.

In the list of management services, set node-problem-detector-draino to enabled. For example:

management_services:
 istio: disabled
 vulnerability-advisor: disabled
 storage-glusterfs: disabled
 storage-minio: disabled
 key-management-hsm: disabled
 platform-security-netpols: disabled
 node-problem-detector-draino: enabled

Save and exit the file.
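
Before you continue with the installation, you might want to confirm that the setting was saved as intended. The following sketch uses a hypothetical helper named check_npd_enabled that searches a config.yaml file for the enabled entry. It assumes that the entry is on a single line, as in the example above.

```shell
# Hypothetical helper: succeed only if the file contains a line of the
# form "node-problem-detector-draino: enabled" (surrounding spaces allowed).
check_npd_enabled() {
  config_file="$1"
  grep -Eq '^[[:space:]]*node-problem-detector-draino:[[:space:]]*enabled[[:space:]]*$' "$config_file"
}

# Usage, with the path to your own config.yaml file:
#   check_npd_enabled /<installation_directory>/cluster/config.yaml && echo "enabled"
```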

The node problem detector and Draino are installed by the IBM Cloud Private installer during cluster installation.

Installing the node-problem-detector-draino chart for an existing cluster

Required user type or access level: Cluster administrator, team administrator, or operator

Log in to the IBM Cloud Private management console.
Click Catalog.
Find the node-problem-detector-draino chart by using the search bar.
Select the node-problem-detector-draino chart. A readme file displays information about installing, uninstalling, and configuring node-problem-detector-draino, along with other chart details.
To configure the chart, click Configure.
Name the Helm release and select the kube-system namespace from the menu. The name must consist of lowercase alphanumeric characters or dash characters (-), and must start and end with an alphanumeric character.
Ensure that you read and agree to the license agreement.
Optional: Customize the All parameters fields to your preference.
To deploy the node-problem-detector-draino chart and create a node-problem-detector-draino release, click Install.

Verifying the installation

After installation completes, verify that node-problem-detector-draino is deployed and running:

Ensure the corresponding Kubernetes pods are deployed and all containers are up. Run the following command:

kubectl -n kube-system get pods | grep -E "node-problem-detector|draino"

The output might resemble the following content:
npm-draino-57df88dc45-cls7r          1/1     Running     0       2h      10.1.96.125       <none>
npm-node-problem-detector-68x5s      1/1     Running     0       2h      10.1.16.147       <none>
npm-node-problem-detector-8rzkq      1/1     Running     0       2h      10.1.62.146       <none>
npm-node-problem-detector-b2vzb      1/1     Running     0       2h      10.1.75.82        <none>
npm-node-problem-detector-bgbs4      1/1     Running     0       2h      10.1.249.116      <none>
npm-node-problem-detector-ltvjn      1/1     Running     0       2h      10.1.96.126       <none>
npm-node-problem-detector-r2drx      1/1     Running     0       2h      10.1.93.218       <none>
npm-node-problem-detector-t99f2      1/1     Running     0       2h      10.1.32.179       <none>
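
Rather than reading the list by eye, you can script the same check. The following is a sketch only: the function name check_npd_pods is hypothetical, and it assumes the default column order of kubectl get pods (NAME, READY, STATUS, and so on).

```shell
# Hypothetical helper: succeed only if at least one matching pod exists
# and every matching pod is Running with all of its containers ready.
check_npd_pods() {
  kubectl -n kube-system get pods --no-headers \
    | grep -E "node-problem-detector|draino" \
    | awk '
        { split($2, ready, "/") }
        $3 != "Running" || ready[1] != ready[2] { bad = 1; print "not ready: " $1 }
        END {
          if (NR == 0) { print "no matching pods found"; exit 1 }
          exit (bad ? 1 : 0)
        }'
}
```

The function returns a nonzero status and names the affected pods if any pod is not fully ready, so it can be used in a script, for example: check_npd_pods && echo "all pods ready".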

You are now ready to monitor your cluster nodes.