IBM Cloud Private node problem detector and Draino
Problems on a node can affect the pods that run on it. IBM Cloud Private uses the node problem detector and Draino to identify problem nodes, then cordons and drains them so that the issues can be resolved and the pods rescheduled.
Important: This content is a technical preview, and should not be relied on in a production environment.
The node problem detector and Draino collect node problems from various daemons and make the issues visible to the upstream layers in the cluster management stack. If issues occur, IBM Cloud Private immediately marks the problematic nodes as unschedulable (cordons them) and drains them after a configurable delay, which defaults to 10 minutes. You can make the nodes schedulable again (uncordon them) after the problems are resolved.
To learn more or to try it out, see the Draino project and the Kubernetes node-problem-detector project.
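The cordon and uncordon cycle described above can be followed from the command line. The following is a minimal sketch, not part of the product: the helper name cordoned_nodes is hypothetical, and it relies only on the fact that kubectl get nodes reports SchedulingDisabled in the STATUS column of a cordoned node.

```shell
# Print the names of cordoned nodes, given `kubectl get nodes` output on stdin.
# A cordoned node shows SchedulingDisabled in its STATUS column (column 2).
# The header line is ignored because its second column is the word STATUS.
cordoned_nodes() {
  awk '$2 ~ /SchedulingDisabled/ { print $1 }'
}

# Usage (requires access to the cluster):
#   kubectl get nodes | cordoned_nodes
#   kubectl uncordon <node_name>   # after the problem on that node is resolved
```

Review each listed node before you uncordon it. A node that still has the underlying problem is likely to be cordoned again.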
Prerequisites
Ensure that each node in the IBM Cloud Private cluster has the directory /var/log/journal. If the directory does not exist, create it.
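The prerequisite check can be scripted and run on each node. This is a sketch under stated assumptions: the function name ensure_journal_dir is hypothetical, and the optional path argument exists only so that the function can be tried against a scratch directory. Creating /var/log/journal typically requires root privileges.

```shell
# Create the journal directory if it is missing.
# With no argument, the path defaults to /var/log/journal.
ensure_journal_dir() {
  dir="${1:-/var/log/journal}"
  if [ -d "$dir" ]; then
    echo "exists: $dir"
  else
    mkdir -p "$dir" && echo "created: $dir"
  fi
}

# Usage on each node:
#   ensure_journal_dir
```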
Installation
You can enable the node problem detector and Draino either during installation through the config.yaml file or after installation in the management console by using a Helm chart.
Enabling the node-problem-detector-draino parameter during cluster installation
- Following the installation procedure, during Step 3 (customize your cluster), open the /<installation_directory>/cluster/config.yaml file.
- In the list of management services, set node-problem-detector-draino to enabled. For example:
    management_services:
      istio: disabled
      vulnerability-advisor: disabled
      storage-glusterfs: disabled
      storage-minio: disabled
      key-management-hsm: disabled
      platform-security-netpols: disabled
      node-problem-detector-draino: enabled
- Save and exit the file.
The node problem detector and Draino are installed by the IBM Cloud Private installer during cluster installation.
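Before you start the installer, you might want to confirm that the setting was saved as intended. The sketch below is a plain text match, not a YAML parser, and the function name npd_draino_enabled is hypothetical. It assumes that the key appears on its own line, as in the example above.

```shell
# Succeed if the given config.yaml sets node-problem-detector-draino to enabled.
# Leading whitespace is allowed because the key is nested under management_services.
npd_draino_enabled() {
  grep -Eq '^[[:space:]]*node-problem-detector-draino:[[:space:]]*enabled[[:space:]]*$' "$1"
}

# Usage:
#   npd_draino_enabled /<installation_directory>/cluster/config.yaml && echo "enabled"
```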
Installing the node-problem-detector-draino chart for an existing cluster
Required user type or access level: Cluster administrator, team administrator, or operator
- Log in to the IBM Cloud Private management console.
- Click Catalog.
- Find the node-problem-detector-draino chart by using the search bar.
- Select the node-problem-detector-draino chart. A readme file displays information about installing, uninstalling, configuring, and other chart details for node-problem-detector-draino.
- To configure the chart, click Configure.
- Name the Helm release and select the kube-system namespace from the menu. The name must consist of lowercase alphanumeric characters or dash characters (-), and must start and end with an alphanumeric character.
- Ensure that you read and agree to the license agreement.
- Optional: Customize the All parameters fields to your preference.
- To deploy the node-problem-detector-draino chart and create a node-problem-detector-draino release, click Install.
Verifying the installation
After installation completes, verify that node-problem-detector-draino is created and running.
Ensure the corresponding Kubernetes pods are deployed and all containers are up. Run the following command:
kubectl -n kube-system get pods | grep -E "node-problem-detector|draino"
The output might resemble the following content. The pod IP column is displayed only if you add the -o wide option to the kubectl get pods command:
npm-draino-57df88dc45-cls7r 1/1 Running 0 2h 10.1.96.125 <none>
npm-node-problem-detector-68x5s 1/1 Running 0 2h 10.1.16.147 <none>
npm-node-problem-detector-8rzkq 1/1 Running 0 2h 10.1.62.146 <none>
npm-node-problem-detector-b2vzb 1/1 Running 0 2h 10.1.75.82 <none>
npm-node-problem-detector-bgbs4 1/1 Running 0 2h 10.1.249.116 <none>
npm-node-problem-detector-ltvjn 1/1 Running 0 2h 10.1.96.126 <none>
npm-node-problem-detector-r2drx 1/1 Running 0 2h 10.1.93.218 <none>
npm-node-problem-detector-t99f2 1/1 Running 0 2h 10.1.32.179 <none>
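The same check can be automated instead of being read by eye. The following sketch assumes the default column order of kubectl get pods (NAME, READY, STATUS) and uses a hypothetical helper name, npd_pods_healthy. It fails if no matching pod is found or if any matching pod is not Running with all of its containers ready.

```shell
# Read `kubectl get pods` lines on stdin and check the node-problem-detector
# and draino pods. Column 2 is READY (for example 1/1) and column 3 is STATUS.
npd_pods_healthy() {
  awk '
    /node-problem-detector|draino/ {
      seen++
      split($2, ready, "/")
      if ($3 != "Running" || ready[1] != ready[2]) {
        print "not ready: " $1
        bad++
      }
    }
    END { if (seen == 0 || bad > 0) exit 1 }
  '
}

# Usage (requires access to the cluster):
#   kubectl -n kube-system get pods | npd_pods_healthy && echo "all pods running"
```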
You are now ready to monitor your cluster nodes.