IBM Support

Cloud Pak for Security: "machine-config" and "monitoring" cluster operators have "degraded" status

Troubleshooting


Problem

After upgrading from a version such as 4.8.42 to a later version, the cluster operators report that they are at the latest version, but the "machine-config" and "monitoring" cluster operators are in the "degraded" status.

Cause

The Machine Config Operator component in charge of managing each individual node is the Machine Config Daemon (MCD), which runs as a daemonset in the openshift-machine-config-operator namespace. If the system state differs in any way from what the MCD expects, it marks the machine config pool as Degraded and also reflects that state in the machineconfiguration.openshift.io/state node annotation. It then stops taking any further action to avoid breaking the node.
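The places where this state surfaces can be sketched as follows. The oc commands are shown as comments because they require cluster access; the STATE value is a simulated example of what the node annotation reports on a degraded node, not output from a live cluster.

```shell
#!/bin/sh
# Where the degraded state shows up. On a live cluster you would run:
#
#   oc get machineconfigpool
#   oc get node <node> -o jsonpath='{.metadata.annotations.machineconfiguration\.openshift\.io/state}'
#
# Simulated annotation value for a degraded node:
STATE='Degraded'
if [ "$STATE" = 'Degraded' ]; then
  echo "MCD has marked this node Degraded and stopped applying changes"
fi
```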

Diagnosing The Problem

  1. Check the nodes. A degraded node typically reports a version that differs from the rest of the cluster:
    $ oc get node
    NAME                       STATUS   ROLES    AGE   VERSION
    master-0.ocp.example.net   Ready    master   34d   v1.17.1+9d33dd3
    master-1.ocp.example.net   Ready    master   34d   v1.17.1+9d33dd3
    master-2.ocp.example.net   Ready    master   34d   v1.17.1+9d33dd3
    worker-0.ocp.example.net   Ready    worker   34d   v1.17.1+9d33dd3
    worker-1.ocp.example.net   Ready    worker   34d   v1.17.1+9d33dd3
    worker-2.ocp.example.net   Ready    worker   34d   v1.17.1+912792b   <-------- Degraded
  2. Look in the machine-config-daemon logs for an error message similar to:
    Marking Degraded due to: unexpected on-disk state validating against rendered-worker-<node>:
    expected target osImageURL "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:<expected-image>",
    have "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:<actual-image>"
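Extracting the expected osImageURL from that message can be sketched as below. The LOG variable is a shortened sample line modeled on the message above, using the two image digests that appear in the logs in the resolution section; on a live cluster you would feed in the real machine-config-daemon log instead.

```shell
#!/bin/sh
# Pull the expected target osImageURL out of a captured MCD log line.
# LOG is a hypothetical sample built from the message format shown above.
LOG='Marking Degraded due to: unexpected on-disk state validating against rendered-worker-abc: expected target osImageURL "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:47e1213c98063dfd7f5ccae41e611a25446c8ac493cfdd05d8f1c46b61ab13d4", have "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:12c8c2c4fb915e49e2f1a42f5761b6f8cf1ee84393d22a3fe143bdabc98c05a8"'

# Capture the first quoted URL after "expected target osImageURL":
EXPECTED=$(printf '%s\n' "$LOG" | sed -n 's/.*expected target osImageURL "\([^"]*\)".*/\1/p')
echo "$EXPECTED"
```

The value printed is the image you pivot to in the resolution steps below.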

Resolving The Problem

Procedure to resolve the issue:
  1. Locate the correct osImageURL for that node in the machine-config-daemon logs. It looks similar to:
    quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:328a1e57fe5281f4faa300167cdf63cfca1f28a9582aea8d6804e45f4c0522a8
    Note: Different nodes might have different osImageURL values while in the middle of an upgrade. Regardless of the machine-os-content for that release, you need to get the osImageURL stated in the logs for that specific node.
  2. Access the failing node:
    $ oc debug node/[node_name]
    sh-4.4# chroot /host
  3. If you are using a proxy server, export any proxy variables needed.
    sh-4.4# export HTTP_PROXY=myproxy.com:80
    sh-4.4# export HTTPS_PROXY=myproxy.com:80
  4. Run the following command, where ${IMAGE} is the image obtained in step 1:
    sh-4.4# /run/bin/machine-config-daemon pivot "${IMAGE}"
  5. The output from step 4 looks similar to:
    sh-4.4# /run/bin/machine-config-daemon pivot 'quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:47e1213c98063dfd7f5ccae41e611a25446c8ac493cfdd05d8f1c46b61ab13d4'
    I1124 15:52:40.270660 36886 run.go:18] Running: nice -- ionice -c 3 oc image extract --path /:/run/mco-machine-os-content/os-content-736585590 --registry-config /var/lib/kubelet/config.json quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:47e1213c98063dfd7f5ccae41e611a25446c8ac493cfdd05d8f1c46b61ab13d4
    I1124 15:52:57.831857 36886 rpm-ostree.go:261] Running captured: rpm-ostree status --json
    I1124 15:52:58.300189 36886 rpm-ostree.go:179] Previous pivot: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:12c8c2c4fb915e49e2f1a42f5761b6f8cf1ee84393d22a3fe143bdabc98c05a8
    I1124 15:52:58.706233 36886 rpm-ostree.go:211] Pivoting to: 46.82.202011061621-0 (944e410d59634e95ebffd364b148a1ac4008b1d323459f37cbca97d689722366)
    I1124 15:52:58.706282 36886 rpm-ostree.go:243] Executing rebase from repo path /run/mco-machine-os-content/os-content-736585590/srv/repo with customImageURL pivot://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:47e1213c98063dfd7f5ccae41e611a25446c8ac493cfdd05d8f1c46b61ab13d4 and checksum 944e410d59634e95ebffd364b148a1ac4008b1d323459f37cbca97d689722366
    I1124 15:52:58.706296 36886 rpm-ostree.go:261] Running captured: rpm-ostree rebase --experimental /run/mco-machine-os-content/os-content-736585590/srv/repo:944e410d59634e95ebffd364b148a1ac4008b1d323459f37cbca97d689722366 --custom-origin-url pivot://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:47e1213c98063dfd7f5ccae41e611a25446c8ac493cfdd05d8f1c46b61ab13d4 --custom-origin-description Managed by machine-config-operator
    Note: If running rpm-ostree rebase displays an error similar to the following, go to step 6.
    error: error running rpm-ostree rebase --experimental /run/mco-machine-os-content/os-content-736585590/srv/repo:944e410d59634e95ebffd364b148a1ac4008b1d323459f37cbca97d689722366 --custom-origin-url pivot://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:47e1213c98063dfd7f5ccae41e611a25446c8ac493cfdd05d8f1c46b61ab13d4 --custom-origin-description Managed by machine-config-operator: error: No enabled repositories
  6. Run the rebase command again, adding the -C parameter:
    sh-4.4# rpm-ostree rebase -C --experimental /run/mco-machine-os-content/<path in the error> --custom-origin-url pivot://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:<SHA image> --custom-origin-description "Managed by machine-config-operator"
  7. Restart the node by running the command:
    sh-4.4# reboot
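The steps above can be summarized as a dry-run script. It only prints the commands that would be run inside the `oc debug node/<node>` + `chroot /host` session; it does not touch any node. IMAGE stands in for the osImageURL from step 1 (the digest below is the one from the sample logs in step 5), and the proxy hostname is the placeholder used in step 3.

```shell
#!/bin/sh
# Dry-run sketch of recovery steps 3-7: print the node-side commands.
IMAGE='quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:47e1213c98063dfd7f5ccae41e611a25446c8ac493cfdd05d8f1c46b61ab13d4'

PLAN=$(cat <<EOF
# 3. only if the node needs a proxy to reach the registry:
export HTTP_PROXY=myproxy.com:80
export HTTPS_PROXY=myproxy.com:80
# 4. pivot the node to the expected image:
/run/bin/machine-config-daemon pivot "$IMAGE"
# 6. fallback only if rpm-ostree reports "No enabled repositories":
# rpm-ostree rebase -C --experimental <path in the error> --custom-origin-url pivot://$IMAGE --custom-origin-description "Managed by machine-config-operator"
# 7. restart the node:
reboot
EOF
)
printf '%s\n' "$PLAN"
```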


Document Location

Worldwide


Document Information

Modified date:
31 October 2022

UID

ibm16831313