Troubleshooting

This section lists issues that you might encounter while operating the Spyre Operator.

Known issues

Missing /etc/aiu/senlib_config.json
  • Check that no pre-defined volume is mounted at the path /etc/aiu. This folder is reserved for the volume that the Spyre device plugin mounts.
    For example, the following volume mount must be removed:
    volumeMounts:
    - mountPath: /etc/aiu
      name: config
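
To find such a mount before deploying, you can scan your workload manifests. The following is a minimal sketch and not part of the Spyre tooling: the helper name find_aiu_mounts and the file name my-app.yaml are hypothetical, and the pattern assumes that the mount path is written on a single mountPath: line.

```shell
# Hypothetical helper: list manifest lines that mount a volume at /etc/aiu
# or at a path below it. Prints file:line:text for every match.
find_aiu_mounts() {
  # -H always prints the file name, -n the line number, -E enables extended regex.
  grep -HnE 'mountPath:[[:space:]]*"?/etc/aiu(/|"|[[:space:]]|$)' "$@"
}

# Usage (my-app.yaml is a placeholder for your own manifest):
#   find_aiu_mounts my-app.yaml
```

Any line that the helper reports points to a volumeMounts entry to remove from the workload definition.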

Pod is not scheduled as expected (Pending)

  • Check the requested resource name, especially for the experimental per-device allocation pool.
  • Confirm the status of SpyreNodeState and the capacity and allocatable values of the node.
  • Run the following command to check the allocatable resources of the node:
    oc describe node <workernode-name> | grep Allocatable -A11
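
The checks above can be complemented by asking the scheduler why the pod is still Pending. The following is a minimal sketch: the helper names pending_reason and node_allocatable are hypothetical, and the pod, namespace, and node names are placeholders. It relies only on the standard PodScheduled pod condition and the status.allocatable field of a node.

```shell
# Print the scheduler's message for a pod that is not yet scheduled
# (the output is empty once the pod has been placed on a node).
pending_reason() {
  pod="$1"; ns="$2"
  oc get pod "$pod" -n "$ns" \
    -o jsonpath='{.status.conditions[?(@.type=="PodScheduled")].message}'
}

# Print the allocatable resource map of a node.
node_allocatable() {
  oc get node "$1" -o jsonpath='{.status.allocatable}'
}

# Usage (all names are placeholders):
#   pending_reason <pod-name> <namespace>
#   node_allocatable <workernode-name>
```

A message that reports an insufficient amount of a resource usually means that the requested resource name does not match any resource that the node advertises, or that all matching devices are already allocated.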

Container status unknown

This state can occur when a Spyre resource is subject to a race between multiple resource pools, such as the default pool and the per-device allocation pool of experimental mode. Restarting the pod manually should return it to the Pending state until the device is released.

ERROR client-go Failed to update lock: resource name may not be empty

If the message "ERROR client-go Failed to update lock: resource name may not be empty" appears in the operator log and is followed by an operator restart, the operator (manager) process could not properly connect to the Kubernetes API server. You can view the operator log with the tool of your choice. With oc, use the following command, substituting the pod hash code for the XXX token:
    oc logs spyre-operator-XXX
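
To judge whether the error is sporadic or persistent, it can help to count how often it appears. The following is a minimal sketch; the helper name count_lock_errors is hypothetical, and the pod name follows the spyre-operator-XXX pattern used above.

```shell
# Count occurrences of the lock-update error in log text read from stdin.
count_lock_errors() {
  # -F matches the message literally; -c prints the number of matching lines.
  # grep exits non-zero when nothing matches, so "|| true" keeps the helper
  # usable in scripts that run with "set -e".
  grep -cF 'Failed to update lock: resource name may not be empty' || true
}

# Usage (replace XXX with the pod hash code):
#   oc logs spyre-operator-XXX | count_lock_errors
#   oc logs --previous spyre-operator-XXX | count_lock_errors
```

The --previous flag shows the log of the container instance that ran before the most recent restart, which is where the error is likely to be found after the operator has restarted.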

Recommended actions

This is expected behavior for any operator that is implemented with the operator runtime, and it indicates a network issue: communication between the node and the API server was interrupted. If the error is sporadic and the same error appears in the logs of other operators, it can safely be ignored.

If the error is persistent, verify that the network connection between the node that runs the operator pod and the API server is working and that there are no other underlying issues.

If the cluster has enough resources, consider increasing the replica count of the operator deployment to mitigate service disruptions.
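
Increasing the replica count could look like the following. This is a minimal sketch under stated assumptions: the helper name scale_operator is hypothetical, the name of the operator Deployment is not given in this section and must be looked up first, and the spyre-operator namespace is assumed from the health checker issue described later in this section.

```shell
# Scale the operator Deployment. All three arguments come from the caller;
# the Deployment name is an assumption and must match your installation.
scale_operator() {
  deployment="$1"; ns="$2"; replicas="$3"
  oc scale "deployment/$deployment" -n "$ns" --replicas="$replicas"
}

# Usage:
#   oc get deployments -n spyre-operator        # find the operator Deployment name
#   scale_operator <deployment-name> spyre-operator 2
```

If the operator was installed through OperatorHub, the Operator Lifecycle Manager may revert a manual change to the Deployment, so confirm the replica count again after a few minutes.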

MutatingAdmissionWebhook failed to complete mutation in xxs

This state occurs when the accumulated WebhookGuardWaitTimeInSec exceeds the webhook timeout. WebhookGuardWaitTimeInSec was introduced to avoid errors that are caused by conflicts between pods. Whether any two pods, A and B, conflict is determined by the resources that they request, as follows:
Pod A              Pod B              Conflict
spyre_vf           spyre_vf           No
spyre_vf           spyre_vf_(device)  Yes
spyre_vf_(device)  spyre_vf_(device)  No
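
The rule in the table can be sketched as a small shell check. This is an illustration only and not the webhook's implementation: the function names pool_of and pods_conflict are hypothetical, resource names are written as in the table without any vendor prefix, and a per-device resource is assumed to be any name with a non-empty suffix after spyre_vf_.

```shell
# Classify a requested resource name by the pool it belongs to.
pool_of() {
  case "$1" in
    spyre_vf)    echo default ;;  # default pool
    spyre_vf_?*) echo device ;;   # per-device pool (suffix format is an assumption)
    *)           echo other ;;    # not a Spyre resource
  esac
}

# Print "Yes" if pods that request resources $1 and $2 conflict, else "No".
# Per the table, only a mix of the default pool and a per-device pool conflicts.
pods_conflict() {
  a=$(pool_of "$1"); b=$(pool_of "$2")
  case "$a:$b" in
    default:device|device:default) echo Yes ;;
    *) echo No ;;
  esac
}

# Usage:
#   pods_conflict spyre_vf spyre_vf_0
```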

Actions

  • Use a Deployment instead of a Pod to deploy your application (recommended).
  • If you choose to use a Pod, manually retry the deployment after a few minutes.

SpyreClusterPolicy healthChecker pods in CrashLoopBackOff

After applying the SpyreClusterPolicy (version 1.2.0), the policy remains in a Not Ready state.

The following symptoms are observed:

  • spyre-health-checker-* pods are in CrashLoopBackOff in the spyre-operator namespace.
  • The cluster policy does not complete reconciliation.
  • The healthChecker pod logs errors such as the following:
    ERROR server/server.go:88 failed to listen: listen unix /usr/local/etc/device-plugins/spyre-sockets/health-checker.sock: bind: permission denied
    
    FATAL health-checker/main.go:47 Error starting insecure gRPC Server:
    listen unix /usr/local/etc/device-plugins/spyre-sockets/health-checker.sock: bind: permission denied
    

Cause

The issue occurs due to insufficient permissions on the host path used for device plugin sockets:

/usr/local/etc/device-plugins/

The spyre-health-checker container attempts to create and bind a UNIX socket:

/usr/local/etc/device-plugins/spyre-sockets/health-checker.sock

However, without proper host-level permissions, the operation fails with:

bind: permission denied

Resolution

Apply the required MachineConfig to set appropriate permissions for the device plugin directory on cluster nodes.

Refer to IBM documentation: https://www.ibm.com/docs/en/rhocp-ibm-z?topic=operatorhub-configure-machineconfig#concept_jnv_fwm_lhc__title__4

  • Steps
    1. Apply the MachineConfig:
      oc apply -f <machineconfig>.yaml
      For example:
      oc apply -f 50-spyre-device-plugin-selinux-minimal.yaml
    2. Wait for the MachineConfigPool rollout to complete:
      oc get mcp
    3. Once the nodes are updated, delete the failing pods:
      oc delete pod -n spyre-operator -l app=spyre-health-checker
    4. Verify:
      • Pods transition to Running
      • SpyreClusterPolicy reaches Ready state
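
The rollout wait and the pod verification in the steps above can be scripted. The following is a minimal sketch: the helper names mcp_updated and health_checker_ready are hypothetical, the pool name worker is an assumption, and the checks rely on the standard Updated condition of a MachineConfigPool and the standard Ready condition of a pod. The pod check uses the Ready condition rather than the pod phase because a pod whose container is crash-looping can still report the Running phase.

```shell
# Succeed if the MachineConfigPool (default: worker, an assumption) reports Updated.
mcp_updated() {
  [ "$(oc get mcp "${1:-worker}" \
      -o jsonpath='{.status.conditions[?(@.type=="Updated")].status}')" = "True" ]
}

# Succeed only if at least one health-checker pod exists and all of them are Ready.
health_checker_ready() {
  statuses=$(oc get pods -n spyre-operator -l app=spyre-health-checker \
    -o jsonpath='{range .items[*]}{.status.conditions[?(@.type=="Ready")].status}{"\n"}{end}')
  [ -n "$statuses" ] || return 1
  for s in $statuses; do
    [ "$s" = "True" ] || return 1
  done
}

# Usage (the polling intervals are arbitrary):
#   until mcp_updated worker; do sleep 30; done      # wait for the rollout
#   until health_checker_ready; do sleep 10; done    # after deleting the failing pods
```

A pool may still report Updated for a short time right after the MachineConfig is applied, so check the oc get mcp output once more before you delete the pods.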

Example Outcome After Fix

  • spyre-health-checker-* pods start successfully
  • No permission errors in logs
  • SpyreClusterPolicy status becomes Ready

Additional Notes

  • This issue specifically impacts deployments where healthChecker is enabled in SpyreClusterPolicy.
  • The same permission requirement may apply to other components using the device plugin socket directory.
  • Document this MachineConfig step and include it in your pre-installation or post-installation validation.