Troubleshooting
This section lists issues that you might encounter while operating the Spyre Operator.
Known issues
- Do not pre-define a volume mounted at the path /etc/aiu. This folder is reserved for the volume mounted by the Spyre device plugin only. For example, a pod spec must not contain a mount such as:
volumeMounts:
- mountPath: /etc/aiu
  name: config
Pod is not scheduled as expected (Pending)
- Check the requested resource name, especially for the experimental per-device allocation pool.
- Confirm the status of SpyreNodeState and the node capacity/allocatable values.
- Run the following command to check the allocatable resources of the node:
oc describe node <worker-node-name> | grep Allocatable -A11
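As a complement to the checks above, the following is a minimal sketch for reading the scheduler's own explanation of a Pending pod and for printing a node's allocatable resources in machine-readable form. The function names are hypothetical helpers, and the pod, namespace, and node names are placeholders that you supply. Only the standard `oc describe` and `oc get -o jsonpath` calls are assumed.

```shell
# Hypothetical helper: print the Events section of a pod, where the
# scheduler records FailedScheduling messages for Pending pods.
# $1 = pod name, $2 = namespace
show_pending_reason() {
  local pod="$1" ns="$2"
  oc describe pod "$pod" -n "$ns" | sed -n '/^Events:/,$p'
}

# Hypothetical helper: print the allocatable resources of a node.
# $1 = node name
show_allocatable() {
  local node="$1"
  oc get node "$node" -o jsonpath='{.status.allocatable}'
}

# Example (placeholder names):
#   show_pending_reason my-app-pod my-namespace
#   show_allocatable worker-0
```

If the requested Spyre resource name does not appear in the allocatable output, the pod cannot be scheduled on that node.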
Container status unknown
This state can occur when a Spyre resource is caught in a race condition between multiple resource pools, such as the default pool and the experimental mode's per-device allocation pool. Restarting the pod manually should return it to the Pending state until the device is released.
ERROR client-go Failed to update lock: resource name may not be empty
When the message "ERROR client-go Failed to update lock: resource name may not be empty" is present in the operator log file and is followed by an operator restart, it indicates that the operator (manager) process could not properly connect to the Kubernetes API server. The operator log file can be viewed using your choice of tools. The oc command that can be used in this case is:
oc logs spyre-operator-XXX
Substitute the XXX token with the pod hash code.
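To judge whether the error is sporadic or persistent, it can help to count how often it appears. The sketch below is a hypothetical helper, not part of the operator. It assumes the operator runs in the spyre-operator namespace used elsewhere in this section, and it accepts the standard `--previous` flag of `oc logs`, which is useful here because the error is followed by a restart and the message may only exist in the log of the previous container instance.

```shell
# Hypothetical helper: count occurrences of the lock error in the operator log.
# $1 = operator pod name
# $2 = namespace (assumed default: spyre-operator)
# $3 = optional extra flag, for example --previous
count_lock_errors() {
  local pod="$1" ns="${2:-spyre-operator}" extra="${3:-}"
  # ${extra:+"$extra"} passes the flag only when it is non-empty.
  oc logs "$pod" -n "$ns" ${extra:+"$extra"} \
    | grep -cF 'Failed to update lock: resource name may not be empty'
}

# Example (XXX is the pod hash code, as above):
#   count_lock_errors spyre-operator-XXX
#   count_lock_errors spyre-operator-XXX spyre-operator --previous
```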
Recommended actions
This is expected behavior for any operator that is implemented using the operator runtime. It indicates a network issue: communication between the node and the API server was interrupted. If the error is sporadic and the same error is present in the logs of other operators, it can safely be ignored.
If the error is persistent, verify that the network connection between the node running the operator pod and the API server is working and that there are no other underlying issues.
If the cluster has enough resources, consider increasing the operator deployment replica count to reduce service disruptions.
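One way to raise the replica count is `oc scale`, sketched below. The deployment name is a placeholder because this section does not state it, and the spyre-operator namespace and the default of 2 replicas are assumptions. If the operator was installed through the Operator Lifecycle Manager, a manual change like this may be reverted, so treat it as an illustration rather than the supported procedure.

```shell
# Hypothetical helper: scale the operator deployment.
# $1 = deployment name (placeholder, look it up with: oc get deployment -n spyre-operator)
# $2 = replica count (assumed default: 2)
# $3 = namespace (assumed default: spyre-operator)
scale_operator() {
  local deployment="$1" replicas="${2:-2}" ns="${3:-spyre-operator}"
  oc scale "deployment/$deployment" --replicas="$replicas" -n "$ns"
}

# Example (placeholder deployment name):
#   scale_operator my-operator-deployment 2
```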
MutatingAdmissionWebhook failed to complete mutation in xxs
This error occurs when WebhookGuardWaitTimeInSec is longer than the webhook timeout. WebhookGuardWaitTimeInSec was introduced to avoid errors caused by conflicts between pods. Whether any two pods A and B conflict is determined by the resources they request, as shown below:
| Pod A | Pod B | Conflict |
|---|---|---|
| spyre_vf | spyre_vf | No |
| spyre_vf | spyre_vf_(device) | Yes |
| spyre_vf_(device) | spyre_vf_(device) | No |
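The table can be read as a single rule: two pods conflict only when one requests from the default pool and the other from the per-device pool. The sketch below encodes that reading as a small shell function. It is an illustration of the table, not the webhook's actual implementation. It assumes that the rule is symmetric in A and B, that any two per-device requests count as non-conflicting, and that a domain prefix before a "/" in the resource name, if present, can be ignored.

```shell
# Map a requested resource name to its pool.
# Assumption: any domain prefix (text up to the last "/") is ignored.
pool_of() {
  local name="${1##*/}"
  case "$name" in
    spyre_vf)    echo default ;;
    spyre_vf_?*) echo per-device ;;
    *)           echo unknown ;;
  esac
}

# Print Yes, No, or Unknown for the resources requested by two pods.
# $1 = resource requested by pod A, $2 = resource requested by pod B
pods_conflict() {
  local a b
  a="$(pool_of "$1")"
  b="$(pool_of "$2")"
  if [ "$a" = unknown ] || [ "$b" = unknown ]; then
    echo Unknown
  elif [ "$a" != "$b" ]; then
    echo Yes   # one pod uses the default pool, the other the per-device pool
  else
    echo No    # both pods use the same kind of pool
  fi
}

# Example:
#   pods_conflict spyre_vf spyre_vf_0
```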
Actions
- Use Deployment instead of Pod to deploy your application (recommended).
- If you choose to use Pod, manually retry the deployment after a few minutes.
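If you deploy a bare Pod, the manual retry can be scripted. The following is a minimal sketch of a retry loop around `oc apply`. The function name, the manifest file name, and the defaults of 5 attempts with 120 seconds between them are assumptions chosen for illustration and are not values from the operator.

```shell
# Hypothetical helper: re-submit a Pod manifest until it is accepted.
# $1 = manifest file
# $2 = maximum number of attempts (assumed default: 5)
# $3 = seconds to wait between attempts (assumed default: 120)
retry_apply() {
  local manifest="$1" attempts="${2:-5}" delay="${3:-120}" i=1
  while [ "$i" -le "$attempts" ]; do
    if oc apply -f "$manifest"; then
      return 0
    fi
    echo "attempt $i of $attempts failed" >&2
    # Sleep only if another attempt follows.
    if [ "$i" -lt "$attempts" ]; then
      sleep "$delay"
    fi
    i=$((i + 1))
  done
  return 1
}

# Example (placeholder file name):
#   retry_apply my-pod.yaml
```

A Deployment makes this loop unnecessary because its controller re-creates the pod on its own, which is why it is the recommended option.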
SpyreClusterPolicy healthChecker pods in CrashLoopBackOff
After applying the SpyreClusterPolicy (version 1.2.0), the policy remains in a Not Ready state.
The following symptoms are observed:
- spyre-health-checker-* pods are in CrashLoopBackOff in the spyre-operator namespace.
- The cluster policy does not complete reconciliation.
- Error logs from the healthChecker pod:
ERROR server/server.go:88 failed to listen: listen unix /usr/local/etc/device-plugins/spyre-sockets/health-checker.sock: bind: permission denied
FATAL health-checker/main.go:47 Error starting insecure gRPC Server: listen unix /usr/local/etc/device-plugins/spyre-sockets/health-checker.sock: bind: permission denied
Cause
The issue occurs due to insufficient permissions on the host path used for device plugin sockets:
/usr/local/etc/device-plugins/
The spyre-health-checker container attempts to create and bind a UNIX socket:
/usr/local/etc/device-plugins/spyre-sockets/health-checker.sock
However, without proper host-level permissions, the operation fails with:
bind: permission denied
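To confirm the cause on a specific node, you can inspect the directory from a debug pod. The sketch below is a hypothetical helper built on the standard `oc debug node/<name> -- chroot /host ...` pattern. The node name is a placeholder. The `-Z` option of `ls` is included on the assumption that the SELinux context is relevant, which this section does not state explicitly.

```shell
# Hypothetical helper: show owner, mode, and SELinux context of the
# device plugin socket directories on a node.
# $1 = node name
show_socket_dir_perms() {
  local node="$1"
  oc debug "node/$node" -- chroot /host ls -ldZ \
    /usr/local/etc/device-plugins \
    /usr/local/etc/device-plugins/spyre-sockets
}

# Example (placeholder node name):
#   show_socket_dir_perms worker-0
```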
Resolution
Apply the required MachineConfig to set appropriate permissions for the device plugin directory on cluster nodes.
Refer to IBM documentation: https://www.ibm.com/docs/en/rhocp-ibm-z?topic=operatorhub-configure-machineconfig#concept_jnv_fwm_lhc__title__4
Steps
- Apply the MachineConfig:
  oc apply -f <machineconfig>.yaml
  Example: oc apply -f 50-spyre-device-plugin-selinux-minimal.yaml
- Wait for the MachineConfigPool rollout to complete:
  oc get mcp
- Once the nodes are updated, delete the failing pods:
  oc delete pod -n spyre-operator -l app=spyre-health-checker
- Verify:
  - Pods transition to Running
  - SpyreClusterPolicy reaches Ready state
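The steps above can be chained into one sequence, sketched below under stated assumptions. The function name is hypothetical, the pool name "worker" and the 30 minute timeout are assumptions, and `oc wait` on the pool's Updated condition is used in place of polling `oc get mcp` by hand. A pool can still report Updated for a short time right after the MachineConfig is applied, so confirm the rollout with `oc get mcp` before relying on the result.

```shell
# Hypothetical helper: apply the MachineConfig, wait for the rollout,
# then delete the failing health checker pods so they are re-created.
# $1 = MachineConfig manifest file
# $2 = MachineConfigPool name (assumed default: worker)
fix_health_checker() {
  local manifest="$1" pool="${2:-worker}"
  oc apply -f "$manifest" || return 1
  # Block until the pool reports the Updated condition (nodes may reboot).
  oc wait "mcp/$pool" --for=condition=Updated --timeout=30m || return 1
  # Delete the failing pods, using the label selector from the steps above.
  oc delete pod -n spyre-operator -l app=spyre-health-checker || return 1
  # List the re-created pods so that you can check for Running.
  oc get pods -n spyre-operator -l app=spyre-health-checker
}

# Example:
#   fix_health_checker 50-spyre-device-plugin-selinux-minimal.yaml
```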
Example Outcome After Fix
- spyre-health-checker-* pods start successfully
- No permission errors in logs
- SpyreClusterPolicy status becomes Ready
Additional Notes
- This issue specifically impacts deployments where healthChecker is enabled in SpyreClusterPolicy.
- The same permission requirement may apply to other components using the device plugin socket directory.
- It is recommended to document this MachineConfig step and include it as part of pre-installation or post-installation validation.