Troubleshooting

This section lists issues that you might encounter while operating the Spyre Operator.

Known issues

Missing /etc/aiu/senlib_config.json

No predefined volume is mounted at the path /etc/aiu. This directory is reserved for the volume that the Spyre device plugin mounts, so a pod must not mount its own volume there.

For example, the following volume mount must be removed:

volumeMounts:
- mountPath: /etc/aiu
  name: config
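
To locate such a mount, a manifest can be scanned for the reserved path. The following is a minimal sketch, not part of the operator tooling: the function name `find_aiu_mounts` is hypothetical, and it assumes the manifest is plain YAML with one `mountPath` key per line.

```shell
# Hypothetical helper: read a YAML manifest from stdin and print, with line
# numbers, every line that mounts a volume exactly at /etc/aiu.
# Exit status follows grep: 0 if a match is found, 1 otherwise.
find_aiu_mounts() {
  grep -nE "mountPath:[[:space:]]*[\"']?/etc/aiu/?[\"']?[[:space:]]*\$"
}
```

For example, `find_aiu_mounts < pod.yaml` checks a local manifest, and `oc get pod <pod-name> -o yaml | find_aiu_mounts` checks a pod that is already deployed.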

Pod is not scheduled as expected (Pending)

Check the requested resource name, especially for the experimental per-device allocation pool.
Confirm the status of SpyreNodeState and the capacity and allocatable values of the node.
Run the following command to check the allocatable resources of the node:
```
oc describe node <worker-node-name> | grep Allocatable -A11
```
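
The fixed line count in `-A11` can cut the list short or run past it, depending on how many resources the node reports. As an alternative, the following sketch prints the whole Allocatable section regardless of its length. The function name `allocatable_section` is hypothetical, and the sketch assumes the usual `oc describe node` layout in which section headings start in the first column and their entries are indented.

```shell
# Hypothetical helper: read `oc describe node` output from stdin and print
# the "Allocatable:" heading plus every indented line that follows it.
allocatable_section() {
  awk '/^Allocatable:/ { show = 1; print; next }   # start of the section
       show && /^[^[:space:]]/ { exit }            # next top-level heading
       show { print }'                             # indented section entries
}
```

Usage: `oc describe node <worker-node-name> | allocatable_section`.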

Container status unknown

This state can occur when a Spyre resource is caught in a race condition between multiple resource pools, such as the default pool and the per-device allocation pool of experimental mode. Restarting the pod manually should return it to the Pending state until the device is released.

ERROR client-go Failed to update lock: resource name may not be empty

When the message "ERROR client-go Failed to update lock: resource name may not be empty" appears in the operator log and is followed by an operator restart, it indicates that the operator (manager) process could not connect properly to the Kubernetes API server. You can view the operator log with the tool of your choice. With oc, the command is: oc logs spyre-operator-XXX, where XXX is the hash code of the pod.

Recommended actions

This behavior is expected for any operator that is implemented with the operator runtime. It indicates a network issue: communication between the node and the API server was interrupted. If the error is sporadic and the same error appears in the logs of other operators, it can safely be ignored.

If the error is persistent, verify that the network connection between the node running the operator pod and the API server is working and that there are no other underlying issues.

If the cluster has enough resources, consider increasing the replica count of the operator deployment to reduce service disruptions.
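
One way to judge whether the error is sporadic or persistent is to count how often it appears in a log. The following is a minimal sketch; the function name `count_lock_errors` is hypothetical, and what count qualifies as persistent is left to your judgment.

```shell
# Hypothetical helper: read a log from stdin and print the number of lines
# that contain the lock-update error. Always exits with status 0, even when
# the count is 0 (grep -c alone would exit with status 1 in that case).
count_lock_errors() {
  grep -cF 'Failed to update lock: resource name may not be empty' || true
}
```

Usage: `oc logs spyre-operator-XXX | count_lock_errors`, with XXX replaced by the hash code of the pod. Running the same count against the logs of other operators shows whether the error is specific to the Spyre Operator.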

MutatingAdmissionWebhook failed to complete mutation in xxs

This error occurs when the accumulated WebhookGuardWaitTimeInSec exceeds the webhook timeout. WebhookGuardWaitTimeInSec was introduced to avoid errors caused by conflicting pods. Whether any two pods, Pod A and Pod B, conflict is determined by the resources they request, as shown below:

Pod A	Pod B	Conflict
spyre_vf	spyre_vf	No
spyre_vf	spyre_vf_(device)	Yes
spyre_vf_(device)	spyre_vf_(device)	No
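
The table can be read as a single rule: two pods conflict only when one requests the default pool resource and the other requests a per-device resource. The following sketch expresses that rule. It assumes that per-device resource names always start with `spyre_vf_` and that the rule is symmetric in Pod A and Pod B; the function names are hypothetical and are not part of the operator.

```shell
# Hypothetical helper: classify a resource name from the table.
# Prints "default" for spyre_vf, "device" for spyre_vf_<device> (assumed
# prefix match), and "unknown" for anything else.
pool_kind() {
  case "$1" in
    spyre_vf)    echo default ;;
    spyre_vf_?*) echo device ;;
    *)           echo unknown ;;
  esac
}

# Print "Yes" if the two requested resources conflict, "No" if they do not.
# Prints "unknown" and returns status 2 for a name the table does not cover.
pods_conflict() {
  kind_a="$(pool_kind "$1")"
  kind_b="$(pool_kind "$2")"
  if [ "$kind_a" = unknown ] || [ "$kind_b" = unknown ]; then
    echo unknown
    return 2
  fi
  # Pods conflict only when they draw from different kinds of pool.
  if [ "$kind_a" != "$kind_b" ]; then echo Yes; else echo No; fi
}
```

Usage: `pods_conflict spyre_vf spyre_vf_<device>`, with `<device>` replaced by the device suffix that your pod requests.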

Actions

Use a Deployment instead of a Pod to deploy your application (recommended).
If you choose to use a Pod, retry the deployment manually after a few minutes.

SpyreClusterPolicy healthChecker pods in CrashLoopBackOff

After applying the SpyreClusterPolicy (version 1.2.0), the policy remains in a Not Ready state.

The following symptoms are observed:

spyre-health-checker-* pods are in CrashLoopBackOff in the spyre-operator namespace.
The cluster policy does not complete reconciliation.

Error logs from healthChecker pod

ERROR server/server.go:88 failed to listen: listen unix /usr/local/etc/device-plugins/spyre-sockets/health-checker.sock: bind: permission denied

FATAL health-checker/main.go:47 Error starting insecure gRPC Server:
listen unix /usr/local/etc/device-plugins/spyre-sockets/health-checker.sock: bind: permission denied

Cause

The issue occurs due to insufficient permissions on the host path used for device plugin sockets:

/usr/local/etc/device-plugins/

The spyre-health-checker container attempts to create and bind a UNIX socket:

/usr/local/etc/device-plugins/spyre-sockets/health-checker.sock

However, without proper host-level permissions, the operation fails with:

bind: permission denied

Resolution

Apply the required MachineConfig to set appropriate permissions for the device plugin directory on cluster nodes.

Refer to IBM documentation: https://www.ibm.com/docs/en/rhocp-ibm-z?topic=operatorhub-configure-machineconfig#concept_jnv_fwm_lhc__title__4

Steps
1. Apply the MachineConfig:
```
oc apply -f <machineconfig>.yaml
```
For example:
```
oc apply -f 50-spyre-device-plugin-selinux-minimal.yaml
```
2. Wait for the MachineConfigPool rollout to complete:
```
oc get mcp
```
3. Once the nodes are updated, delete the failing pods:
```
oc delete pod -n spyre-operator -l app=spyre-health-checker
```
4. Verify:
  - Pods transition to Running
  - SpyreClusterPolicy reaches Ready state
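
The rollout check in the steps above can be scripted instead of read by eye. The following is a minimal sketch; the function name `pending_pools` is hypothetical, and it assumes that the `oc get mcp` output has a header row with an UPDATED column and that every column of every row is populated.

```shell
# Hypothetical helper: read `oc get mcp` output from stdin and print the name
# of every pool whose UPDATED column is not "True".
# Exit status: 0 if all pools are updated, 1 if any pool is still pending,
# 2 if the header row has no UPDATED column.
pending_pools() {
  awk 'NR == 1 { for (i = 1; i <= NF; i++) if ($i == "UPDATED") col = i; next }
       col && $col != "True" { print $1; pending = 1 }
       END { if (!col) exit 2; exit pending + 0 }'
}
```

Usage: `oc get mcp | pending_pools`. Empty output with exit status 0 means that every pool reports UPDATED as True, so the failing pods can be deleted.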

Example Outcome After Fix

spyre-health-checker-* pods start successfully
No permission errors in logs
SpyreClusterPolicy status becomes Ready

Additional Notes

This issue specifically impacts deployments where healthChecker is enabled in SpyreClusterPolicy.
The same permission requirement may apply to other components using the device plugin socket directory.
It is recommended to include this MachineConfig step in the pre-installation or post-installation validation.