Serviceability known issues

List of troubleshooting information and known issues for Events, Log collection, and Call Home.

Events that go through the trap server do not get created in IBM Storage Fusion

Problem statement
Sometimes, events that go through the trap server do not get created in IBM Storage Fusion.
Resolution
If you see details similar to the following in the trapserver logs, follow these steps as a workaround:
Listening for traps on 0.0.0.0:31620
 [fd8c:215d:178e:c0de:a94:efff:fef3:35cd]:41493 byte array parsed in is not a sequence
 [fd8c:215d:178e:c0de:a94:efff:fef3:3561]:35748 failed to parse sequence length0
 2021/10/13 14:20:48.124 [D] Recovered in resetServer, r=runtime error: slice bounds out of range [:-2592903709665718705]
Listening for traps on 0.0.0.0:31620
 [fd8c:215d:178e:c0de:a94:efff:fef3:3555]:58314 failed to parse sequence length0
 [fd8c:215d:178e:c0de:a94:efff:fef3:35cd]:34801 byte array parsed in is not a sequence
 [fd8c:215d:178e:c0de:a94:efff:fef3:35cd]:52070 byte array parsed in is not a sequence
 [fd8c:215d:178e:c0de:a94:efff:fef3:3555]:47601 parse error
 [fd8c:215d:178e:c0de:a94:efff:fef3:3585]:56764 length parse error @ idx 2
 [fd8c:215d:178e:c0de:a94:efff:fef3:3399]:44497 failed to parse sequence length0
 [fd8c:215d:178e:c0de:a94:efff:fef3:35cd]:57884 failed to parse sequence length0
 [fd8c:215d:178e:c0de:a94:efff:fef3:3585]:59094 length parse error @ idx 2
 [fd8c:215d:178e:c0de:a94:efff:fef3:3555]:40214 parse error
 [fd8c:215d:178e:c0de:a94:efff:fef3:3399]:37409 byte array parsed in is not a sequence
  • Restart the trapserver pod in the ibm-spectrum-fusion-ns namespace (a command-line sketch follows this list).
  • Run the following command to delete all the ComputeMonitoring CRs that are present in the ibm-spectrum-fusion-ns namespace.
    oc delete cmo --all -n ibm-spectrum-fusion-ns
    Wait for the custom resource instances to get recreated.
  • Restart the BMC of all the compute nodes by running the resetsp command from the BMC command line.
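
If you prefer the command line, the following is a minimal sketch for restarting the trapserver pod by deleting it so that its deployment re-creates it. The pod name is a placeholder; confirm it first with oc get pods.
  oc get pods -n ibm-spectrum-fusion-ns | grep trapserver
  oc delete pod <trapserver-pod-name> -n ibm-spectrum-fusion-ns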

Log status shows completed for a downloaded log file of 0 bytes

Problem statement
The log status shows completed for a downloaded log file of 0 bytes. The status should be failed when logs are not collected, but instead it shows completed with a 0-byte file size.
Cause
  1. Pods get evicted because of a lack of storage space (a read-only check is sketched after this list).
  2. Ongoing jobs wait for storage space; if a job waits too long, it is marked as stale and automatically cleaned up.
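
To confirm that evictions are occurring, you can list failed pods in the IBM Storage Fusion namespace. This is a minimal read-only sketch; the grep filter is only illustrative.
  oc get pods -n ibm-spectrum-fusion-ns --field-selector=status.phase=Failed
  oc get pods -n ibm-spectrum-fusion-ns | grep -i evicted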
Resolution
There are two methods available to resolve this issue:
Method 1
  1. Delete the unnecessary logs through the IBM Storage Fusion HCI System user interface to free up storage space. For more information, see Delete a log package.
  2. Delete the ongoing jobs that are taking a long time and retry the jobs.
Method 2
  • You can increase the log collector PVC size by following these steps (a command-line alternative is sketched after this list):
    1. Log in to the Red Hat® OpenShift® Container Platform web console.
    2. Go to Storage > PersistentVolumeClaims.

      The PersistentVolumeClaims page gets displayed.

    3. Select the log collector PVC that you want to modify.
    4. Click the ellipsis icon and select Expand PVC.

      The Expand PersistentVolumeClaims page gets displayed.

    5. Set the new PVC size that you want and click Expand.
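
If you prefer to expand the PVC from the command line instead of the web console, the following is a minimal sketch. The PVC name and target size are placeholders, and the underlying storage class must allow volume expansion.
  oc get pvc -n ibm-spectrum-fusion-ns
  oc patch pvc <logcollector-pvc-name> -n ibm-spectrum-fusion-ns -p '{"spec":{"resources":{"requests":{"storage":"<new-size>"}}}}'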

Log collection is not working

Problem statement
If you are not able to see the collected logs in the user interface, or log collection jobs are intermittently not visible in the user interface, the cause might be incorrect seLinuxOptions set in the pods.
Resolution
Follow the steps to resolve this issue:
  1. Run the following command to update the seLinuxOptions of the log collector deployment, along with the fsGroupChangePolicy, from the annotation on the log collector namespace.
    oc get namespace ibm-spectrum-fusion-ns -o jsonpath='{.metadata.annotations.openshift\.io/sa\.scc\.mcs}' | xargs -I {} oc patch deployment logcollector -n ibm-spectrum-fusion-ns --type='merge' -p "{\"spec\": {\"template\": {\"spec\": {\"securityContext\": {\"seLinuxOptions\": {\"level\": \"{}\"}, \"fsGroupChangePolicy\": \"OnRootMismatch\"}}}}}"
  2. Verify that the log collector deployment is updated with the seLinuxOptions that are specified in the IBM Storage Fusion namespace annotation openshift.io/sa.scc.mcs (a comparison sketch follows these steps).
  3. The pods restart automatically, and the log collector works as expected.
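
To compare the two values from the command line, the following is a minimal read-only sketch; it prints the namespace annotation and the level that is set on the deployment, which should match.
  oc get namespace ibm-spectrum-fusion-ns -o jsonpath='{.metadata.annotations.openshift\.io/sa\.scc\.mcs}'
  oc get deployment logcollector -n ibm-spectrum-fusion-ns -o jsonpath='{.spec.template.spec.securityContext.seLinuxOptions.level}'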

Known issues

  • In the Logs page of the IBM Storage Fusion user interface, if you select the System Health Check option during log collection, the collection takes longer than usual to complete. The log collection process might take 20 to 25 minutes. In some cases, this is because many directories and multiple IMM log files are collected.
  • For IBM Storage Scale warning events, the fixed status might be incorrect.
  • Sometimes, in the Events page, the Source column of the events list might be incorrect.
  • Events are not created for IBM Storage Scale events whose entity_name fields do not conform to the URL path component pattern (^(([A-Za-z0-9][-A-Za-z0-9_.]*)?[A-Za-z0-9])?$).
  • Call Home ticket creation can show a failed state when the system is not entitled or when the Call Home server does not respond with a ticket number. Click Verify connection to test the connection.
  • To prevent a deadlock condition, the event manager must be restarted every 24 hours. Run the following command to restart the event manager:
    oc rollout restart deployment eventmanager
  • Sometimes, the automatic upload of logs might not happen and Call Home fails. In such cases, manually upload the logs.
  • If you power off a control node on which the trap server pod is running, the migration of the trap server pod to a different node fails. As a result, the trap server pod might get stuck in the terminating state, and some SNMP trap events might not be captured and displayed.
  • In rare scenarios, the Events page comes up empty because of a backend error under load (a 504 Gateway Timeout error). Generally, the page loads automatically after some time as the system recovers, so try again later.
  • One possible reason for a log collection failure is that the log collector pod runs out of space. This can happen when too many logs are collected in a short period. The solution is to delete already collected logs that are no longer required from the IBM Storage Fusion user interface.
    Workaround
    • From the title bar, click the help icon and select Support logs.
    • Identify and delete logs through the ellipsis menu on the Support logs page.
  • If a Data Foundation log collection request, or requests for multiple log packages that are collected simultaneously, fail or are stuck for more than 6 hours, update the pod memory limit in the log collector deployment from 6000MiB to 12000MiB at spec.containers[0].resources.limits.memory, and then attempt the log collections again one by one (a command-line sketch follows this list).
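
The following is a minimal sketch for raising that limit from the command line. It assumes the logcollector deployment in the ibm-spectrum-fusion-ns namespace, as used earlier in this section; in the deployment object the containers path sits under the pod template (spec.template.spec.containers[0]), and the value uses the Kubernetes Mi unit form of 12000MiB.
  oc patch deployment logcollector -n ibm-spectrum-fusion-ns --type='json' -p '[{"op": "replace", "path": "/spec/template/spec/containers/0/resources/limits/memory", "value": "12000Mi"}]'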