IBM Storage Fusion Data Cataloging known issues
List of all troubleshooting and known issues that exist in version 2.1 of Data Cataloging.
Known issues and troubleshooting
The following known issues exist in the Data Cataloging service, with workarounds included wherever possible. If you come across an issue that cannot be solved by using these instructions, contact IBM support.
- Data Cataloging service in Metro-DR setup shows in Degraded state
- COS connection reporting scan aborted due to inactivity
- Database connection issue after reboot
- Image pull error due to authentication failure
- Visual query builder search terms overrides SQL search when going into individual mode
- LDAPS configuration failing if dollar sign is in password
- Content search policy missing files
- REST API returns token with unprintable characters
- Querying available applications on Docker Hub is not working
- Running applications from the catalog
- Scale Live Events do not get populated due to the timestamp field value being invalid
- Adding S3 connection gives false negative
- When installation is at 80%, up to six pods might experience a Crash-loop Back-off error
- Scale datamover AFM and ILM capabilities not working properly due to SDK misleading function when deploying an application
- Policies are not finished, resulting in a hanging state
- Data Cataloging service goes into a degraded state after IBM Storage Fusion HCI System rack restart
Data Cataloging service in Metro-DR setup shows in Degraded state
- Diagnosis
  - The Data Cataloging service is in a degraded state.
  - Run the following command to check whether the isd-db2whrest pod is not ready:
    oc -n ibm-data-cataloging get pod -l role=db2whrest
  - Run the following command to check whether Db2 retries the network check and fails because of the timeout:
    oc -n ibm-data-cataloging logs -l type=engine --tail=100
    Example output:
    + timeout 1 tracepath -l 29 c-isd-db2u-1.c-isd-db2u-internal
    + [[ 17 -lt 120 ]]
    + (( n++ ))
    + echo 'Command failed. Attempt 18/120:'
    Command failed. Attempt 18/120:
- Resolution
  - Increase the time before the timeout, typically from 1 second to 3-5 seconds.
  - Modify the timeout from 1 to 3 in isd-db2u-0:
    oc -n ibm-data-cataloging exec c-isd-db2u-0 -- sudo sed -i 's/timeout 1 tracepath/timeout 3 tracepath/g' /db2u/scripts/include/common_functions.sh
  - Wait until the current attempt exceeds the predefined 120 retries. After it restarts, it picks up the updated value:
    oc -n ibm-data-cataloging logs -l type=engine --tail=50
  - Monitor db2whrest pod readiness:
    oc -n ibm-data-cataloging get pod -l role=db2whrest -w
COS connection reporting scan aborted due to inactivity
If a COS connection scan fails with the error “Scan aborted because of a long period of inactivity”, it can be resolved by editing the settings file connections/cos/scan/scanner-settings.json within the data PV and choosing a higher value for notifier_timeout than the default value of 120 seconds. The change is picked up on the next scan. No pod restart is required.
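A minimal sketch of the edited file, assuming notifier_timeout is a top-level setting expressed in seconds and that any other keys already present in scanner-settings.json are kept unchanged:
{
  "notifier_timeout": 300
}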
Database connection issue after reboot
An unexpected cluster update or node reboot can cause database connection issues. For the resolution, see the steps in Data Cataloging database schema job is not in a completed state during installation or upgrade.
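A quick way to confirm the symptom, assuming the service runs in the default ibm-data-cataloging namespace; pods that depend on the database, such as the isd-db2whrest pod, typically show as not ready after the reboot:
oc -n ibm-data-cataloging get pods
oc -n ibm-data-cataloging get pod -l role=db2whrest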
Image pull error due to authentication failure
If pods fail to pull images because of an authentication failure, recreate the image registry pull secret and link it to the affected service accounts:
oc delete secret image-registry-pull-secret
HOST=$(oc get route default-route -n openshift-image-registry --template='{{ .spec.host }}')
oc create secret docker-registry image-registry-pull-secret \
  --docker-server="${HOST}" \
  --docker-username=kubeadmin \
  --docker-password="$(oc whoami -t)"
for account in spectrum-discover-operator strimzi-cluster-operator spectrum-discover-ssl-zookepper spectrum-discover-sasl-zookeeper; do
  oc secrets link $account image-registry-pull-secret --for=pull
done
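To verify that each service account is linked to the new pull secret, the following sketch lists the image pull secrets for the accounts used above:
for account in spectrum-discover-operator strimzi-cluster-operator spectrum-discover-ssl-zookepper spectrum-discover-sasl-zookeeper; do
  echo "== $account =="
  oc get sa "$account" -o jsonpath='{.imagePullSecrets[*].name}{"\n"}'
done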
Visual query builder search terms overrides SQL search when going into individual mode
If a search is started in the visual query builder and then changed to SQL mode, the initial group search works as expected, but when expanded to individual records it uses the query builder terms as the base. The workaround is to clear the visual query before changing to the SQL query.
LDAPS configuration failing if dollar sign is in password
Currently, the dollar sign is not supported in passwords for LDAPS configuration. The workaround is to use a password without a dollar sign.
Content search policy missing files
If the data count is not as expected while running a policy, verify that the connection is active and rescan to ingest the latest data into Data Cataloging. After a successful upgrade of Data Cataloging, a rescan of existing connections is recommended.
REST API returns token with unprintable characters
A REST API call that uses the token fails with the following error:
$ curl -k -H "Authorization: Bearer ${TOKEN}" https://$SDHOST/policyengine/v1/tags
curl: (92) HTTP/2 stream 0 was not closed cleanly: PROTOCOL_ERROR (err 1)
This happens when the token is extracted with the following command, which leaves a trailing carriage-return character in the value:
TOKEN=$(curl -i -k https://$SDHOST/auth/v1/token -u "$SDUSER:$SDPSWD" | grep -i x-auth-token | awk '{print $2}')
The workaround is to strip the carriage return when extracting the token:
TOKEN=$(curl -i -k https://$SDHOST/auth/v1/token -u "$SDUSER:$SDPSWD" | grep -i x-auth-token | awk '{print $2}' | tr -d '\r')
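To check whether an existing token value still carries the stray character, a small sketch that prints the raw bytes; a trailing carriage return shows up as \r in the od output:
printf '%s' "$TOKEN" | od -c | tail -n 2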
Querying available applications on Docker Hub is not working
$ tcurl https://${OVA}/api/application/appcatalog/publicregistry | jq
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 78 100 78 0 0 135 0 --:--:-- --:--:-- --:--:-- 135
{
"success": "false",
"message": "Could not retrieve available applications."
}
To avoid this issue, open a browser and access the following URL: Docker Documentation.
The above link retrieves the list of Data Cataloging applications available in the public registry. The image name of the application that is selected from the query output can be used to create a JSON file with the information that is needed to run the application, as shown in the following: Spectrum Discover Documentation.
Running applications from the catalog
Currently, the REST API public registry endpoint that retrieves the list of available applications from Docker Hub is not working. For that reason, the Data Cataloging application catalog is only available in the following repository: Spectrum Discover App Catalog
Scale Live Events do not get populated due to the timestamp field value being invalid
The following commands gather the *.bad files and their matching *.log files from the bluadmin home directory into /tmp/output.zip for review:
cd /mnt/blumeta0/home/bluadmin
# List the rejected-record files and derive the matching log file names
sudo ls *.bad > /tmp/output.bad
sed 's/bad/log/' /tmp/output.bad > /tmp/output.log
# Archive both sets of files, then remove the temporary lists
sudo zip /tmp/output.zip -r . -i@/tmp/output.bad -i@/tmp/output.log
rm /tmp/output.bad
rm /tmp/output.log
cd /tmp
The workaround is to perform scheduled scans of the IBM Storage Scale connections so that all file changes are up to date.
Adding S3 connection gives false negative
When a connection of type S3 is added through the Data Cataloging user interface, an undefined error message is displayed.
Workaround:
Refreshing the browser removes the error message, and the connections table shows that the S3 connection was successful.
When installation is at 80%, up to six pods might experience a Crash-loop Back-off error
This issue happens when pods are waiting for the db-schema pod to finish the internal schema upgrades.
Workaround:
After the db-schema pod goes into the Running state, the affected pods go into the Running state after about six restarts, and the installation completes successfully. You can watch the recovery as shown below.
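A simple way to watch the pods recover, assuming the default ibm-data-cataloging namespace:
oc -n ibm-data-cataloging get pods -w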
Scale datamover AFM and ILM capabilities not working properly due to SDK misleading function when deploying an application
When an application is deployed, the Data Cataloging pods scaleafmdatamover and scaleilmdatamover might show errors in their logs.
For example:
2023-07-20 02:51:54,311 - ibm_spectrum_discover_application_sdk.ApplicationLib - INFO - Invoking conn manager at http://172.30.255.202:80/connmgr/v1/internal/connections
Traceback (most recent call last):
File "/application/ScaleAFMDataMover.py", line 1023, in
APPLICATION = ScaleAFMApplicationBase(REGISTRATION_INFO)
File "/application/ScaleAFMDataMover.py", line 112, in init
self.conn_details = self.get_connection_details()
File "/usr/local/lib/python3.9/site-packages/ibm_spectrum_discover_application_sdk/ApplicationLib.py", line 492, in get_connection_details
raise Exception(err)
UnboundLocalError: local variable 'err' referenced before assignment
2023-07-20 02:51:54,367 INFO exited: scaleafm-datamover (exit status 1; not expected)
2023-07-20 02:51:55,368 INFO gave up: scaleafm-datamover entered FATAL state, too many start retries too quickly
- Cause
- An SDK bug when deploying applications on the Data Cataloging service causes deployed applications to misbehave and their pods to show errors while in an incorrect state.
- Resolution
- Once this behavior is identified, follow these steps to resolve the issue:
- Verify that the connmgr API is running and accessible through HTTP (a curl to the connmgr service is enough; see the sketch after this list).
- Check the application pod and delete it so that it is redeployed.
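A minimal sketch of both steps, assuming the service runs in the ibm-data-cataloging namespace; the connmgr URL is the one shown in the application log above, and <connmgr-service-ip> and <application-pod> are placeholders to replace with the actual values:
# Any HTTP status code in the response (even 4xx) confirms the connmgr API is reachable
curl -s -o /dev/null -w "%{http_code}\n" http://<connmgr-service-ip>:80/connmgr/v1/internal/connections
# Delete the affected application pod so that it is redeployed
oc -n ibm-data-cataloging delete pod <application-pod>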
Policies are not finished, resulting in a hanging state
The policies are not finished, which results in a hanging state.
- Cause
- Inconsistent behavior in policies results in policies that never reach a finished status. The issue is still under investigation.
- Resolution
- Identify the policy engine pod and delete it; OpenShift® Container Platform creates another pod automatically, and policies are executed and finished properly after the pod is re-created. Example:
oc -n ibm-data-cataloging delete pod -l role=policyengine
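To confirm that the replacement pod comes up and becomes ready, watch it with the same label selector used in the delete command:
oc -n ibm-data-cataloging get pod -l role=policyengine -w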
Data Cataloging service goes into a degraded state after IBM Storage Fusion HCI System rack restart
The Data Cataloging service goes into a degraded state after some of the nodes are restarted or after an IBM Storage Fusion HCI System rack restart. Several pods stay pending with errors such as Unable to attach or mount volumes: unmounted volumes=[spectrum-discover-db2wh], unattached volumes=[], failed to process volumes=[]: timed out waiting for the condition and MountVolume.SetUp failed for volume "xxx" : rpc error: code = Internal desc = staging path yyy for volume zzz is not a mountpoint.
- Resolution
- Complete the following steps to resolve this issue:
- Run the following commands to make each compute node unschedulable and drain it, one node after another:
oc adm cordon worker4.fusion-test-zlinux.cp.fyre.ibm.com
oc adm drain worker4.fusion-test-zlinux.cp.fyre.ibm.com --ignore-daemonsets --force --delete-emptydir-data
- After the node is drained, make it schedulable again (for example, with the uncordon command shown after this list) and then proceed to the next node with the same process. This removes stale directory entries from nodes that are detected as mount points.
- The issue resolves automatically and the Data Cataloging service returns to a healthy state after all the nodes are back up.
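A minimal sketch of re-enabling scheduling on the drained node and confirming that it is Ready again, using the same example node name as above:
oc adm uncordon worker4.fusion-test-zlinux.cp.fyre.ibm.com
oc get node worker4.fusion-test-zlinux.cp.fyre.ibm.com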