IBM Storage Fusion Data Cataloging known issues

List of all known issues in the Data Cataloging service along with their resolutions.

The following known issues exist in the Data Cataloging service, with workarounds included wherever possible. If you come across an issue that cannot be solved by using these instructions, contact IBM support.

Data Cataloging service in Metro-DR setup shows a Degraded state

Diagnosis
  1. The Data Cataloging service is in a degraded state.
  2. Run the following command to check whether the isd-db2whrest pod is not ready:
    oc -n ibm-data-cataloging get pod -l role=db2whrest
  3. Run the following command to check whether Db2 retries the network check and fails because of the timeout:
    oc -n ibm-data-cataloging logs -l type=engine --tail=100
    Example output:
    + timeout 1 tracepath -l 29 c-isd-db2u-1.c-isd-db2u-internal
    + [[ 17 -lt 120 ]]
    + (( n++ ))
    + echo 'Command failed. Attempt 18/120:'
    Command failed. Attempt 18/120:
Resolution
Increase the timeout, typically from 1 second to 3 to 5 seconds.
  1. Modify the timeout from 1 to 3 seconds in c-isd-db2u-0:
    oc -n ibm-data-cataloging exec c-isd-db2u-0 -- sudo sed -i 's/timeout 1 tracepath/timeout 3 tracepath/g' /db2u/scripts/include/common_functions.sh
    
  2. Wait until the current attempt count exceeds the predefined 120 retries. After the check restarts, it picks up the updated value:
    oc -n ibm-data-cataloging logs -l type=engine --tail=50
    
  3. Monitor db2whrest pod readiness:
    oc -n ibm-data-cataloging get pod -l role=db2whrest -w
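The log excerpt above comes from a simple retry loop inside the Db2 engine scripts. As an illustration only (the command, attempt limit, and delay below are stand-ins, not the real script), the pattern looks roughly like this:

```shell
# Sketch of the retry pattern behind the "Attempt n/120" log lines.
# "false" stands in for the real check ("timeout 3 tracepath -l 29 <peer>"),
# and MAX_RETRIES is 5 instead of 120 so the sketch finishes quickly.
MAX_RETRIES=5
n=0
until timeout 3 false; do
    n=$((n + 1))
    echo "Command failed. Attempt ${n}/${MAX_RETRIES}:"
    if [ "$n" -ge "$MAX_RETRIES" ]; then
        break
    fi
    sleep 0.1
done
echo "Stopped after ${n} attempts"
```

Raising the timeout value in step 1 gives tracepath more time per attempt, so transient Metro-DR network latency no longer exhausts the retry budget.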

COS connection reporting scan aborted due to inactivity

If a COS connection scan fails with the error “Scan aborted because of a long period of inactivity”, resolve it by editing the settings file connections/cos/scan/scanner-settings.json within the data PV and setting a higher notifier_timeout value than the default of 120 seconds. The change is picked up on the next scan; no pod restart is required.
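For example, assuming the file is a JSON document with a top-level notifier_timeout field (the surrounding fields are not shown in this sketch), the edit can be made with sed:

```shell
# Sketch only: the real file is connections/cos/scan/scanner-settings.json
# inside the data PV; a minimal stand-in copy is used here.
cat > /tmp/scanner-settings.json <<'EOF'
{
  "notifier_timeout": 120
}
EOF

# Raise notifier_timeout from the 120-second default to 300 seconds.
sed -i 's/"notifier_timeout": *120/"notifier_timeout": 300/' /tmp/scanner-settings.json
grep '"notifier_timeout"' /tmp/scanner-settings.json
```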

Database connection issue after reboot

An unexpected cluster update or node reboot can cause database connection issues. For resolution, see the steps in the Data Cataloging database schema job is not in a completed state during installation or upgrade section.

Image pull error due to authentication failure

Problem statement
The OpenShift® Container Platform login token expires occasionally, and because this token is used as the container image registry password, its expiry breaks the service account access to the registry.
Resolution
If a pod is failing to pull an image from the registry with an authentication error, then re-create the image-registry-pull-secret and relink the service accounts to the new secret:
oc delete secret image-registry-pull-secret
HOST=$(oc get route default-route -n openshift-image-registry --template='{{ .spec.host }}')
oc create secret docker-registry image-registry-pull-secret \
    --docker-server="${HOST}" \
    --docker-username=kubeadmin \
    --docker-password="$(oc whoami -t)"
for account in spectrum-discover-operator strimzi-cluster-operator spectrum-discover-ssl-zookepper spectrum-discover-sasl-zookeeper; do oc secrets link $account image-registry-pull-secret --for=pull; done
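For reference, oc create secret docker-registry packages the host, username, and password into a .dockerconfigjson payload. The following is a local sketch of that payload; all values below are placeholders, not real cluster credentials:

```shell
# Placeholder values; a real cluster supplies these via the registry route
# and "oc whoami -t".
HOST=default-route-openshift-image-registry.apps.example.com
USERNAME=kubeadmin
PASSWORD=sha256~example-token

# The "auth" field is base64 of "username:password".
AUTH=$(printf '%s:%s' "$USERNAME" "$PASSWORD" | base64 | tr -d '\n')
cat > /tmp/dockerconfig.json <<EOF
{"auths":{"${HOST}":{"username":"${USERNAME}","password":"${PASSWORD}","auth":"${AUTH}"}}}
EOF
```

When the token expires, the password and auth fields become stale, which is why the secret must be re-created with a fresh oc whoami -t value.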

Visual query builder search terms override SQL search when switching to individual mode

If a search is started in the query builder and then changed to SQL mode, the initial group search works as expected, but when it is expanded to individual records, the query builder terms are used as the base. As a workaround, clear the visual query before changing to the SQL query.

LDAPS configuration failing if dollar sign is in password

Currently, the dollar sign is not supported in passwords for LDAPS configuration. As a workaround, create a password without a dollar sign.

Content search policy missing files

If the expected data count is incorrect while running a policy, verify that the connection is active, and rescan to ingest the latest data into Data Cataloging. After a successful upgrade of Data Cataloging, a rescan of existing connections is recommended.

REST API returns token with unprintable characters

A carriage return (\r) is included at the end of HTTP response headers due to an issue with curl. This has been known to occasionally break scripts that use an auth token from the Data Cataloging appliance, as shown here:
$ curl -k -H "Authorization: Bearer ${TOKEN}" https://$SDHOST/policyengine/v1/tags
curl: (92) HTTP/2 stream 0 was not closed cleanly: PROTOCOL_ERROR (err 1)
As such, it is recommended to filter out the \r character. If you have a line like the following in bash:
TOKEN=$(curl -i -k https://$SDHOST/auth/v1/token -u "$SDUSER:$SDPSWD" | grep -i x-auth-token | awk '{print $2}')
Simply add | tr -d '\r' at the end to avoid running into this issue:
TOKEN=$(curl -i -k https://$SDHOST/auth/v1/token -u "$SDUSER:$SDPSWD" | grep -i x-auth-token | awk '{print $2}' | tr -d '\r')
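The effect of the added tr -d '\r' can be reproduced locally. This sketch simulates a curl -i header line; the token value is made up:

```shell
# Simulate the header line "X-Auth-Token: abc123\r" as curl -i returns it.
# Command substitution strips trailing newlines but not the carriage return,
# so RAW still carries the "\r" while CLEAN does not.
RAW=$(printf 'X-Auth-Token: abc123\r\n' | grep -i x-auth-token | awk '{print $2}')
CLEAN=$(printf '%s' "$RAW" | tr -d '\r')

printf '%s' "$CLEAN"
```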

Querying available applications on Docker Hub is not working

When you retrieve the list of available applications on Docker Hub by using the public registry endpoint, the query retrieves an empty response:
$ tcurl https://${OVA}/api/application/appcatalog/publicregistry | jq
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100    78  100    78    0     0    135      0 --:--:-- --:--:-- --:--:--   135
{
  "success": "false",
  "message": "Could not retrieve available applications."
}

To avoid this issue, open a browser and access the following URL: Docker Documentation.

The preceding link retrieves the list of Data Cataloging applications available in the public registry. The image name of the application that is selected from the query output can be used to create a JSON file with the information that is needed to run the application, as shown in: Spectrum Discover Documentation.

Running applications from the catalog

Currently, the REST API public registry endpoint that retrieves the list of available applications from Docker Hub is not working. For that reason, the Data Cataloging application catalog is only available in the following repository: Spectrum Discover App Catalog

Scale Live Events do not get populated due to the timestamp field value being invalid

Problem statement
Live events for IBM Storage Scale connections do not work as expected. After the initial scan of an IBM Storage Scale connection, only file deletions result in a live update of the files that were discovered by the scan. If a file is modified or added, the live update fails, and there is no change reflected in the Data Cataloging product. This error is recorded by the Db2 product and can be surfaced by extracting the bad updates and the corresponding log files from one of the Db2 pods. The following script, executed within one of the Db2 pods, extracts these errors for analysis.
cd /mnt/blumeta0/home/bluadmin
# List the failed-update (.bad) files and derive the matching .log file names
sudo ls *.bad > /tmp/output.bad
sed 's/bad/log/' /tmp/output.bad > /tmp/output.log
# Package both sets of files for analysis
sudo zip /tmp/output.zip -r . -i@/tmp/output.bad -i@/tmp/output.log
rm /tmp/output.bad
rm /tmp/output.log
cd /tmp
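The name rewrite in the script above pairs each .bad file with its .log counterpart. The following is a local simulation with made-up file names; the real files live under /mnt/blumeta0/home/bluadmin inside the Db2 pod:

```shell
# Create stand-in .bad/.log pairs in a scratch directory.
WORKDIR=$(mktemp -d)
cd "$WORKDIR"
touch update1.bad update1.log update2.bad update2.log

# Same pairing logic as the extraction script:
ls *.bad > /tmp/output.bad                           # failed-update files
sed 's/bad/log/' /tmp/output.bad > /tmp/output.log   # matching log files
cat /tmp/output.log
```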
Resolution
The workaround is to perform scheduled scans of the IBM Storage Scale connections so that all file changes are kept up to date.

Adding S3 connection gives false negative

Problem statement
When a connection of type S3 is added through the Data Cataloging user interface, an undefined error message is displayed.
Resolution
Refreshing the browser removes the error message, and the connections table shows that the S3 connection was successful.

When installation is at 80%, up to six pods might experience a CrashLoopBackOff error

Problem statement
This issue happens when pods are waiting for the db-schema pod to finish the internal schema upgrades. The db-schema pod fails many times because it cannot connect to the Db2 database.
Resolution
When the db-schema pod goes into the Running state after about six restarts, the remaining pods also go into the Running state, and the installation completes successfully.

If the db-schema pod is still failing after six restarts, reinstall the Data Cataloging service, because Db2 encountered a problem during installation and cannot start. For steps to uninstall the Data Cataloging service, see Uninstalling Data Cataloging.

Also, make sure that your cluster has enough resources.

Scale datamover AFM and ILM capabilities not working properly due to an SDK bug when deploying an application

Problem statement
When the Data Cataloging service is deployed, the scaleafmdatamover and scaleilmdatamover pods might show an error in their logs when an application is deployed.

For example:

2023-07-20 02:51:54,311 - ibm_spectrum_discover_application_sdk.ApplicationLib - INFO - Invoking conn manager at http://172.30.255.202:80/connmgr/v1/internal/connections
Traceback (most recent call last):
File "/application/ScaleAFMDataMover.py", line 1023, in <module>
APPLICATION = ScaleAFMApplicationBase(REGISTRATION_INFO)
File "/application/ScaleAFMDataMover.py", line 112, in __init__
self.conn_details = self.get_connection_details()
File "/usr/local/lib/python3.9/site-packages/ibm_spectrum_discover_application_sdk/ApplicationLib.py", line 492, in get_connection_details
raise Exception(err)
UnboundLocalError: local variable 'err' referenced before assignment
2023-07-20 02:51:54,367 INFO exited: scaleafm-datamover (exit status 1; not expected)
2023-07-20 02:51:55,368 INFO gave up: scaleafm-datamover entered FATAL state, too many start retries too quickly
Cause
An SDK bug when deploying applications on the Data Cataloging service causes deployed applications to misbehave and pods in an incorrect state to show errors.
Resolution
After you identify this behavior, follow these steps to resolve the issue:
  1. Verify that the connmgr API is running and accessible through HTTP (a curl to the connmgr service is enough).
  2. Check the application pod and delete it so that it is redeployed.

Policies are not finished, resulting in a hanging state

Problem statement
The policies do not finish, which results in a hanging state.
Cause
Policies exhibit inconsistent behavior and never reach a finished status. The issue is still under investigation.
Resolution
Identify the policy engine pod and delete it; OpenShift Container Platform creates another pod automatically, and policies are executed and finish properly after the pod re-creation.
Example:
oc -n ibm-data-cataloging delete pod -l role=policyengine

Data Cataloging service goes degraded state after IBM Storage Fusion HCI System rack restart

Problem statement
The Data Cataloging service goes degraded after some of the nodes are restarted or after an IBM Storage Fusion HCI System rack restart. Several pods go into a Pending state with errors such as Unable to attach or mount volumes: unmounted volumes=[spectrum-discover-db2wh], unattached volumes=[], failed to process volumes=[]: timed out waiting for the condition and MountVolume.SetUp failed for volume "xxx" : rpc error: code = Internal desc = staging path yyy for volume zzz is not a mountpoint.
Resolution
Run the following steps to resolve this issue:
  1. Run the following commands to mark the compute nodes unschedulable and drain them one by one.
    oc adm cordon worker4.fusion-test-zlinux.cp.fyre.ibm.com
    oc adm drain worker4.fusion-test-zlinux.cp.fyre.ibm.com --ignore-daemonsets --force --delete-emptydir-data
  2. After the node is drained, make it schedulable again (oc adm uncordon), and then proceed to the next node with the same process.

    Draining removes stale directory entries from nodes that are incorrectly detected as mount points.

  3. The issue resolves automatically, and the Data Cataloging service returns to a healthy state after all the nodes are back up.

Data Cataloging service is unhealthy due to etcd crash

Problem statement
The Data Cataloging service goes into an unhealthy state due to etcd crash.
Diagnosis
Run the following steps to diagnose this issue:
  1. Run the following command to check for etcd pods that are not running.
    oc -n ibm-data-cataloging get pod -l component=etcd
  2. Run the following command to check whether the logs show the error "error":"wal: crc mismatch".
    oc -n ibm-data-cataloging logs -l component=etcd
Resolution
Run the following steps to resolve this issue:
  1. Run the following command to scale down etcd.
    oc -n ibm-data-cataloging scale --replicas=0 sts/c-isd-etcd
  2. Run the following command to access the pod that contains the affected etcd files.
    oc -n ibm-data-cataloging rsh c-isd-db2u-0
  3. Run the following commands to refresh the etcd.
    sudo rm -rf /mnt/blumeta0/etcd/c-isd-etcd-0/default.etcd
    sudo mv /mnt/blumeta0/etcd/c-isd-etcd-0/member_id /mnt/blumeta0/etcd/c-isd-etcd-0/member_id.bk
    exit
  4. Run the following command to scale up etcd.
    oc -n ibm-data-cataloging scale --replicas=1 sts/c-isd-etcd
  5. Run the following command to restart Db2.
    oc -n ibm-data-cataloging rsh c-isd-db2u-0
  6. Run the following command to restart Db2 high availability system.
    sv stop wolverine
    sv start wolverine
    exit
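Steps 2 to 4 boil down to two file operations inside the pod: deleting the etcd data directory and setting the member ID aside as a backup. The following is a local simulation with a dummy directory in place of /mnt/blumeta0/etcd/c-isd-etcd-0:

```shell
# Dummy stand-in for /mnt/blumeta0/etcd/c-isd-etcd-0 inside c-isd-db2u-0.
ETCD_DIR=$(mktemp -d)
mkdir -p "$ETCD_DIR/default.etcd"
echo "example-member-id" > "$ETCD_DIR/member_id"

# Same operations as the resolution steps (minus sudo, which the pod needs):
rm -rf "$ETCD_DIR/default.etcd"                     # drop the corrupted data dir
mv "$ETCD_DIR/member_id" "$ETCD_DIR/member_id.bk"   # keep the old member ID as a backup
```

Keeping member_id.bk rather than deleting it preserves a record of the old membership in case the refresh has to be investigated later.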

Db2 license does not display correctly on the upgrade setup

Problem statement
The Db2 license displays incorrectly on the Data Cataloging service upgrade setup.
Resolution
Run the following steps to resolve this issue:
  1. Run the following command to switch to the Data Cataloging project:
    oc project ibm-data-cataloging
  2. For the new license to take effect, delete the Db2 engine pods for the Db2uCluster or Db2uInstance:
    oc delete $(oc get po -l type=engine,formation_id=isd -oname)
  3. Once the new Db2 pod is ready, verify the updated Db2 license:
    oc exec -it  c-isd-db2u-0 -- su - db2inst1 -c "db2licm -l"

    For more about the Db2 community edition license certificate key, see Upgrading your Db2 Community Edition license certificate key.

ConstraintsNotSatisfiable for db2u-operator

Problem statement
The Db2 version used by Data Cataloging service 2.1.6 got removed from the latest version of the IBM Operator Catalog.
Resolution
Follow the steps to resolve the issue:
  1. Add or update the IBM Catalog Source object ibm-operator-catalog in the openshift-marketplace namespace during install or upgrade.
    apiVersion: operators.coreos.com/v1alpha1
    kind: CatalogSource
    metadata:
      name: ibm-operator-catalog-old
      namespace: openshift-marketplace
    spec:
      displayName: IBM Operator Catalog
      publisher: IBM
      sourceType: grpc
      image: icr.io/cpopen/ibm-operator-catalog@sha256:5d606e4eb2b875e0b975f892e80343105ea5fb0d67f96e1400d77a715f6df72a
      updateStrategy:
        registryPoll:
          interval: 45m
  2. Update the ibm-operator-catalog source with the latest tag after the installation or upgrade is completed.
    apiVersion: operators.coreos.com/v1alpha1
    kind: CatalogSource
    metadata:
      name: ibm-operator-catalog-old
      namespace: openshift-marketplace
    spec:
      displayName: IBM Operator Catalog
      publisher: IBM
      sourceType: grpc
      image: icr.io/cpopen/ibm-operator-catalog:latest
      updateStrategy:
        registryPoll:
          interval: 45m

IBM Storage Scale live events not working

Problem statement
IBM Storage Scale live events do not work when you enable the live events capability on the Data Connections menu in the Data Cataloging user interface.
Resolution
Follow the steps to resolve the issue:
Note:
  • IBM Storage Scale live events capability on Data Cataloging requires IBM Storage Scale advanced edition license.
  • Edit the associated IBM Storage Scale connection in the Data Connections section of the Data Cataloging user interface to gather the environment variable information.

  • Information for the route can be gathered by using the oc client logged in to the OpenShift Container Platform cluster.
  1. Set the following environment variables on the IBM Storage Scale cluster, and ensure that the Enable live events checkbox is selected on the data source connection modal in the Data Cataloging user interface.

    Log in to OpenShift through the oc client and select the ibm-data-cataloging namespace.

    DATASOURCE=<filesystem>
    KAFKA_BROKER_IP=$(oc get routes isd-kafka-tlsext-bootstrap -o jsonpath='{.spec.host}')
    KAFKA_EXT_BROKER_PORT_HTTPS=443
    SINK_TOPIC=scale-le-connector-topic
    SINK_AUTH_CONFIG=<working dir>/kafka/auth.file
  2. Run the following command on the IBM Storage Scale cluster after the variables are set in the environment.
    sudo /usr/lpp/mmfs/bin/mmwatch ${DATASOURCE} enable --event-handler kafkasink --sink-brokers "${KAFKA_BROKER_IP}:${KAFKA_EXT_BROKER_PORT_HTTPS}" --sink-topic ${SINK_TOPIC} --sink-auth-config ${SINK_AUTH_CONFIG} --events IN_ATTRIB,IN_CLOSE_WRITE,IN_MODIFY,IN_CREATE,IN_DELETE,IN_MOVED_FROM,IN_MOVED_TO

    After you run the command, live events are enabled for that particular IBM Storage Scale connection.

Data Cataloging service goes into terminating state during IBM Storage Fusion upgrade

Problem statement
The Data Cataloging service goes into a terminating state or becomes stuck in upgrade during the IBM Storage Fusion upgrade.
Resolution
Follow the steps to resolve the issue:
  1. Run the following command to log in to your server by using the oc command-line interface.
    oc login --token=<YOUR_TOKEN> --server=<YOUR_SERVER>
  2. Remove the finalizers for every Kafka topic created in AMQ streams.
    for kafkatopic in $(oc get kafkatopics.kafka.strimzi.io -n ibm-data-cataloging --no-headers | awk '{print $1}'); do oc patch kafkatopics.kafka.strimzi.io $kafkatopic -n ibm-data-cataloging --type='json' -p='[{"op": "remove", "path": "/metadata/finalizers"}]'; done
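The JSON patch in the loop above removes the whole metadata.finalizers list from each KafkaTopic so that deletion can proceed. The following is a local illustration on a minimal manifest; the topic name and finalizer value are assumptions for the sketch:

```shell
# Assumed minimal KafkaTopic manifest; only the finalizers lines matter here.
cat > /tmp/kafkatopic.yaml <<'EOF'
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaTopic
metadata:
  name: example-topic
  finalizers:
    - strimzi.io/topic-operator
EOF

# Local equivalent of the patch {"op": "remove", "path": "/metadata/finalizers"}:
sed -i '/finalizers:/,/- strimzi/d' /tmp/kafkatopic.yaml
grep finalizers /tmp/kafkatopic.yaml || echo "finalizers removed"
```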