Health check monitoring

Provides a series of checkpoints for verifying the IBM Data Cataloging health status.

It describes the IBM Data Cataloging health status and the commands that can be used for continuous health monitoring.

While a scan is running, these checkpoints can be used to ensure that connections are being scanned properly.

Monitor connection manager and db2whrest pods frequently to avoid overhead on the database
One of the issues that might disrupt IBM Data Cataloging is an overloaded database, which causes malfunctions in components such as the user interface or the connection manager.
Memory and CPU consumption for the connection manager and db2whrest pods should stay at or below approximately 85% of their limits.
Run the following commands to ensure that usage stays within those limits during regular operation:
  1. Check the current usage of the connmgr and db2whrest pods.
    $oc adm top pod isd-connmgr-795544f666-l25mc
    $oc adm top pod isd-db2whrest-7c574c7bf-x7l9t
    
    Example output:
    $oc adm top pod isd-connmgr-795544f666-l25mc
    NAME                                     CPU(cores)   MEMORY(bytes)
    isd-connmgr-scheduler-795544f666-l25mc   0m           34Mi
    
    $oc adm top pod isd-db2whrest-7c574c7bf-x7l9t
    NAME                            CPU(cores)   MEMORY(bytes)
    isd-db2whrest-7c574c7bf-x7l9t   21m          512Mi
    
  2. Compare the current usage with the configured limits and make sure that usage does not stay above 85% of the limits for a long period of time.
    $oc describe pod isd-connmgr-795544f666-l25mc|grep -A2 Limits
        Limits:
          cpu:     1
          memory:  4Gi
    $oc describe pod isd-db2whrest-7c574c7bf-x7l9t|grep -A2 Limits
        Limits:
          cpu:     2
          memory:  8Gi
    
    Note: IBM Data Cataloging recommends running at most 10 parallel scans to avoid malfunctions or a significant reduction in product performance.
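If you want a single check that does the comparison for you, a small script along the following lines can help. It is only a sketch: the pod name is an example taken from the output above, it reads the limits of the first container in the pod, and it assumes that CPU limits are expressed in whole cores and memory limits in Gi, as shown above.

  pod=isd-db2whrest-7c574c7bf-x7l9t        # example pod name; substitute your own

  # Current usage in millicores and MiB, as reported by the metrics API
  read -r _ cpu_use mem_use <<< "$(oc adm top pod "$pod" | tail -1)"

  # Configured limits of the first container (assumed to be whole cores and Gi)
  cpu_lim=$(oc get pod "$pod" -o jsonpath='{.spec.containers[0].resources.limits.cpu}')
  mem_lim=$(oc get pod "$pod" -o jsonpath='{.spec.containers[0].resources.limits.memory}')

  # Percentage of each limit that is in use; sustained values above 85% indicate a problem
  awk -v u="${cpu_use%m}" -v l="$cpu_lim" 'BEGIN { printf "CPU:    %.0f%% of limit\n", u / (l * 1000) * 100 }'
  awk -v u="${mem_use%Mi}" -v l="${mem_lim%Gi}" 'BEGIN { printf "Memory: %.0f%% of limit\n", u / (l * 1024) * 100 }'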
Check the request latency of a connection by filtering the db2whrest log
This checkpoint checks the time that a request query takes to be processed. The following example uses one connection to show how long the request takes to respond. Values of less than 2000 milliseconds are expected.
  1. Run the following command to check the pod that is associated with db2whrest.
    $oc get pods | grep db2whrest
    Example output:
    $oc get pods | grep db2whrest
    isd-db2whrest-7x574c7bf-c7l9t		1/1	Running		0	43h
    
  2. Get the logs and filter them.
    For example:
    $oc logs isd-db2whrest-7x574c7bf-c7l9t|grep -E '${connectionName}.*requestLatency'
    {"type": "AUDIT", "hostname": "172.17.47.184", "serverAddress": "172.21.166.189", "userAgent": "python-requests/2.31.0", "timestampStart": "2023-06-22 21:27:55+00:00", "request": "POST /db2whrest/v1/update_doc/connections/${connectionName}", "protocol": "HTTP/1.1", "requestId": "8fd142a5-5806-48a9-9872-1fa1eba227b1", "responseStatus": 200, "responseSize": 21, "requestLatency": 81, "auth": {"username": "bluadmin", "scheme": "basic"}, "principal": {"uuid": "896e5252-5299-422c-a72c-260d922bdf31", "type": "account"}, "service": {"node": "Unset", "service": "Unset"}}
    
  3. Confirm that the requestLatency value is less than 2000 milliseconds. This is an important checkpoint that can give you clues about database behavior when you submit a new scan.
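To scan the whole log for slow requests instead of inspecting individual entries, a filter along the following lines can be used. It is a sketch only: the pod name is taken from the example above, ${connectionName} is your connection, and it assumes that jq is installed and that the matching lines are JSON audit entries like the one shown in step 2.

  # List audit entries for the connection whose requestLatency is 2000 ms or more
  oc logs isd-db2whrest-7x574c7bf-c7l9t \
    | grep "${connectionName}" | grep requestLatency \
    | jq -c 'select(.requestLatency >= 2000) | {timestampStart, request, requestLatency}'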
Monitor Kafka lag as a complementary signal of a healthy product
When a scan is being processed, one signal of a healthy environment is the Kafka lag, which is the difference between scanned records and indexed records.
A healthy state is when the lag is less than 30%. An unhealthy state occurs when the number of scanned records continues to increase without progress in indexing, resulting in a lag of more than 30% that is sustained for an extended period.
To resolve that situation, reduce the number of parallel scans, which gives the ingestion rate time to catch up and reduces the lag.
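As a quick arithmetic reference, the lag percentage can be computed as follows; the scanned and indexed counts are placeholder values that you replace with the numbers reported for your scan.

  # Lag as the percentage of scanned records that are not yet indexed (placeholder values)
  scanned=1200000
  indexed=900000
  awk -v s="$scanned" -v i="$indexed" \
    'BEGIN { printf "Kafka lag: %.1f%% (healthy when below 30%%)\n", (s - i) / s * 100 }'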
Check average insert rate of a scan and batch size
The following checkpoints give clues about the health of the database insert rate and the Kafka batch size.
  1. Insert rate values might vary depending on different factors. However, the expected rate is around 200 messages per second per consumer; that is, each consumer inserts around 200 messages per second into the database.
  2. The Kafka batch size is the amount of data that is accumulated and sent as a batch by a Kafka producer. This value is expected to be around 50,000 overall. If the value decreases significantly for a long period, it might be a sign of misbehavior.
  3. Run the following command to get the consumer pod details.
    $oc get pods|grep consumer-file-scan
    Note: In this case, the consumer type is file because this is an NFS scan; there should be 10 consumer pods.
    Example output:
    isd-consumer-file-scan-6d6bbf8487-45c6c                           1/1     Running     0               7d13h
    isd-consumer-file-scan-6d6bbf8487-6zfv9                           1/1     Running     0               7d13h
    isd-consumer-file-scan-6d6bbf8487-77g7v                           1/1     Running     0               7d13h
    isd-consumer-file-scan-6d6bbf8487-97cfv                           1/1     Running     0               7d13h
    isd-consumer-file-scan-6d6bbf8487-bwmr5                           1/1     Running     0               7d13h
    isd-consumer-file-scan-6d6bbf8487-c5cb9                           1/1     Running     0               7d13h
    isd-consumer-file-scan-6d6bbf8487-ljkws                           1/1     Running     0               7d13h
    isd-consumer-file-scan-6d6bbf8487-m96hc                           1/1     Running     0               7d13h
    isd-consumer-file-scan-6d6bbf8487-w55lr                           1/1     Running     0               7d13h
    isd-consumer-file-scan-6d6bbf8487-z8hqd                           1/1     Running     0               7d13h
    
  4. Get the logs and filter them.
    For example:
    $oc logs isd-consumer-file-scan-6d6bbf8487-45c6c|grep -A5 "Avg Insert"
    … omitted …
    DB Avg Insert Rate (msg/sec): { current: 210, min: 1, max: 245 }
    2023-06-22 13:54:40.611 > offset_commit_cb: success, offsets:[{part: 1, offset: 79452963, err: none}]
    2023-06-22 13:56:25.588 > Consumed= 50014 Dup_id= 13 Skipped= 0 |Last_Batch: Att= 50000 Succ= 50000 Fail= 0 |Total: Att= 50001 Succ= 50001 Fail= 0 | Batch_Time(ms): E= 160 B= 36155 L= 68659 KC= 0 TH= 0 Tot= 104974
     DB Avg Insert Rate (msg/sec): { current: 192, min: 1, max: 245 }
    2023-06-22 13:56:25.605 > offset_commit_cb: success, offsets:[{part: 1, offset: 79502976, err: none}]
    2023-06-22 13:59:44.298 > Consumed= 100024 Dup_id= 23 Skipped= 0 |Last_Batch: Att= 50000 Succ= 50000 Fail= 0 |Total: Att= 100001 Succ= 100001 Fail= 0 | Batch_Time(ms): E= 139 B= 11888 L= 186663 KC= 1 TH= 0 Tot= 198691
     DB Avg Insert Rate (msg/sec): { current: 196, min: 1, max: 245 }
    2023-06-22 13:59:44.314 > offset_commit_cb: success, offsets:[{part: 1, offset: 79552986, err: none}]
    2023-06-22 14:03:04.724 > Consumed= 150031 Dup_id= 30 Skipped= 0 |Last_Batch: Att= 50000 Succ= 50000 Fail= 0 |Total: Att= 150001 Succ= 150001 Fail= 0 | Batch_Time(ms): E= 155 B= 1594 L= 197868 KC= 1 TH= 0 Tot= 200408
     DB Avg Insert Rate (msg/sec): { current: 206, min: 1, max: 245 }
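To get a quick overview across all consumers instead of reading each log separately, a loop like the following can print the most recent insert-rate line per pod. This is a sketch that filters pods by name; adjust the pattern if your pod names differ.

  # Print the latest "DB Avg Insert Rate" line from each consumer-file-scan pod;
  # values far below 200 msg/sec for a long period might indicate a problem
  for pod in $(oc get pods --no-headers | awk '/consumer-file-scan/ {print $1}'); do
    rate=$(oc logs "$pod" --tail=2000 | grep "DB Avg Insert Rate" | tail -1)
    echo "$pod  ${rate:-no insert-rate line found in the last 2000 lines}"
  done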
Check Qpart assignment when a new connection is submitted for scan
This checkpoint verifies that, when a scan is submitted, the connection is correctly assigned to all 10 Qparts (0–9) and that scangen remains constant.
  1. Run the following command to get the producer pod details.
    $oc get pods|grep producer-file-scan
    Note: In this case, the producer type is file because this is an NFS scan; only one producer pod is expected.
    Example output:
    $oc get pods|grep producer-file-scan
    isd-producer-file-scan-79c45bbc77-jm5v6                           1/1     Running     4 (7d14h ago)   7d14h
    
  2. Get the logs from the producer pod and filter on the connection name and COMMITSCAN.
    For example:
    $oc logs isd-producer-file-scan-79c45bbc77-jm5v6|grep -E '${connectionName}.*COMMITSCAN'
    Built commit scan message [connection:${connectionName},eventname:COMMITSCAN,scangen:1,qpart:0]
    Built commit scan message [connection:${connectionName},eventname:COMMITSCAN,scangen:1,qpart:1]
    Built commit scan message [connection:${connectionName},eventname:COMMITSCAN,scangen:1,qpart:2]
    Built commit scan message [connection:${connectionName},eventname:COMMITSCAN,scangen:1,qpart:3]
    Built commit scan message [connection:${connectionName},eventname:COMMITSCAN,scangen:1,qpart:4]
    Built commit scan message [connection:${connectionName},eventname:COMMITSCAN,scangen:1,qpart:5]
    Built commit scan message [connection:${connectionName},eventname:COMMITSCAN,scangen:1,qpart:6]
    Built commit scan message [connection:${connectionName},eventname:COMMITSCAN,scangen:1,qpart:7]
    Built commit scan message [connection:${connectionName},eventname:COMMITSCAN,scangen:1,qpart:8]
    Built commit scan message [connection:${connectionName},eventname:COMMITSCAN,scangen:1,qpart:9]
    

    In this case, the output is acceptable: the connection is assigned to 10 different partitions and scangen is constant.
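A quick way to confirm the assignment without reading every line is to count the distinct Qparts that received a COMMITSCAN message, as in the following sketch; the pod name is the one from step 1 and ${connectionName} is your connection.

  # Count the distinct Qparts for the connection; a healthy scan shows 10 (qpart 0-9)
  oc logs isd-producer-file-scan-79c45bbc77-jm5v6 \
    | grep -E "${connectionName}.*COMMITSCAN" \
    | grep -o 'qpart:[0-9]*' | sort -u | wc -l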

Check connection progress and the scan-in-progress state when a new scan is started
These requests can be made to the connection manager API endpoint to check the status of a connection. You must create a token before you can consume the API. For more information about how to get the token, see /auth/v1/token: GET.

At this stage, the token is saved in the ${TOKEN} variable, hostRoute and connectionName are identified, and jq is installed for JSON formatting purposes.

For example:
$ curl -s -H "Authorization: Bearer $TOKEN" https://<<hostRoute>>/connmgr/v1/connections/<<connectionName>>|jq
{
  "name": "exports",
  "platform": "NFS",
  "cluster": "nfs-1.nfs.svc.cluster.local",
  "datasource": "exports",
  "current_gen": 3,
  "password": null,
  "site": "",
  "online": 1,
  "scan_topic": "file-scan-connector-topic",
  "le_topic": "file-le-connector-topic",
  "le_enabled": 1,
  "host": "nfs-1.nfs.svc.cluster.local",
  "mount_point": "/exports",
  "protocol": "nfs",
  "user": null,
  "additional_info": "{\"working_dir\": \"/nfs-scanner/\", \"local_mount\": \"/nfs-scanner/da68f4b426134c5d9fb498a15219857f\"}",
  "scan_in_progress": 1,
  "total_records": 10162,
  "scan_compl_rec": 0,
  "schedule": null
}
As shown, there is a scan in progress for that specific connection, and the total_records value should be increasing over time. Allow a few seconds for the data to be reflected on the endpoint.
These are some of the checkpoints that can be used to check the state of the connections. This is one example of the many endpoints that the IBM Data Cataloging API exposes. For more information, see REST API for IBM Data Cataloging.
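For continuous monitoring while a scan runs, the same endpoint can also be polled in a loop, for example as sketched below. It reuses ${TOKEN}, <<hostRoute>>, and <<connectionName>> from the example above and assumes that jq is installed; the total_records value should grow between iterations.

  # Poll the connection every 30 seconds and print only the scan-progress fields
  while true; do
    curl -s -H "Authorization: Bearer $TOKEN" \
      "https://<<hostRoute>>/connmgr/v1/connections/<<connectionName>>" \
      | jq '{scan_in_progress, total_records, scan_compl_rec}'
    sleep 30
  done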
Check database records increasing while scan is progressing
If querying the API or checking scan progress on the user interface does not show any progress, it might give the impression that the system is not responding properly. However, that might not be accurate. Connect to the database and run some queries to confirm that the system is actually inserting records into the proper table.
Warning: The database is under stress when it is processing many transactions. Be careful when you query the database directly. Also, avoid running complex queries while one or more scans are in progress; doing so might cause unnecessary overhead and unusual database behavior.
  1. Run the following commands to get the Db2 head pod and access it.
    $ headPodName=$(oc -n ${projectName} get po --selector name=dashmpp-head-0|grep isd|awk '{print $1}')
    $ oc -n ${projectName} rsh ${headPodName}
    
  2. Change user to db2inst1 and connect to the bludb database.
    sh-4.4$ su - db2inst1
    [db2inst1@c-isd-db2u-0 - Db2U ]$ db2 connect to bludb
    
       Database Connection Information
    
     Database server        = DB2/LINUXX8664 11.5.7.0
     SQL authorization ID   = DB2INST1
     Local database alias   = BLUDB
    
    # Run a couple of count queries against the metaocean table, a few seconds apart, and make sure that the count is increasing
    
    [db2inst1@c-isd-db2u-0 - Db2U ]$ db2 "select count(fkey) from bluadmin.metaocean"
    
    1
    ---------------------------------
                               10364.
    
      1 record(s) selected.
    
    [db2inst1@c-isd-db2u-0 - Db2U ]$ db2 "select count(fkey) from bluadmin.metaocean"
    
    1
    ---------------------------------
                               10952.
    
      1 record(s) selected.
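
Instead of rerunning the query by hand, the count can be repeated in the same db2inst1 session, for example with the following sketch; the 30-second interval is arbitrary, and the table and column names are the ones used above.

  # Repeat the count every 30 seconds; the value should keep increasing while the scan runs
  while true; do
    db2 -x "select count(fkey) from bluadmin.metaocean"
    sleep 30
  done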