Health check monitoring
Provides a series of checkpoints for verifying the health status of IBM Data Cataloging.
This topic describes the IBM Data Cataloging health status and the commands that can be used for continuous health monitoring.
While a scan is running, these checkpoints help verify that data sources are being scanned properly.
- Monitor the connection manager and db2whrest pods frequently to avoid overhead on the database
- One issue that might disrupt IBM Data Cataloging is overrunning the database, which causes malfunctions in other components such as the user interface or the connection manager.
- Check the request latency of a connection by filtering the db2whrest log
- This checkpoint is important for checking the time that a request query takes to be processed. The following example uses one connection to see how long the request takes to respond. Values below 2000 milliseconds are expected.
- Run the following command to check the pod that is associated with db2whrest.
$ oc get pods | grep db2whrest
Example output:
isd-db2whrest-7x574c7bf-c7l9t   1/1   Running   0   43h
- Get the logs and filter them. For example:
$ oc logs isd-db2whrest-7x574c7bf-c7l9t | grep -E '${connectionName}.*requestLatency'
Example output:
{"type": "AUDIT", "hostname": "172.17.47.184", "serverAddress": "172.21.166.189", "userAgent": "python-requests/2.31.0", "timestampStart": "2023-06-22 21:27:55+00:00", "request": "POST /db2whrest/v1/update_doc/connections/${connectionName}", "protocol": "HTTP/1.1", "requestId": "8fd142a5-5806-48a9-9872-1fa1eba227b1", "responseStatus": 200, "responseSize": 21, "requestLatency": 81, "auth": {"username": "bluadmin", "scheme": "basic"}, "principal": {"uuid": "896e5252-5299-422c-a72c-260d922bdf31", "type": "account"}, "service": {"node": "Unset", "service": "Unset"}}
- In this example, the requestLatency value is 81 milliseconds; any value below 2000 is acceptable. This is an important checkpoint that can give you clues about database behavior when you submit a new scan.
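- If you want to look at latency across many requests rather than a single log entry, the requestLatency values can be extracted from the same filtered output. The following is a minimal sketch based on the log format shown above:
# Print the five highest requestLatency values (milliseconds); all of them should be below 2000
$ oc logs isd-db2whrest-7x574c7bf-c7l9t | grep -E '${connectionName}.*requestLatency' | grep -oE '"requestLatency": [0-9]+' | awk '{print $2}' | sort -n | tail -5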
- Monitor Kafka lag as a complementary signal of a healthy product
- When a scan is being processed, one signal of a healthy environment is the lag, that is, the difference between the number of scanned records and the number of indexed records. The lag should decrease over time as indexing catches up with scanning.
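- The product does not require querying Kafka directly for this, but if you want to inspect consumer lag yourself, the standard Kafka CLI can report it per partition. The following is a minimal sketch only; the Kafka pod selection, the availability of the CLI inside that pod, and the consumer groups are assumptions that might differ in your deployment.
# Find a Kafka broker pod (the grep pattern is an assumption) and open a shell in it
$ kafkaPod=$(oc get pods | grep kafka | head -1 | awk '{print $1}')
$ oc rsh ${kafkaPod}
# Describe consumer-group lag; the LAG column should trend toward 0 as indexing catches up with scanning
sh-4.4$ kafka-consumer-groups.sh --bootstrap-server localhost:9092 --describe --all-groups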
- Check average insert rate of a scan and batch size
- The following checkpoints give clues to the health of the insert rate into the database and the Kafka batch size.
- Insert rate values might vary depending on different factors; however, the expected rate is around 200 messages per second, meaning that each consumer inserts about 200 messages per second into the database.
- Regarding the Kafka batch size, it is the amount of data that is accumulated and sent as a batch by a Kafka producer. This value is expected to be around 50,000 overall. If the value decreases significantly for a long period, it might be a sign of misbehavior.
- Run the following command to get the consumer pod details.
$ oc get pods | grep consumer-file-scan
Note: In this case, the consumer type is file because this is an NFS scan, and there should be 10 consumer pods (a quick way to count them is shown after the example output).
Example output:
isd-consumer-file-scan-6d6bbf8487-45c6c   1/1   Running   0   7d13h
isd-consumer-file-scan-6d6bbf8487-6zfv9   1/1   Running   0   7d13h
isd-consumer-file-scan-6d6bbf8487-77g7v   1/1   Running   0   7d13h
isd-consumer-file-scan-6d6bbf8487-97cfv   1/1   Running   0   7d13h
isd-consumer-file-scan-6d6bbf8487-bwmr5   1/1   Running   0   7d13h
isd-consumer-file-scan-6d6bbf8487-c5cb9   1/1   Running   0   7d13h
isd-consumer-file-scan-6d6bbf8487-ljkws   1/1   Running   0   7d13h
isd-consumer-file-scan-6d6bbf8487-m96hc   1/1   Running   0   7d13h
isd-consumer-file-scan-6d6bbf8487-w55lr   1/1   Running   0   7d13h
isd-consumer-file-scan-6d6bbf8487-z8hqd   1/1   Running   0   7d13h
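- A quick way to confirm that all consumers are up is to count the running consumer pods; for this NFS scan example the command should report 10:
$ oc get pods | grep consumer-file-scan | grep -c Running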
- Get the logs and filter them. For example:
$ oc logs isd-consumer-file-scan-6d6bbf8487-45c6c | grep -A5 "Avg Insert"
Example output:
… omitted …
DB Avg Insert Rate (msg/sec): { current: 210, min: 1, max: 245 }
2023-06-22 13:54:40.611 > offset_commit_cb: success, offsets:[{part: 1, offset: 79452963, err: none}]
2023-06-22 13:56:25.588 > Consumed= 50014 Dup_id= 13 Skipped= 0 |Last_Batch: Att= 50000 Succ= 50000 Fail= 0 |Total: Att= 50001 Succ= 50001 Fail= 0 | Batch_Time(ms): E= 160 B= 36155 L= 68659 KC= 0 TH= 0 Tot= 104974
DB Avg Insert Rate (msg/sec): { current: 192, min: 1, max: 245 }
2023-06-22 13:56:25.605 > offset_commit_cb: success, offsets:[{part: 1, offset: 79502976, err: none}]
2023-06-22 13:59:44.298 > Consumed= 100024 Dup_id= 23 Skipped= 0 |Last_Batch: Att= 50000 Succ= 50000 Fail= 0 |Total: Att= 100001 Succ= 100001 Fail= 0 | Batch_Time(ms): E= 139 B= 11888 L= 186663 KC= 1 TH= 0 Tot= 198691
DB Avg Insert Rate (msg/sec): { current: 196, min: 1, max: 245 }
2023-06-22 13:59:44.314 > offset_commit_cb: success, offsets:[{part: 1, offset: 79552986, err: none}]
2023-06-22 14:03:04.724 > Consumed= 150031 Dup_id= 30 Skipped= 0 |Last_Batch: Att= 50000 Succ= 50000 Fail= 0 |Total: Att= 150001 Succ= 150001 Fail= 0 | Batch_Time(ms): E= 155 B= 1594 L= 197868 KC= 1 TH= 0 Tot= 200408
DB Avg Insert Rate (msg/sec): { current: 206, min: 1, max: 245 }
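- To pull out only the insert-rate readings instead of reading the full batch summaries, the same log can be filtered further. The following is a minimal sketch based on the log format shown above:
# Print the last 10 "current" insert-rate readings (msg/sec); sustained values far below ~200 warrant a closer look
$ oc logs isd-consumer-file-scan-6d6bbf8487-45c6c | grep -oE 'DB Avg Insert Rate \(msg/sec\): \{ current: [0-9]+' | awk '{print $NF}' | tail -10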
- Check Qpart assignment when a new connection is submitted for a scan
- This checkpoint verifies that, when a scan is submitted, the connection is assigned to all 10 Qparts (0–9) and that the scangen value remains constant.
- Run the following command to get the producer pod details.
$ oc get pods | grep producer-file-scan
Note: In this case, the producer type is file because this is an NFS scan; a single producer pod is expected.
Example output:
isd-producer-file-scan-79c45bbc77-jm5v6   1/1   Running   4 (7d14h ago)   7d14h
- Get the logs from the producer pod and filter by the connection name and COMMITSCAN. For example:
$ oc logs isd-producer-file-scan-79c45bbc77-jm5v6 | grep -E "${connectionName}.*COMMITSCAN"
Example output:
Built commit scan message [connection:${connectionName},eventname:COMMITSCAN,scangen:1,qpart:0]
Built commit scan message [connection:${connectionName},eventname:COMMITSCAN,scangen:1,qpart:1]
Built commit scan message [connection:${connectionName},eventname:COMMITSCAN,scangen:1,qpart:2]
Built commit scan message [connection:${connectionName},eventname:COMMITSCAN,scangen:1,qpart:3]
Built commit scan message [connection:${connectionName},eventname:COMMITSCAN,scangen:1,qpart:4]
Built commit scan message [connection:${connectionName},eventname:COMMITSCAN,scangen:1,qpart:5]
Built commit scan message [connection:${connectionName},eventname:COMMITSCAN,scangen:1,qpart:6]
Built commit scan message [connection:${connectionName},eventname:COMMITSCAN,scangen:1,qpart:7]
Built commit scan message [connection:${connectionName},eventname:COMMITSCAN,scangen:1,qpart:8]
Built commit scan message [connection:${connectionName},eventname:COMMITSCAN,scangen:1,qpart:9]
In this case, the expected result is that the connection is assigned to the 10 different partitions and that scangen is constant.
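- To confirm the assignment quickly, you can count the distinct qpart values and check that only one scangen value appears. The following is a minimal sketch against the log format shown above:
# All 10 Qparts (0-9) should appear, so the first command should print 10
$ oc logs isd-producer-file-scan-79c45bbc77-jm5v6 | grep -E "${connectionName}.*COMMITSCAN" | grep -oE 'qpart:[0-9]+' | sort -u | wc -l
# scangen should be constant, so the second command should print a single value, for example scangen:1
$ oc logs isd-producer-file-scan-79c45bbc77-jm5v6 | grep -E "${connectionName}.*COMMITSCAN" | grep -oE 'scangen:[0-9]+' | sort -u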
- Check that connections are progressing and in the scan-in-progress state when a new scan is started
- These requests can be made to the connection manager API endpoint to check the status of a connection. It is necessary to create a token first to consume the API properly. For more information about how to get the token, see /auth/v1/token: GET.
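- As an illustrative sketch only: the token comes from the /auth/v1/token endpoint referenced above, and the connection status is then requested from the connection manager API. The token header and the connection-listing path shown here are assumptions; confirm the exact endpoint and response fields in the API reference.
# Illustrative only: host, credentials, token header, and connections path are assumptions
$ TOKEN=$(curl -ks -I -u ${username}:${password} https://${applicationHost}/auth/v1/token | grep -i '^x-auth-token' | awk '{print $2}' | tr -d '\r')
$ curl -ks -H "Authorization: bearer ${TOKEN}" https://${applicationHost}/connmgr/v1/connections
# While the scan runs, the returned connection entry should show a scan-in-progress state and increasing progress counters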
- Check that database records are increasing while a scan is progressing
- If querying the API or the scanning progress on the user interface does not show any progress, it might give the impression that the system is not responding properly. However, that might not be accurate. It is necessary to establish a connection to the database and run some queries to ensure that the system is actually inserting records into the proper table. Warning: The database is stressed when doing several transactions, so be careful when you query it directly. Also, avoid running complex queries while one or more scans are in progress; this might cause unnecessary overhead and lead to unexpected database behavior.
- Run the following command to get the Db2 head pod and access it.
$ headPodName=$(oc -n ${projectName} get po --selector name=dashmpp-head-0 | grep isd | awk '{print $1}')
$ oc -n ${projectName} rsh ${headPodName}
- Change the user to db2inst1 and connect to the bludb database.
sh-4.4$ su - db2inst1
[db2inst1@c-isd-db2u-0 - Db2U ]$ db2 connect to bludb

   Database Connection Information

 Database server        = DB2/LINUXX8664 11.5.7.0
 SQL authorization ID   = DB2INST1
 Local database alias   = BLUDB

# Run a couple of counting queries against the metaocean table, leaving a few seconds between queries, and make sure the count is increasing
[db2inst1@c-isd-db2u-0 - Db2U ]$ db2 "select count(fkey) from bluadmin.metaocean"

1
---------------------------------
10364.

  1 record(s) selected.

[db2inst1@c-isd-db2u-0 - Db2U ]$ db2 "select count(fkey) from bluadmin.metaocean"

1
---------------------------------
10952.

  1 record(s) selected.
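- If you prefer to sample the count automatically instead of rerunning the query by hand, a short loop can be used. The following is a minimal sketch run in the same Db2 session as above; the number of samples and the 30-second pause are arbitrary, and the query is kept deliberately simple in line with the warning above about database load.
# Sample the row count a few times; it should increase while a scan is inserting records
[db2inst1@c-isd-db2u-0 - Db2U ]$ for i in 1 2 3; do db2 -x "select count(fkey) from bluadmin.metaocean"; sleep 30; done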