Analytics Toolkit API service performance troubleshooting
Slow or missing responses from the Analytics Toolkit service, /api/atk, might be caused by a backup-restore process or by accumulated failures in previous application programming interface (API) calls.
Symptoms
You do not receive Analytics Toolkit API responses within a reasonable time, or do not receive any Analytics Toolkit API response.
Causes
The failures might be caused by misconfigurations. However, the Analytics Toolkit service must be working correctly before you can call the APIs again with your corrected configurations. To restore the service, use the aitk_job support action.
Accumulated Analytics Toolkit job failures can result in some cluster resources not being released after use. This situation causes performance issues for the Analytics Toolkit.
Environment
The Analytics Toolkit is a workflow engine that you use to chain artificial intelligence (AI) models and data ETL (extract, transform, load) steps together to produce predictions. For instance, use it to extract process commands from Universal Data Insights data and predict whether they are related to a cybersecurity threat. You might experience performance issues, such as slow response times, under the following conditions.
- If the Analytics Toolkit encounters a failure due to misconfigurations of the ETL steps or the AI model parameters, some cluster resources might not be released after use.
- If users create several continuous Analytics Toolkit workflows or jobs and forget to delete the jobs manually, some cluster resources might be occupied continuously.
- If the Analytics Toolkit service is restarted without deleting existing jobs, some cluster resources might not be released properly.
Diagnosing the problem
Install the command-line interface (CLI) utility cpctl from the cp-serviceability pod. For more information, see Installing the cpctl utility.
To find the cause of the performance issue in the Analytics Toolkit, run the following aitk_job action. The --token parameter specifies the cluster administrator token. To generate the token, log in as an admin user and run the command: oc whoami -t.
cpctl remediation aitk_job --token <token>
This command prints all the jobs that are owned by the Analytics Toolkit. Use the job list to check for the following issues.
- If failed jobs are found after you run the support action, the performance issue is likely to be caused by the failed jobs. The failed jobs are usually due to misconfigurations.
- If inactive jobs are found after you run the support action, the performance issue is likely to be caused by inactive jobs that users did not delete manually. Users who forget to delete inactive jobs might also no longer have the job IDs that are needed for deletion.
- If out-of-date jobs are found when you run the support action after the Analytics Toolkit service is restarted, you must remove the jobs to restore performance.
You can use the support action to address these issues, and then manually restart the Analytics Toolkit service to fully restore it.
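The diagnosis step can also be scripted. The following Python sketch is one possible way to do that, assuming that oc and cpctl are both on the PATH and that you are already logged in as an admin user. The helper names (diagnosis_command, current_token, run_diagnosis) are illustrative only and are not part of cpctl.

```python
import subprocess


def diagnosis_command(token: str) -> list[str]:
    """Build the argument list for the aitk_job diagnosis action."""
    token = token.strip()
    if not token:
        raise ValueError("token must not be empty")
    return ["cpctl", "remediation", "aitk_job", "--token", token]


def current_token() -> str:
    """Read the token of the logged-in user by running `oc whoami -t`."""
    result = subprocess.run(
        ["oc", "whoami", "-t"], capture_output=True, text=True, check=True
    )
    return result.stdout.strip()


def run_diagnosis() -> str:
    """Run the diagnosis action and return the printed job list."""
    result = subprocess.run(
        diagnosis_command(current_token()),
        capture_output=True, text=True, check=True,
    )
    return result.stdout
```

Passing the arguments as a list, rather than as one shell string, keeps the token from being interpreted by a shell.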
Resolving the problem
The following table describes remediation tasks to remove problematic jobs.
Problem | Remediation |
---|---|
Failed jobs | If failed jobs are found after you run the diagnosis support action, run the following aitk_job action. To generate the value for <token>, run the command: oc whoami -t. This action removes all failed Analytics Toolkit jobs and improves the performance of the Analytics Toolkit service. If a line in the run log for this action reports an error but starts with [INFO] instead of [ERROR], the error is expected. It results from an attempt to delete a job resource that was already deleted. The purpose of aitk_job deletion is to free resources that were not freed correctly, so resource not found errors indicate that the resources occupied by the to-be-deleted jobs were already freed correctly. In that case, the performance issue has another cause and you must try the other remediation options. This concept applies to all job deletions in the following remediation instructions. |
Inactive jobs | If inactive jobs are found after you run the diagnosis support action, run the following aitk_job action. This action removes all inactive Analytics Toolkit jobs that are caused by continuous workflows and improves the performance of the Analytics Toolkit service. |
Out-of-date jobs | If only out-of-date jobs are found when you run the diagnosis action after the Analytics Toolkit service is restarted, run the following aitk_job action. This action removes all the Analytics Toolkit jobs and helps to improve the performance of the Analytics Toolkit service. To manually restart the Analytics Toolkit service and delete all out-of-date jobs, run the following command. If you know the exact job ID that is causing the performance issue, run the following aitk_job action. This action removes all resources that are occupied by the problematic job and helps to improve the performance of the Analytics Toolkit service. |
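When you target a single job, a mistyped ID is easy to miss. The following Python sketch builds the argument list for cpctl remediation aitk_job --delete <job_ID> --token <token> and rejects malformed IDs first. It assumes that job IDs are UUIDs, which matches the ID in the sample output but is not stated as a rule; the function name delete_job_command is hypothetical.

```python
import uuid


def delete_job_command(job_id: str, token: str) -> list[str]:
    """Build the aitk_job argument list that deletes one job by ID."""
    # Assumption: job IDs are UUIDs. uuid.UUID raises ValueError
    # when the string is not a well-formed UUID.
    parsed = uuid.UUID(job_id.strip())
    return [
        "cpctl", "remediation", "aitk_job",
        "--delete", str(parsed),
        "--token", token,
    ]
```

If your job IDs follow a different format, drop the UUID check and pass the ID through unchanged.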
Sample output:
The following output is an example of the result when you run this command:
cpctl remediation aitk_job --delete <job_ID> --token <token>
Executing playbook aitk_job.yaml
- localhost on hosts: localhost -
Gathering Facts...
localhost ok
failed if no token...
Set CP4S namespace...
localhost ok
get cluster host...
localhost done
host information...
localhost ok: { "changed": false, "msg": "login endpoint: https://some.endpoint:6443" }
login via regular token parameter...
localhost done
check current user...
localhost done | stdout: kube:admin
changing to cp4s project...
localhost done | stdout: Already on project "cp4s" on server "https://some.endpoint:6443".
Checking AITK...
Running the playbook...
localhost done | stdout: [INFO] 2021-03-04T02:46:50 Starting aitk_job with argument: c1711d70-d144-4baa-a6f2-75acf0a37b86
[INFO] 2021-03-04T02:46:53 Deleting service...
service "isc-udi-analyzer-c1711d70-d144-4baa-a6f2-75acf0a37b86" deleted
[INFO] 2021-03-04T02:46:54 Deleting job...
job.batch "isc-udi-analyzer-c1711d70-d144-4baa-a6f2-75acf0a37b86" deleted
[INFO] 2021-03-04T02:46:55 Deleting job data by id: c1711d70-d144-4baa-a6f2-75acf0a37b86, in namespace: cp4s...
Deleting task secret: iscmxk4q...
Deleting task configmap: isc6szl9...
Deleting redis key: c1711d70-d144-4baa-a6f2-75acf0a37b86.json...
Deleting etcd config all: c1711d70-d144-4baa-a6f2-75acf0a37b86...
Deleting redis task: c1711d70-d144-4baa-a6f2-75acf0a37b86...
Done
Print Knowledge Center Runbook location...
localhost ok: { "changed": false, }
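To separate real errors from expected ones in a run log like this, you can split the log into its tagged entries. The following Python sketch is one way to do that. It assumes that each entry has the form [INFO] or [ERROR], then a timestamp, then the message, as in the sample output. It also assumes that expected errors contain the words "not found"; the exact wording of those messages is not shown in this topic, so adjust the match to your logs.

```python
import re

# One entry is "[LEVEL] timestamp message"; the message runs up to the
# next tagged entry or the end of the log, so it works on wrapped or
# single-line logs alike.
ENTRY = re.compile(
    r"\[(INFO|ERROR)\]\s+(\S+)\s+(.*?)(?=\s*\[(?:INFO|ERROR)\]|\Z)",
    re.DOTALL,
)


def log_entries(run_log: str) -> list[tuple[str, str, str]]:
    """Return (level, timestamp, message) for each tagged log entry."""
    return [
        (m.group(1), m.group(2), " ".join(m.group(3).split()))
        for m in ENTRY.finditer(run_log)
    ]


def error_entries(run_log: str) -> list[tuple[str, str, str]]:
    """Entries tagged [ERROR]: these are unexpected failures."""
    return [e for e in log_entries(run_log) if e[0] == "ERROR"]


def expected_not_found(run_log: str) -> list[tuple[str, str, str]]:
    """Entries tagged [INFO] that mention 'not found' (assumed wording)."""
    return [
        e for e in log_entries(run_log)
        if e[0] == "INFO" and "not found" in e[2].lower()
    ]
```

Because the last tagged entry extends to the end of the log, its message also includes any untagged playbook output that follows it.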
User response:
To confirm that the response time is reasonable after problematic jobs are deleted, make any Analytics Toolkit API call.
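One way to make "reasonable" measurable is to time the call against a limit that you choose. The following Python sketch is a minimal example of that idea. The base URL, the bearer-token header, and the use of the bare /api/atk path are all assumptions for illustration, and the time limit is yours to set; none of them come from the product documentation.

```python
import time
import urllib.request
from typing import Callable


def timed(call: Callable[[], object]) -> float:
    """Return the wall-clock seconds that one invocation of call takes."""
    start = time.perf_counter()
    call()
    return time.perf_counter() - start


def responds_in_time(call: Callable[[], object], limit_seconds: float) -> bool:
    """True if call returns within the limit.

    False if it is slower or raises OSError, which covers timeouts,
    connection failures, and HTTP error statuses raised by urllib.
    """
    try:
        return timed(call) <= limit_seconds
    except OSError:
        return False


def get_atk(base_url: str, token: str, timeout: float) -> int:
    """Hypothetical GET against /api/atk; URL and auth header are assumed."""
    request = urllib.request.Request(
        f"{base_url}/api/atk",
        headers={"Authorization": f"Bearer {token}"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

For example, responds_in_time(lambda: get_atk(base_url, token, 30.0), 5.0) reports whether the call completed within five seconds, where base_url and token are values that you supply.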