End-to-end system performance root cause analysis

This tutorial guides you through the end-to-end scenario about how to perform root cause analysis by analyzing data from multiple data providers. In this tutorial, you will perform root cause analysis of a performance defect that is found in the zMobile Health Care application

Prerequisite

Before you begin this tutorial, you need to perform the following activities.
Note: Since the data in this tutorial is the generated sample data, you might not see the exact same data on your reports.

Trish is a test lead who is testing the response time of the Inquire Patient feature of the zMobile Health Care application. During the testing, Trish found that in the latest development build, the response time of the Inquire Patient feature increased to 2 minutes from less than a second in the previous build. She reported it to you, the development lead. And you investigate the issues by consulting dashboards and reports in ADI.

As a development lead, you need to complete the following steps to perform the root cause analysis for the performance issue:
  1. Navigate your browser to IBM ADDI Extension home page https://healthcare.example.com:9753/addi/web/workbook.
  2. Log in with the following credentials. After you log in, the Workbooks page is displayed.
    • Email address: adiadmin@healthcare.example.com
    • Password: adiadmin

    You can see the dashboard that summarizes the overall status of zMobile Health Care application. You can notice the problem that Trish reported right away. The average response time of zMobile Health application is almost 7 seconds. It is higher than the service level agreement for your company. And you can notice that the code coverage percentage is also in the warning area.

    Figure shows the dashboard that summarizes the overall status of zMobile Health Care application.
  3. Click the name of zMobile Health Care to view detailed analysis. The zMobile Health Care page loads with summary charts of the contributing providers.
  4. View the reports on the workbook summary view of zMobile Health Care page. The reports are organized into 3 tabs: Code Coverage, Static Analysis, and Performance. All the reports are analyzed based on three data providers that are associated with this analysis workbook:
    • A Manual Builds data provider that collects the code coverage results.
    • An Application Discovery data provider that collects static analysis data about transactions, programs and project-level metrics in the zMobile Health Care application.
    • An OMEGAMON for CICS data provider that monitors performance of the zMobile Health Care application CICS transactions.
    With reports from multiple providers in one workbook, you can correlate different data from different sources.
    Figure shows the Static Analysis tab of zMobile Health Care workbook summary.
  5. Select Static Analysis tab and Performance tab to navigate to static analysis reports and performance reports respectively.
    Figure shows the Static Analysis tab of zMobile Health Care workbook summary.
    Figure shows the Performance tab of zMobile Health Care workbook summary.

    On the Performance tab, you can see the detailed analysis of program and transactions within the zMobile project that displays along with the performance reports of HTRANS.

    HTRANS is a service class in OMEGAMON for CICS data provider. Service class is a concept that OMEGAMON for CICS uses to allow grouping related transactions together. If no service class is defined, OMEGAMON for CICS would define a default service class.

    In the Demo OMEGAMON provider: HTRANS section, you can investigate the performance issue that Trish reported. Besides the average response time that is in the red zone, you notice that on the Transactions with Performance and Reliability Issues table, HCP1 transaction that is run by the *Inquire Patient* feature is flagged as exceeding the goal response time threshold by more than 400%. Average execution count of HCP1 is also much higher than the other transactions. Then you think that this transaction probably causes the application to slow down.
    Note: Goal response time threshold is the setting in OMEGAMON for CICS to indicate the severity of response time that exceeds the standard service level agreement that the company requires. The different warning icons on the Transactions with Performance and Reliability Issues table indicate different levels of severity. For example, the transparent red warning icon indicates that the response time exceeds threshold by 400%, and the opaque red warning icon indicates that the response time exceeds threshold by more than 400%.
  6. Click the HCP1 transaction on the Transactions with Performance and Reliability Issues table to view the detailed transaction analysis. You now see the history of the average response time comparing to Average DB2 and File I/O wait time of response time, average CPU time, and execution count.

    On the header of the report, the maximum number of the execution count, CPU time, and response time for a given timeframe display on this view.

    From the chart, you notice that the average response time was constantly rising high for the past 3 days from 2 seconds to 26 seconds. The same pattern occurs to DB2 wait time. DB2 wait time went up from about 20 - 30% to over 70% which is higher than the suggested level 50% that is indicated by the red line on the Average DB2 and File I/O Wait Time of Response Time.

    You notice that the execution is always at about almost 200 which means this transaction has been executed frequently. You suspect that there probably are some changes made to the program within this transaction 3 days ago.

    Figure shows the detailed transaction analysis for HCP1 transaction.
    (Continued)Figure shows the detailed transaction analysis for HCP1 transaction.
  7. Hover over the chart on the area where the response time exceeds the threshold and the average DB2 time is high. You can see response time value and percent DB2 wait time value.
    Figure shows the response time value that appears when you hover over the chart.
    (Continued)Figure shows the response time value that appears when you hover over the chart.
  8. Click OK to do further analysis.
  9. Click Artifact Composition button on the upper right corner to go to the Artifact Composition view.
    Figure shows the Artifact Composition button.
    Figure shows the detailed Artifact Composition analysis.
    Artifact Composition analysis is the analysis of key program and transaction metrics collected from Application Discovery. This view shows Artifact Composition graph which is the connected graph of all artifacts (transactions, programs, and database tables) calling in a transaction. The arrow direction represents the calling direction.
  10. Select the Warning icon ( ) on the upper right of the header area to display the risk areas.
    Figure shows the risk areas.
    You notice the warning icons on top of the artifacts. The blue warning icon indicates that there could be some changes to the artifacts. Because there are changes to artifact metrics during the interval time that you have set when you set up an analysis collection in the previous tutorial. In this case, it is 7 days which is between October 13th and October 28, 2020. For ADDI Extension, the changed artifact is indicated by one of the following method.
    • Check if there is the LASTUPDATED metric that is collected from IBM Application Discovery.
    • Check if the Maintainability Index and Number of Source Line of Code have been updated between the change interval time you have defined.

    The red warning icon indicates that the artifact is risky to modify. The Risk Areas dialog box on top of the chart explains the meaning of blue and red icons with the threshold setting for key metrics.

  11. You notice the warning icons on top of the artifacts. The blue warning icon indicates that there could be some changes to the artifacts. Because there are changes to artifact metrics during the interval time that you have set when you set up an analysis collection in the previous tutorial. In this case, it is 7 days which is between February 28th and March 7, 2018. For ADI, the changed artifact is indicated by one of the following method.
    • Check if there is the LASTUPDATED metric that is collected from IBM Application Discovery.
    • Check if the Maintainability Index and Number of Source Line of Code have been updated between the change interval time you have defined.

    The red warning icon indicates that the artifact is risky to modify. The Risk Areas dialog box on top of the chart explains the meaning of blue and red icons with the threshold setting for key metrics.

  12. Click X on top right of the Risk Areas dialog to close the dialog box. The Risk Areas dialog disappears along with the warning icons.
  13. Click the Filters drop-down list and select DB2. The Artifact Composition graph is updated to show only artifacts that are connected to DB2.
    Figure shows the drop-down box of Filters.
    Figure shows the Artifact Composition graph after the DB2 filter is applied.
  14. Click the Risk Areas icon () on the top menu. The Risk Areas dialogue displays with the warning icons on the Artifact Composition graph.

    You can see the blue warning icon on HCIPDB01. This means that there could be some changes made to HCIPDB01 in the past week.

  15. Close the Risk Areas dialogue and click HCIPDB01 box. The key metrics collected from Application Discovery for this artifact are displayed. You can see that the Maintainability Index is in red area but it is not your concern at this point.

    You can notice that the code coverage is 82%. When you hover over the question mark on the Test Code Coverage metric for HCIPDB01, you see that the date of the last test code coverage data is a week from today, which is before the date when the program was changed last time. This means that it has not been properly tested.

    Figure shows the key metrics collected from Application Discovery for HCIPDB01.
  16. Click View Trends link next to the Metrics header. The Trends page for HCIPDB01.cbl appears.
    Figure shows the Trends page of HCIPDB01.
  17. Click Daily to change the view from Weekly view to Daily view. Observe the Trends page that shows the trend lines of Maintainability Index and Source Lines of Code. You notice that Source Lines of Code went up about 4 days ago.
    Figure shows the page of Trends.
  18. Select the drop-down list of Maintainability Index and select Total Statements. You notice that both metrics have the same trending. About 40 statements are added to HCIPDB01.cbl.
    Figure shows the page of Trends on which the metrics of Maintainability Index and Total Statements have the same trending.

    Now you as the development lead conclude that the HCIPDB01 should be the area of concern. You discuss with Jane, the developer. You two investigate further into the code and find that an extra SQL Join statement contributes to the excessive wait time. Before making modifications, you and Jane go back to ADDI Extension.

    Figure shows the page of HCIPDB01.
  19. Select Artifact Composition link on the breadcrumb to go back to the Artifact Composition table.
  20. Click the Table () icon on the top header to update the view to table view.
    Figure shows the Table view.
    The Artifact Composition table shows the list of all program artifacts within a transaction. The table is sorted based on the analysis of high risk to modify. The rank of risk to modify is calculated based on combination of the following criteria:
    • Whether the artifact is in risk areas: Risk areas are based on the thresholds setting.
    • Whether the artifact is an outlier: Outlier artifacts are analyzed based on the outlier detection algorithm leveraging Mahalanobis distance of each artifact to the center of the distribution. Artifacts with large Mahalanobis distances mean that they are outlier or abnormal because they are very far away from majorities.
    • Metrics values sorting: Given the same risk area and outlier status, artifacts are finally sorted by the values of Maintainability Index, Unreachable Code, and Cyclomatic Complexity.
  21. On the Artifact Composition table, you and Jane notice that HCIPDB01 is at medium risk to modify. Another program HCAZERRS is at high risk. Hence making another changes would not make much impact to the zMobile Health Care application.
    Figure shows the table of Artifact Composition.
    You can click on the Risk Areas icon () to update the threshold settings and see the effect of the changes. For example, move the lower slider of Maintainability Index threshold to 40%. You now see that the warning icons of the programs with Maintainability Index higher than 40% are changed from red warning icons to transparent yellow warning icons.
    Figure shows the Artifact Composition view on which the slider of Maintainability Index threshold is set to 40%.
  22. Click the Graph View () icon on the top right menu to go back to the Artifact Composition graph view.
  23. Click on the HCIPDB01 box on Artifact Composition graph view to only shows the relationship of artifacts related to HCP1BA01.
  24. Click Close (X) icon to close the information dialog box to get the full view of the Artifact Composition graph.
    Figure shows the full view of Artifact Composition graph.

    You and Jane notice that HCIPDB01 has dependency with HCAZERRS which has high risk to modify. These two programs have a very strong impact. Jane needs to test both module very well to make sure that after fixing the SQL problems, her changes to HCIPDB01 would not cause any regressions to HCAZERRS.

  25. Click the Heat Map drop-down list on the top and select Maintainability Index.
    Figure shows the Heat Map drop-down list.
    The nodes in the graph are displayed with color. The color indicates the threshold status of metric that you have selected. For example, HCIPDB01 and HCAZERRS both are in red, which indicates that their maintainability index is in an unacceptable value. You need to pay close attention when you modify the code.
  26. Click the Close (X) icon to close the HCIPDB01 information dialog box to get full view of the graph.
    Figure shows the full view after you click the Close (X) icon.
  27. Click the Heat Map drop-down list again and select Code Coverage – Health Care Manual Builds.
  28. Click the Close (X) icon to close the HCIPDB01 information dialog box to get full view of the graph.
    Figure shows the updated full view after you update the Heat Map selections.

    The graph updates with the colored nodes that indicate the code coverage status of each node in the graph. The nodes without color indicate that no metric data is associated with those nodes.

    You can see that the code coverage status of HCAZERRS is in poor (yellow) and HCP1BI01 is in insufficient (red) areas. Both have dependency with HCIPDB01. This means that after making changes, the development team needs to make sure that both programs are fully tested.

Jane then performs the fix as discussed with you. In the next build after the fixing, Trish reports that the problem with the "Inquire Patient" feature is fixed and it again takes only 2 seconds to respond.

You have seen how you can correlate performance data from OMEGAMON for CICS and static analysis data from Application Discovery to perform the root cause analysis of the performance issues. IBM ADDI Extension facilities you to identify the potential programs that cause the issues without going through the entire artifacts within transactions. With Application Discovery data, it also recommends the Risk Areas where you need to take extra care when you modify them. You can see the dependency of artifacts to ensure that there would be no regressions in the future builds.