Enabling monitoring views for deployed integration runtimes on Red Hat OpenShift

If you are running on Red Hat OpenShift with a Prometheus stack configured, you can enable monitoring views to display CPU and memory usage metrics for your integration runtimes, and flow runs and latency metrics for your deployed integrations. This data shows how much CPU and memory your integration runtimes are using and how long your integrations take to run. The data is displayed on the Monitor page in the App Connect Dashboard.

Availability: Monitoring views can be enabled only for App Connect Dashboard instances with a spec.version value that resolves to 13.0.1.0-r2 or later.

Before you begin

To enable monitoring views for your integration runtimes, the following prerequisites must be met:

  • Ensure that a Prometheus stack is configured in your Red Hat OpenShift cluster.

    Red Hat OpenShift provides a preconfigured monitoring stack that is based on the Prometheus open source project, and which you can use to monitor the core platform components and to enable monitoring of user-defined projects. For more information, see Enabling the OpenShift monitoring stack. See also Configuring core platform monitoring in the Red Hat OpenShift documentation.

  • Ensure that you have cluster administrator authority or have been granted the appropriate role-based access control (RBAC).
  • Ensure that the required command-line interface (CLI) tools are installed on your computer to enable you to use the CLI to log in to your cluster and run commands to create and manage your IBM® App Connect resources. For more information, see Installing the command-line tools for your cluster.
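If monitoring of user-defined projects is not yet enabled in your cluster, a cluster administrator can typically enable it by setting enableUserWorkload: true in the cluster-monitoring-config ConfigMap, as described in the Red Hat OpenShift documentation. The following fragment is a sketch of the commonly documented form of that ConfigMap; confirm the details against the documentation for your OpenShift version:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data:
  config.yaml: |
    # Enables Prometheus monitoring of user-defined projects
    enableUserWorkload: true
```

After the ConfigMap is applied (for example, with oc apply -f), pods for the user workload monitoring components start in the openshift-user-workload-monitoring project.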

About this task

To enable monitoring views for your integration runtimes, you need to first grant permissions that enable your App Connect Dashboard instance to access Prometheus. You can then use the Monitor page in your Dashboard to view monitoring data for your deployed integration runtimes and their underlying containers and integrations.

Note: If you try to access the Monitor page in your Dashboard before granting the requisite cluster-wide permissions for monitoring views, you see a message that informs you that the setup steps to enable monitoring haven't been completed in the cluster. No monitoring data is shown for your integration runtimes and integrations until this setup is complete.
Monitor page message indicating that setup steps need to be completed

The Monitor page presents data in two separate views (or tabs).

Runtimes

The Runtimes view of the Monitor page shows CPU and memory usage data for your integration runtimes and the containers that are created to support the deployed integrations in the runtime pods. These containers are specific to the integration type:

  • The runtime container is deployed to provide runtime support for Toolkit integrations or Designer integrations.
  • The designerflows container is deployed to support API flows in Designer integrations. This container also hosts connectors for event-driven and API flows.
  • The designereventflows container and an accompanying proxy container are deployed to support event-driven flows in Designer integrations.

If you request multiple replica pods while creating an integration runtime, each replica also has its own containers.

The Runtimes view provides insight into the largest and smallest consumers of CPU and memory. Container metrics for the top five integration runtimes with the highest or lowest CPU and memory usage are presented in graphs, and container metrics for all your integration runtimes are also displayed in a table. You can filter the type of data that you see, select a time period for which you want to collect data, and drill down to more detailed information about your resources. The Runtimes view can be useful for identifying whether any container is nearing its limits or is under-resourced, and for determining whether you need to take remedial action; for example, by adjusting the relevant spec.template.spec.containers[].resources.* settings for your integration runtimes. The data can also help you identify memory leaks in certain flows that are running, or identify whether an integration runtime that should be executing tasks isn't currently processing any data.
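For example, if a container is repeatedly nearing its CPU or memory limit, you might increase the limits in your integration runtime's custom resource. The following fragment is an illustrative sketch only (the name, values, and container entries are hypothetical and depend on your deployment); it follows the spec.template.spec.containers[].resources.* structure that the IBM App Connect Operator uses for IntegrationRuntime resources:

```yaml
apiVersion: appconnect.ibm.com/v1beta1
kind: IntegrationRuntime
metadata:
  name: my-integration-runtime   # hypothetical name
spec:
  template:
    spec:
      containers:
        # Adjust the requests and limits for the runtime container
        - name: runtime
          resources:
            requests:
              cpu: 300m
              memory: 368Mi
            limits:
              cpu: "1"
              memory: 512Mi
```

Apply the change with oc apply -f, and then use the Runtimes view to confirm that usage stays within the new limits.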

Sample data in the Runtimes tab on the Monitor page
Integrations

The Integrations view of the Monitor page enables you to monitor the integrations that are deployed to your integration runtimes. This view can be useful for identifying your busiest integrations and can help you determine where you might want to allocate more resources, perhaps during an influx of traffic. The Integrations view displays the following data for your deployed integrations:

  • Highest or lowest total flow runs

    A flow run occurs whenever part of a flow is triggered. For example, in the following Designer flow for a deployed integration, a Salesforce New case event triggers the flow to complete three actions. Each time a new case is created in Salesforce, it triggers the flow to complete one or more of its actions, which counts as one flow run.

    A Designer flow that contains a Salesforce "New case" event that triggers three actions
  • Highest or lowest average latency

    Latency is the time that elapses between when a flow is triggered and when it completes processing.

In the Integrations view, metrics for the top five integrations with either the highest or lowest total flow runs and average latencies are presented in graphs. Metrics for all your integrations are also displayed in a table. You can filter the type of data that you see, select a time period for which you want to collect data, and drill down to more detailed information about your integrations.

Sample data in the Integrations tab on the Monitor page

Complete the following tasks:

  1. Granting the requisite cluster-wide permissions to your App Connect Dashboard
  2. Monitoring data for your runtimes and integrations from the App Connect Dashboard

Granting the requisite cluster-wide permissions to your App Connect Dashboard

To enable monitoring views for the deployed integration runtimes in your App Connect Dashboard instance, you need to first grant additional cluster-wide permissions to the Dashboard's service account in the cluster. These permissions enable access to Prometheus. You grant the permissions by creating a ClusterRoleBinding resource to bind an existing ClusterRole resource that is named cluster-monitoring-view.

Procedure

To create the ClusterRoleBinding resource by using the Red Hat OpenShift CLI, complete the following steps.

  1. From your local computer, create a YAML file with the following details, where:
    • metadata.name is a unique name for your ClusterRoleBinding resource. For example, you can use the format dashboardName-dashboardNamespaceName-cr-monitoring-view.
    • subjects.name is the name of your App Connect Dashboard (for example, db-fd-acecclic) with -dash appended.
    • subjects.namespace is the namespace (or project) where your App Connect Dashboard is deployed.
    kind: ClusterRoleBinding
    apiVersion: rbac.authorization.k8s.io/v1
    metadata:
      name: dashboardName-dashboardNamespaceName-cr-monitoring-view
    subjects:
      - kind: ServiceAccount
        name: dashboardName-dash
        namespace: dashboardNamespaceName
    roleRef:
      apiGroup: rbac.authorization.k8s.io
      kind: ClusterRole
      name: cluster-monitoring-view

    Example:

    kind: ClusterRoleBinding
    apiVersion: rbac.authorization.k8s.io/v1
    metadata:
      name: db-fd-acecclic-ace-fiona-cr-monitoring-view
    subjects:
      - kind: ServiceAccount
        name: db-fd-acecclic-dash
        namespace: ace-fiona
    roleRef:
      apiGroup: rbac.authorization.k8s.io
      kind: ClusterRole
      name: cluster-monitoring-view
  2. Save this file with a .yaml extension; for example, crb_clustermonview_cr.yaml.
  3. From the command line, log in to your Red Hat OpenShift cluster by using the oc login command.
  4. Run the following command to create the ClusterRoleBinding resource. (Use the name of the .yaml file that you created.)
    oc apply -f crb_clustermonview_cr.yaml
    Tip: You can also create the ClusterRoleBinding resource from the Red Hat OpenShift web console by clicking User Management > RoleBindings in the navigation menu, and then clicking Create binding.
    Example of the Cluster Role Binding fields in the Red Hat OpenShift web console
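After you create the ClusterRoleBinding, you can optionally verify it from the CLI. The following commands are a sketch that assumes the example names used earlier (the db-fd-acecclic-dash service account in the ace-fiona namespace); substitute your own resource names:

```shell
# Confirm that the ClusterRoleBinding exists and references the
# cluster-monitoring-view ClusterRole in its roleRef
oc get clusterrolebinding db-fd-acecclic-ace-fiona-cr-monitoring-view -o yaml

# Confirm that the Dashboard's service account exists in the
# namespace where the Dashboard is deployed
oc get serviceaccount db-fd-acecclic-dash -n ace-fiona
```

If the roleRef or subjects in the output don't match the values in your YAML file, delete the binding and re-create it; roleRef is immutable after creation.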

What to do next

Use the Monitor page in the Dashboard to view monitoring data for your runtimes and integrations.

Monitoring data for your runtimes and integrations from the App Connect Dashboard

From your App Connect Dashboard instance, you can track the CPU and memory usage for your runtimes and also view flow runs and latency metrics for the deployed integrations.

Procedure

To monitor data for your runtimes and integrations, complete the following steps:

  1. From the App Connect Dashboard, click Monitor (Monitor icon) in the navigation pane to open the Monitor page.

    CPU and memory usage is shown on the Runtimes tab, and flow runs and latency metrics are shown on the Integrations tab.

    • By default, data is shown for the last hour, but you can use the time period selector to select a different time period that falls within the last 24 hours. You might find it helpful to adjust the time period to see whether you can gain useful insights about the data that you are monitoring. For example, if you notice that the CPU usage for a container is very high, while memory usage seems normal, switching from the last hour to the last three hours might help you spot patterns that account for the CPU spike.
      List of time periods in the time period selector
    • To see the time when the data was last updated, hover over the Last updated value.
      Hover over the "Last updated" value to display a tooltip that shows when the data was last updated
    • To update the data, click Refresh (Refresh icon).
  2. To monitor CPU and memory usage, ensure that the Runtimes tab is selected.

    CPU and memory metrics are separately presented for the top five integration runtimes with the highest or lowest CPU usage, and the top five integration runtimes with the highest or lowest memory usage. (You can choose whether to view the highest or lowest usage as described later.) The CPU usage and Memory usage graphs show usage percentages per container in each integration runtime. This data pertains to your selected time range. The Highest xxx usage in selected time range or Lowest xxx usage in selected time range lists, which are adjacent to each graph, identify which container in each integration runtime uses either the highest or lowest percentage of CPU or memory. (If you have an App Connect Designer instance in the same namespace as your Dashboard, you might see metrics for the internal integration runtime (named designerName-designer), which is automatically provisioned for the Designer instance to support the built-in test facility for API flows.)

    The following image shows sample data for the CPU usage and Memory usage graphs.

    CPU usage and Memory usage graphs on the Runtimes tab
    Note:
    • Data is shown for integration runtimes if their containers started successfully within the selected time period. Therefore, you might also see data for deleted runtimes if they ran during the selected time period.
    • If an integration runtime has replicas, the summary data includes the total usage for all containers across the runtime and its replicas. You might notice missing or inaccurate values, particularly when your runtime has multiple replicas. When you use replicas for high availability, monitoring data can be missed when a different replica starts. If you believe that data is missing, check the logs for errors in your flows and monitor the CPU and memory data for the runtime.
    • Depending on the workload, CPU usage for a container can briefly exceed the usage limit, which results in a usage value over 100%. In this situation, CPU throttling can occur (to limit the CPU that the container can use) until CPU usage falls back within the limit. Throttling can affect performance and runtime stability. Therefore, make sure that you configure your integration runtime with sufficient resources for your workload.
    • The highest and lowest usage values that are shown for containers might vary depending on the time period that you select because data is aggregated at different intervals for each time period.
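The over-100% case that is described in the preceding note is simple arithmetic: the usage percentage is the cores used divided by the core limit, multiplied by 100. As a sketch with hypothetical values, a container that momentarily uses 0.55 cores against a 0.5-core limit reports 110%:

```shell
# CPU usage percentage = (cores used / core limit) * 100
# 0.55 cores against a 0.5-core limit briefly exceeds the limit
awk 'BEGIN { printf "%.0f%%\n", (0.55 / 0.5) * 100 }'
# prints 110%
```

A sustained value over 100% suggests that throttling is likely and that the CPU limit for the container should be raised.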

    To analyze the metrics, complete the following steps:

    1. Drill down into the data from the graphs in the following ways.
      • By default, the graphs show the container metrics for the top five integration runtimes with the highest CPU usage or memory usage, but you can also use the usage selector for each graph to show metrics for the lowest CPU usage or memory usage.
        "Usage" selector for selecting the highest or lowest usage
      • By default, the graphs show data for all container types that are deployed for the integration runtimes. To show the data for only one type of container, click the label for the container in the legend. You can click that label again to show data for all container types.
        Clicking the label for a container in the legend
      • To see data about container usage, hover over a bar in the graph. You can see the percentage usage of the container, the highest CPU or memory usage (in cores or MB), and the CPU or memory usage limit (in cores or MB).
        Hovering over a bar in the graph to see the data about container usage in a tooltip
    2. Scroll down to the table that lists metrics for the containers in all of your integration runtimes, including those that are not shown in the graphs. For each container in an integration runtime, the maximum CPU usage and maximum memory usage are shown.
      • You can click the headers in the table to sort by resource name or maximum resource usage.
      • If more than 10 items (or rows) are displayed in the table, you can use the Items per page selector to choose how many entries (up to a maximum of 400) to show on each page of the table.
        Depiction of the items per page selector
      • To view more detailed usage data for any container, expand the row for that container.

        For CPU usage, you can see the maximum number of cores that the container used during the selected time period. You can also see the lowest number of cores and the average usage during the time period. For memory usage, you can see the maximum, lowest, and average usage in MB for the time period.

      The screenshot shows the table of containers on the Runtimes tab and shows that you can expand each row to see more data.
    3. View detailed usage metrics for the containers in a single integration runtime, and for the selected time period.
      1. Click the name of an integration runtime in the usage list that is adjacent to each graph, or click the name of an integration runtime in the table.
        Depicts the name of an integration runtime in the "usage" list or in the table

        The first two graphs (labeled CPU usage and Memory usage) show the CPU usage in cores and memory usage in MB across all containers in the integration runtime, at different points in time. The Values for selected time range lists show the highest usage, usage limit, maximum percentage, and average usage. Next, you see graphs that display flow runs and latency data for the integrations that are running in your selected integration runtime. Finally, you see a table that lists the total flow runs and average latency for each integration.

        In the following example, the selected integration runtime contains a single runtime container, as depicted in the CPU usage and Memory usage graphs.
        Depicts detailed usage data within graphs for the containers in a runtime for a selected time period
      2. To see the CPU usage or memory usage data for an individual container, click the name of the container in the legend of the CPU usage or Memory usage graph. The following image shows an example of a runtime with multiple containers.
        Graphs that depict the CPU and memory usage across multiple containers

      3. To return to the data for all integration runtimes, click Monitor in the breadcrumbs in the page header.
  3. To monitor flow runs and latency metrics, ensure that the Integrations tab is selected.

    Flow runs and latency metrics are separately presented for the top five integrations with the highest or lowest values. (You can choose whether to view the highest or lowest values as described later.) The Flow runs graph plots the data points for the top five integrations with either the highest or lowest flow runs in the selected time range. The Latency graph plots the data points for the top five integrations with either the highest or lowest latency (in milliseconds) in the selected time range. The Totals for selected time range list, which is adjacent to the Flow runs graph, identifies which deployed integration (for an integration runtime) has either the highest or lowest total flow runs. The Highest average latencies for selected time range or Lowest average latencies for selected time range list, which is adjacent to the Latency graph, identifies which deployed integration (for an integration runtime) has either the highest or lowest average latency. (If you have an App Connect Designer instance in the same namespace as your Dashboard, you might see metrics for started or invoked flows in your Designer instance. You might also see metrics for a JDBC Toolkit flow, which internally provides support for the JDBC connector in Designer, and is deployed as an integration (named jdbcconnector) in the pod that contains the Designer runtime container.)

    The following image shows sample data for the Flow runs and Latency graphs.

    Flow runs and latency graphs on the Integrations tab
    Note:
    • When many flow runs occur in a short time, data points might be combined on the Flow runs graph. For example, if 5 flow runs occurred within 5 minutes, one data point on the graph can represent those 5 flow runs. The average latency is calculated by dividing the total latency for all 5 flow runs by the number of flow runs.
    • The latency values in the Latency graph and in the table might not match because the averages are calculated in different ways. On the graph, average latency is calculated by dividing the total latency by the total number of flow runs at each point in time. The table shows each individual latency value that contributes to the total latency for that time period.
    • Some data for integrations might not be available or as expected in the following circumstances:
      • Metrics are not available for the first node or connector in a batch process.
      • Latency values are not shown for the events that trigger event-driven flows or Designer integrations. The time that elapses between when the event occurs in the source application and when App Connect receives the event is unknown.
      • Metrics might not be available for a flow when it is triggered immediately after the runtime where it is deployed starts. Data is gathered at 30-second intervals. If a flow is triggered before the first data is gathered, no metrics are available to show.
      • You might see some data up to 90 seconds later than you expect because of the frequency of data collection.
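The averaging that the preceding notes describe can be sketched with a few sample values. Here, five flow runs that fall into one aggregation interval are combined into a single data point whose average latency is the total latency divided by the number of runs (the latency values are hypothetical):

```shell
# Five flow runs aggregated into one data point; latencies in milliseconds
echo "120 250 310 95 725" | awk '{
  total = 0
  for (i = 1; i <= NF; i++) total += $i
  printf "%d flow runs, average latency %.0f ms\n", NF, total / NF
}'
# prints: 5 flow runs, average latency 300 ms
```

Note that a single slow run (725 ms here) can dominate the average for an interval, which is why drilling down to the per-integration view can reveal spikes that the aggregate hides.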

    To analyze the metrics, complete the following steps:

    1. Drill down into the data from the graphs in the following ways.
      • By default, the metrics relate to the top five integrations with the highest total flow runs or average latency, but you can also use the selector for each graph to show the lowest total flow runs or average latency.
        Selector for selecting the highest or lowest values
      • By default, the graphs show data for the top five integrations with the highest or lowest values. To show the data for only one integration, click the label for the integration in the legend. You can click that label again to show data for all the integrations.
        Clicking the label for an integration in the legend
      • To see data for an integration at a particular time, hover over that data point in the graph. You can see the number of flow runs or the latency (in milliseconds) for that flow.
        Hovering over a data point in the graph to see the number of flow runs or the latency at a particular time

    2. Scroll down to the table that lists metrics for all of your integrations, including those that are not shown in the graphs. For each deployed integration in an integration runtime, or each started or invoked flow in your Designer instance, the total flow runs and average latency are shown.
      • You can click the headers in the table to sort by integration name, total flow runs, or average latency.
      • If more than 10 items (or rows) are displayed in the table, you can use the Items per page selector to choose how many entries (up to a maximum of 400) to show on each page of the table.
        Depiction of the items per page selector
      • To view more detailed data for any integration, expand the row for that integration. You can see the name of the integration runtime (or Designer instance) to which the integration (or flow) is deployed, and also see the maximum, lowest, and average latency in milliseconds.
      Table that lists the total number of flow runs and average latency on the Integrations tab
    3. View detailed flow-runs and latency metrics for a single integration (or flow), and for the selected time period.
      1. Click the name of an integration (or flow) in the flow-runs or latency list that is adjacent to each graph, or click the name of an integration (or flow) in the table.
        Depicts the name of an integration in the "flow-runs" or "latency" list, or in the table

        Two graphs (labeled Flow runs and Latency) show the number of flow runs and the latency for that integration (or flow) at different points in time. The Values for selected time range lists show the total flow runs, and the average, minimum, and maximum latency for the integration (or flow). Next, you see a table that lists the nodes or connectors in the integration (or flow), the total number of times that they were started (triggered to complete an action), and the average latency.

        Two graphs show the number of flow runs and the latency values for an integration during a selected time period. A table also lists the nodes in the flow and how often each node was invoked during the time period.
      2. To return to the data for all integrations, click Monitor in the breadcrumbs in the page header.