Cluster-level Monitoring

Cluster-level monitoring verifies service availability, that is, whether webMethods API Gateway is accessible and functional (able to serve API requests). Through cluster-level monitoring, you can check:
  • Whether the runtime is available and ready to serve traffic.
  • Whether the API Gateway administrator console is accessible.

How do I monitor the cluster health of webMethods API Gateway?

You can set up the Readiness Probe, Runtime Service Health Probe, and Administration Service Health Probe to monitor the cluster health.
Requirement | Type of Impact | Solution
For API Gateway, is there an endpoint that returns yes or no about its service availability, that is, readiness for serving the incoming API requests? | Business Impact. To know if there is an outage in API Gateway. | Use Readiness Probe.
For API Gateway, is there an endpoint that indicates the availability of the administrator user consoles? | Operational Impact. To know if the administrator user console is available. | Use Administration Service Health Probe.
For API Gateway, is there an endpoint that indicates the cluster health and its details? | Technical Impact. To know the details about where the fault lies when there is a cluster failure. | Use Runtime Service Health Probe.

How do the probes help in cluster-level monitoring?

Question | Readiness Probe | Runtime Service Health Probe | Administration Service Health Probe
What is it? | Indicates if the traffic-serving port of webMethods API Gateway is ready to accept requests. | Reports on the overall cluster health and indicates if the components of webMethods API Gateway are in an operational state. | Indicates if the webMethods API Gateway administrator console is available and accessible.
When is it used? | To continuously check and report on the service availability of webMethods API Gateway. | To continuously report on the cluster health with the details of the components involved in clustering. | To continuously report on the availability of the administrator console and API analytics.
Note: The points in the table are also applicable to scenarios where the cluster health is NOT OK, for example, API Data Store or Terracotta failure. Such scenarios do not always mean an outage. webMethods API Gateway may still be able to process the requests.

How do I set up probes?

Prerequisites:
  • You must have valid webMethods API Gateway user credentials to use the Readiness Probe, Runtime Service Health Probe, and Administration Service Health Probe.
  • All cluster-level probes must be set up to target the webMethods API Gateway load balancer endpoint.
  • IBM recommends setting up a dedicated port for monitoring with an appropriate private thread pool.
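
The prerequisites above can be sketched in Python as a helper that builds an authenticated probe request against the load balancer. This is a minimal sketch, not a documented client: the host name apigw-lb.example.com, the port 5566, the user monitor, and the use of HTTP Basic authentication are all assumptions made for illustration. Substitute the address, monitoring port, and authentication scheme of your own deployment.

```python
import base64
import urllib.request


def build_probe_request(base_url: str, path: str, user: str, password: str) -> urllib.request.Request:
    """Build a GET request for a probe endpoint with a Basic auth header (assumed scheme)."""
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    return urllib.request.Request(
        url=base_url.rstrip("/") + path,
        headers={"Authorization": f"Basic {token}", "Accept": "application/json"},
        method="GET",
    )


# Hypothetical load balancer address, dedicated monitoring port, and credentials.
request = build_probe_request(
    "https://apigw-lb.example.com:5566", "/rest/apigateway/health", "monitor", "secret"
)
print(request.get_method(), request.full_url)
```

Building the request separately from sending it keeps the target (load balancer and monitoring port) in one place, so the same helper can serve all three probes.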

Readiness Probe at Cluster-Level

To monitor the readiness of webMethods API Gateway, that is, to check whether webMethods API Gateway is ready to accept requests, use the following REST endpoint:

GET /rest/apigateway/health

The following table shows the response code and the description.

Response | Description
200 OK | Readiness check is successful. The Readiness Probe continues to reply OK as long as webMethods API Gateway remains in an operational state to serve requests.
500 Internal server error | Readiness check failed and denotes a problem.
Timeout or no response (the request did not reach the probe) | Several factors can delay the Readiness Probe when it initiates, which may result in timeout errors. For the reasons, see Causes for timeout errors.

Note: Because only the response status code matters for a Readiness Probe, by design the response contains no JSON payload in either the success or the failure scenario.
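
The response table above can be turned into a small status classifier. The following Python sketch is illustrative only: the function names, the verdict strings, and the 5-second timeout are assumptions, and any status code other than 200 or 500 is reported as unexpected because the table does not describe it.

```python
import urllib.error
import urllib.request
from typing import Optional


def classify_readiness(status_code: Optional[int]) -> str:
    """Map a probe outcome to a verdict; None means no HTTP response arrived."""
    if status_code is None:
        return "unreachable"  # timeout, or the request never reached the probe
    if status_code == 200:
        return "ready"
    if status_code == 500:
        return "not ready"
    return "unexpected"  # status code not covered by the table above


def probe_status(request: urllib.request.Request, timeout_s: float = 5.0) -> Optional[int]:
    """Send the request and return the HTTP status code, or None if no response arrived."""
    try:
        with urllib.request.urlopen(request, timeout=timeout_s) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code  # for example 500 when the readiness check fails
    except OSError:
        return None  # URLError and socket timeouts are OSError subclasses
```

A monitoring job could call classify_readiness(probe_status(request)) at a fixed interval. Only the status code is inspected, which matches the probe returning no JSON payload.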

Runtime Service Health Probe at Cluster-Level

To monitor the runtime service health of webMethods API Gateway, that is, to check the cluster health of webMethods API Gateway, use the following REST endpoint:

GET /rest/apigateway/health/engine

The following table shows the response code and the description.

Response | Description
200 OK | Runtime service health check is successful. The response is 200 OK when all APIs are activated after a startup.
500 Internal server error | Runtime service health check failed and denotes a problem. The response JSON indicates the problem.
Timeout or no response (the request did not reach the probe) | Several factors can delay the Runtime Service Health Probe when it initiates, which may result in timeout errors. For the reasons, see Causes for timeout errors.

The response JSON of each health check request contains a status field.

The overall status of the webMethods API Gateway cluster can be green, yellow, or red.

Status | Description
green | Indicates that the cluster is in a healthy state.
yellow | Indicates that webMethods API Gateway does not have adequate resources to run.
red | Indicates a cluster failure and an outage.

The overall status of webMethods API Gateway cluster is assessed based on the API Data Store status, webMethods API Gateway resource status, and the cluster status within nodes.

API Data Store status

Status | Description
green | Indicates that API Data Store is in a healthy state. When the status of API Data Store signals green or yellow, the overall status of API Gateway is green.
red | Indicates cluster failure and an outage. When the status of API Data Store signals red, the overall status of API Gateway is red.
yellow | Indicates a node failure in the cluster. However, the cluster is still functioning and operational.

API Gateway resource status

Status | Description
green | Indicates that API Gateway resource types like memory, disk space, and service threads are available to run.
yellow | Indicates that API Gateway does not have adequate resources to run. When the API Gateway resource status is yellow, the overall status of API Gateway is yellow.

Cluster status within nodes

Status | Description
green | Indicates that the cluster is in a healthy state. The cluster status is green only when Terracotta Server Array is up and running. When the status of the cluster signals green, the overall status of API Gateway is green.
red | Indicates cluster failure and an outage. When the status of the cluster signals red, the overall status of API Gateway is red.

A sample HTTP response is as follows:

{
    "status": "green",
    "elasticsearch": {
        "cluster_name": "api_gateway_cluster",
        "status": "green",
        "number_of_nodes": "3",
        "number_of_data_nodes": "3",
        "timed_out": "false",
        "active_shards": "200",
        "initializing_shards": "0",
        "unassigned_shards": "0",
        "task_max_waiting_in_queue_millis": "0",
        "node": "localhost:9240",
        "response_time_ms": "4"
    },
    "is": {
        "status": "green",
        "diskspace": {
            "status": "up",
            "free": "14206386176",
            "inuse": "17994313728",
            "threshold": "3220069990",
            "total": "32200699904"
        },
        "memory": {
            "status": "up",
            "freemem": "420766624",
            "maxmem": "2147483648",
            "threshold": "161061273",
            "totalmem": "1610612736"
        },
        "servicethread": {
            "status": "up",
            "avail": "397",
            "inuse": "3",
            "max": "400",
            "threshold": "40"
        },
        "response_time_ms": "309"
    },
    "cluster": {
        "status": "green",
        "isClusterAware": "true",
        "nodes": "3",
        "response_time_ms": "518"
    },
    "runtime": {
        "status": "green",
        "start_mode": "up"
    }
}

The overall cluster status of API Gateway is green because all components work as expected.
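
The status rules in the tables above can be sketched as a small Python function that derives the overall status from the three component statuses. Two points are assumptions rather than documented behavior: that the elasticsearch, is, and cluster keys of the sample response correspond to API Data Store, API Gateway resources, and the cluster status within nodes, and that a red component takes precedence over a yellow resource status. The payload below is a trimmed copy of the sample response.

```python
import json

SAMPLE = """
{
    "status": "green",
    "elasticsearch": {"status": "green"},
    "is": {"status": "green"},
    "cluster": {"status": "green"}
}
"""


def derive_overall_status(data_store: str, resources: str, cluster: str) -> str:
    """Combine component statuses following the tables above."""
    if data_store == "red" or cluster == "red":
        return "red"  # cluster failure and outage (precedence over yellow is assumed)
    if resources == "yellow":
        return "yellow"  # inadequate memory, disk space, or service threads
    return "green"  # a yellow API Data Store still yields an overall green


def summarize(payload: str) -> dict:
    """Extract component statuses and compare the reported and derived overall status."""
    body = json.loads(payload)
    components = {
        "data_store": body["elasticsearch"]["status"],  # assumed key mapping
        "resources": body["is"]["status"],              # assumed key mapping
        "cluster": body["cluster"]["status"],
    }
    return {
        "reported": body["status"],
        "derived": derive_overall_status(**components),
        "components": components,
    }


print(summarize(SAMPLE))
```

Comparing the reported status with the derived one is a quick way to see which component is responsible when the probe does not return green.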

Administration Service Health Probe at Cluster Level

To check the availability and health status of the webMethods API Gateway administration service (UI, Dashboards) at the cluster level, use the following REST endpoint:

GET /rest/apigateway/health/admin

The following table shows the response code and the description.

Response | Description
200 OK | Administration service health check is successful.
500 Internal server error | Denotes a problem. The response JSON indicates the problem.
Timeout or no response (the request did not reach the probe) | Several factors can delay the Administration Service Health Probe when you initiate it, which may result in timeout errors. For the reasons, see Causes for timeout errors.

The overall Administration Service Health Probe status can be green or red based on the webMethods API Gateway administration service's status and Kibana status.

Kibana status

Status | Description
green | Indicates that Kibana's port is accessible. When the status signals green, the overall status of Administration Service Health Probe is green.
red | Indicates that either Kibana's port is inaccessible or Kibana's communication with API Data Store is not established. When the status signals red, the overall status of Administration Service Health Probe is red.

API Gateway administration service status

Status | Description
green | Indicates that webMethods API Gateway administration service is available. When the status signals green, the overall status of Administration Service Health Probe is green.
red | Indicates that webMethods API Gateway administration service is not available. When the status signals red, the overall status of Administration Service Health Probe is red.

A sample HTTP response is as follows:

{
    "status": "green",
    "ui": {
        "status": "green",
        "response_time_ms": "40"
    },
    "kibana": {
        "status": {
            "overall": {
                "state": "green",
                "nickname": "Looking good",
                "icon": "success",
                "uiColor": "secondary"
            }
        },
        "response_time_ms": "36"
    }
}

The overall status is green because the webMethods API Gateway administration service and Kibana are both in a healthy state.
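
The sample response can be evaluated with a short Python sketch that reads the two component statuses and combines them. The field paths ui.status and kibana.status.overall.state are taken from the sample above; the shape of a failure response is not shown in this document, so the sketch assumes that any missing field should count as red. The function and helper names are illustrative.

```python
import json

SAMPLE = """
{
    "status": "green",
    "ui": {"status": "green", "response_time_ms": "40"},
    "kibana": {
        "status": {"overall": {"state": "green"}},
        "response_time_ms": "36"
    }
}
"""


def dig(node, *keys):
    """Follow a path of keys through nested dicts; return None if any step is missing."""
    for key in keys:
        if not isinstance(node, dict) or key not in node:
            return None
        node = node[key]
    return node


def admin_health(payload: str) -> dict:
    """Derive the administration probe status from the UI and Kibana statuses."""
    body = json.loads(payload)
    ui = dig(body, "ui", "status")
    kibana = dig(body, "kibana", "status", "overall", "state")
    # Green only when both components report green; anything else is treated as red.
    derived = "green" if ui == "green" and kibana == "green" else "red"
    return {"reported": body.get("status"), "derived": derived, "ui": ui, "kibana": kibana}


print(admin_health(SAMPLE))
```

Reporting the two component values alongside the derived status shows whether the administration service or Kibana is the source of a red result.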