Monitoring API Gateway Health

How do I monitor the health of API Gateway?

Prerequisites:

You must have valid webMethods API Gateway user credentials to use the Readiness Probe, Runtime Service Health Probe, and Administration Service Health Probe.
All node-level probes must be set up to target the local instance, typically localhost.
IBM recommends that you set up a dedicated port for monitoring with an appropriate private thread pool.

Readiness Probe at Node-Level

To monitor the readiness of API Gateway, that is, to check whether the traffic-serving port of a particular webMethods API Gateway node is ready to accept requests, use the following REST endpoint:

GET /rest/apigateway/health

The following table shows the response code and the description.

Response	Description
`200 OK`	Readiness check is successful. Readiness probe continues to reply OK if webMethods API Gateway remains in an operational state to serve the requests.
`500 Internal server error`	Readiness check failed and denotes a problem.
`timeout` or `no response as the request did not reach the probe`	Several factors can delay the Readiness Probe when it initiates, which can result in timeout errors. For the possible reasons, see Causes for timeout errors.

Note: Because this is a Readiness Probe and only the response status code matters, the response by design contains no JSON payload in either the success or the failure scenario.
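The readiness check can be scripted. The following Python sketch is illustrative only: it assumes that the probe accepts HTTP Basic authentication, and the function names, base URL, and timeout value are hypothetical and not part of the product. It maps the three outcomes in the table above to a verdict.

```python
import base64
import urllib.error
import urllib.request


def classify_readiness(status_code):
    """Map a probe outcome to a verdict. None stands for timeout or no response."""
    if status_code is None:
        return "unreachable"
    if status_code == 200:
        return "ready"
    if status_code == 500:
        return "not ready"
    return "unexpected"


def probe_readiness(base_url, user, password, timeout_s=5.0):
    """Send GET /rest/apigateway/health and return the HTTP status code, or None."""
    # Assumption: the probe accepts HTTP Basic authentication.
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    request = urllib.request.Request(
        base_url + "/rest/apigateway/health",
        headers={"Authorization": "Basic " + token},
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout_s) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # urllib raises HTTPError for 4xx and 5xx responses; keep the status code.
        return err.code
    except OSError:
        # URLError and socket timeouts are OSError subclasses: the request
        # timed out or never reached the probe.
        return None
```

For example, `classify_readiness(probe_readiness("http://localhost:5555", user, password))` would check the local node, where the port and the credentials are placeholders for your own monitoring port and user. Only the status code is inspected because the Readiness Probe returns no JSON payload.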

Runtime Service Health Probe

To monitor the runtime service health of webMethods API Gateway, that is, to check the overall cluster health and to identify whether the components of a particular webMethods API Gateway node are in an operational state, use the following REST endpoint:

GET /rest/apigateway/health/engine

The following table shows the response code and the description.

Response	Description
`200 OK`	Runtime service health check is successful.
`500 Internal server error`	Runtime service health check failed and denotes a problem. The response JSON indicates the problem.
`timeout` or `no response as the request did not reach the probe`	Several factors can delay the Runtime Service Health Probe when it initiates, which can result in timeout errors. For the possible reasons, see Causes for timeout errors.

The response JSON of each health check request contains a status field.

The overall status of a webMethods API Gateway node can be green, yellow, or red.

Status	Description
`green`	Indicates that the cluster within the node is in a healthy state.
`yellow`	Indicates that API Gateway does not have adequate resources to run.
`red`	Indicates a cluster failure in the node and an outage.

The overall status of API Gateway node is assessed based on the API Data Store status, API Gateway resource status, and the Terracotta server status.

API Data Store status

Status	Description
`green`	Indicates that API Data Store is in a healthy state. When the status of API Data Store signals `green` or `yellow`, the overall status of API Gateway is `green`.
`yellow`	Indicates a node failure in the cluster. However, the cluster is still functioning and operational.
`red`	Indicates that API Data Store is not in a healthy state. When the status of API Data Store signals `red`, the overall status of API Gateway is `red`.

API Gateway resource status

Status	Description
`green`	Indicates that API Gateway resource types like memory, disk space, and service threads are available to run.
`yellow`	Indicates that API Gateway does not have adequate resources to run. When the API Gateway resource status is `yellow`, the overall status of API Gateway is `yellow`.

Terracotta Server Array status

Status	Description
`green`	Indicates that Terracotta server is in a healthy state. When the status of Terracotta server signals `green`, the overall status of webMethods API Gateway is `green`.
`red`	Indicates that Terracotta server is not in a healthy state. When the status of Terracotta server signals `red`, the overall status of webMethods API Gateway is `red`.
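The rules in the three tables above can be summarized as a small function. This Python sketch is one reading of those tables, not the product's implementation. The function name is hypothetical, and the precedence of `red` over `yellow` when both apply is an assumption that the tables do not state.

```python
def overall_engine_status(data_store, resources, terracotta):
    """Derive the overall node status from the three component statuses."""
    # A red API Data Store or a red Terracotta server makes the node red.
    # Assumption: red takes precedence over a yellow resource status.
    if data_store == "red" or terracotta == "red":
        return "red"
    # Inadequate API Gateway resources make the node yellow.
    if resources == "yellow":
        return "yellow"
    # A green or yellow API Data Store still counts as green overall.
    return "green"
```

With this reading, a `yellow` API Data Store alone does not degrade the node: `overall_engine_status("yellow", "green", "green")` returns `"green"`.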

A sample HTTP response is as follows:


{
    "status": "green",
    "elasticsearch": {
        "cluster_name": "SAG_EventDataStore",
        "status": "yellow",
        "number_of_nodes": "1",
        "number_of_data_nodes": "1",
        "timed_out": "false",
        "active_shards": "95",
        "initializing_shards": "0",
        "unassigned_shards": "92",
        "task_max_waiting_in_queue_millis": "0",
        "port_9240": "ok",
        "response_time_ms": "526"
    },
    "is": {
        "status": "green",
        "diskspace": {
            "status": "up",
            "free": "908510568448",
            "inuse": "104799719424",
            "threshold": "101331028787",
            "total": "1013310287872"
        },
        "memory": {
            "status": "up",
            "freemem": "425073672",
            "maxmem": "954728448",
            "threshold": "92222259",
            "totalmem": "922222592"
        },
        "servicethread": {
            "status": "up",
            "avail": "72",
            "inuse": "3",
            "max": "75",
            "threshold": "7"
        },
        "response_time_ms": "258"
    },
    "terracotta": {
        "status": "green",
        "nodes": "1",
        "healthy_nodes": "1",
        "response_time_ms": "22"
    }
}

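The overall engine status is green: the API Data Store section (elasticsearch) reports yellow, which still results in an overall green status, and the API Gateway resources section (is) and the Terracotta section both report green.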
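A monitoring script would typically read the status fields from this response. The following Python sketch uses an abbreviated copy of the sample above. The function name and the decision to skip a missing section are assumptions made for illustration, not documented behavior. Numeric values arrive as JSON strings in the sample, so they are converted explicitly.

```python
import json

# Abbreviated copy of the sample response shown above.
SAMPLE = """
{
  "status": "green",
  "elasticsearch": {"status": "yellow", "unassigned_shards": "92", "response_time_ms": "526"},
  "is": {
    "status": "green",
    "diskspace": {"status": "up"},
    "memory": {"status": "up"},
    "servicethread": {"status": "up", "avail": "72", "inuse": "3", "max": "75"},
    "response_time_ms": "258"
  },
  "terracotta": {"status": "green", "response_time_ms": "22"}
}
"""


def summarize_engine_health(body):
    """Return the overall status plus status and response time per component."""
    payload = json.loads(body)
    components = {}
    for name in ("elasticsearch", "is", "terracotta"):
        section = payload.get(name)
        if section is None:
            # Defensive choice for this sketch: skip a section that is absent.
            continue
        components[name] = {
            "status": section["status"],
            # Numbers are serialized as strings in the sample response.
            "response_time_ms": int(section["response_time_ms"]),
        }
    return {"overall": payload["status"], "components": components}


summary = summarize_engine_health(SAMPLE)
```

Here `summary["overall"]` is `"green"` even though `summary["components"]["elasticsearch"]["status"]` is `"yellow"`, which matches the API Data Store table.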

Administration Service Health Probe

To check the availability and health status of the API Gateway administration service (UI, Dashboards) on a particular API Gateway node, use the following REST endpoint:

GET /rest/apigateway/health/admin

The following table shows the response code and the description.

Response	Description
`200 OK`	Administration service health check is successful.
`500 Internal server error`	Denotes a problem. The response JSON indicates the problem.
`timeout` or `no response as the request did not reach the probe`	Several factors can delay the Administration Service Health Probe when you initiate it, which can result in timeout errors. For the possible reasons, see Causes for timeout errors.

The overall Administration Service Health Probe status can be green or red based on the webMethods API Gateway administration service's status and Kibana's status.

Kibana status

Status	Description
`green`	Indicates that Kibana's port is accessible. When the status signals `green`, the overall status of Administration Service Health Probe is `green`.
`red`	Indicates that either Kibana's port is inaccessible or Kibana's communication with API Data Store is not established. When the status signals `red`, the overall status of Administration Service Health Probe is `red`.

API Gateway administration service status

Status	Description
`green`	Indicates that webMethods API Gateway administration service is available. When the status signals `green`, the overall status of Administration Service Health Probe is `green`.
`red`	Indicates that webMethods API Gateway administration service is not available. When the status signals `red`, the overall status of Administration Service Health Probe is `red`.

A sample HTTP response is as follows:


{
    "status": "green",
    "ui": {
        "status": "green",
        "response_time_ms": "40"
    },
    "kibana": {
        "status": {
            "overall": {
                "state": "green",
                "nickname": "Looking good",
                "icon": "success",
                "uiColor": "secondary"
            }
        },
        "response_time_ms": "36"
    }
}

The overall status is green because both the webMethods API Gateway administration service and Kibana are in a healthy state.
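Note that the Kibana state is nested more deeply than the UI status in this response. The following Python sketch extracts both values from an abbreviated copy of the sample above and applies one reading of the two status tables: the overall status is green only when both components are green. The function names are hypothetical, and the combination rule is an interpretation of the tables, not a documented algorithm.

```python
import json

# Abbreviated copy of the sample response shown above.
ADMIN_SAMPLE = """
{
  "status": "green",
  "ui": {"status": "green", "response_time_ms": "40"},
  "kibana": {
    "status": {"overall": {"state": "green", "nickname": "Looking good"}},
    "response_time_ms": "36"
  }
}
"""


def admin_component_states(body):
    """Extract the overall, UI, and Kibana states from the probe response."""
    payload = json.loads(body)
    return {
        "overall": payload["status"],
        "ui": payload["ui"]["status"],
        # The Kibana state sits under kibana.status.overall.state.
        "kibana": payload["kibana"]["status"]["overall"]["state"],
    }


def expected_admin_status(ui, kibana):
    """Assumed rule: green only if both components are green, otherwise red."""
    return "green" if ui == "green" and kibana == "green" else "red"


states = admin_component_states(ADMIN_SAMPLE)
```

For the sample, `expected_admin_status(states["ui"], states["kibana"])` agrees with the reported `states["overall"]` value of `"green"`.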