Monitoring API Gateway

As part of application monitoring, you can monitor the state of API Gateway, that is, its cluster status and console access, along with its resources.

How do I monitor the health of API Gateway?

Prerequisites:

You must have valid API Gateway user credentials to use the Readiness Probe, Runtime Service Health Probe, and Administration Service Health Probe.
All the node-level probes must be set up to target the local instance, typically localhost.
Software AG recommends setting up a dedicated port for monitoring with an appropriate private thread pool.

Readiness Probe at Node-Level

To monitor the readiness of API Gateway, that is, to check whether the traffic-serving port of a particular API Gateway node is ready to accept requests, use the following REST endpoint:

GET /rest/apigateway/health

The following table shows the response code and the description.

Response	Description
`200 OK`	Readiness check is successful. Readiness probe continues to reply OK if API Gateway remains in an operational state to serve the requests.
`500 Internal server error`	Readiness check failed and denotes a problem.
`timeout` or `no response as the request did not reach the probe`	Several factors can delay the Readiness Probe when it starts, which may result in timeout errors. For the reasons, see Causes for timeout errors.

Note: Because only the response status code matters for a Readiness Probe, by design no JSON payload is returned in the response in either the success or the failure scenario.
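The three outcomes in the table above can be handled in a small client. The following is a minimal sketch, not an official client: it assumes that the probe accepts HTTP Basic authentication, and the helper names, base URL, user name, and password are hypothetical placeholders.

```python
import base64
import urllib.error
import urllib.request


def interpret_readiness(status_code):
    """Map a probe outcome to a label; None stands for timeout or no response."""
    if status_code is None:
        return "timeout"
    if status_code == 200:
        return "ready"
    return "not ready"


def check_readiness(base_url, user, password, timeout_s=5.0):
    """Call the readiness endpoint and return the interpreted outcome."""
    # Assumption: the probe accepts HTTP Basic authentication.
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    request = urllib.request.Request(
        base_url.rstrip("/") + "/rest/apigateway/health",
        headers={"Authorization": "Basic " + token},
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout_s) as response:
            return interpret_readiness(response.status)
    except urllib.error.HTTPError as error:
        # Non-2xx responses such as 500 are raised as HTTPError.
        return interpret_readiness(error.code)
    except OSError:
        # Connection refused, DNS failure, or timeout: the probe was not reached.
        return interpret_readiness(None)
```

A call such as `check_readiness("http://localhost:5555", "monitor_user", "secret")` would then return `"ready"`, `"not ready"`, or `"timeout"`. Only the status code is inspected, because the probe returns no JSON payload.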

Runtime Service Health Probe at Node-Level

To monitor the runtime service health of API Gateway, that is, to check the overall cluster health and to identify whether the components of a particular API Gateway node are in an operational state, use the following REST endpoint:

GET /rest/apigateway/health/engine

The following table shows the response code and the description.

Response	Description
`200 OK`	Runtime service health check is successful.
`500 Internal server error`	Runtime service health check failed and denotes a problem. The response JSON indicates the problem.
`timeout` or `no response as the request did not reach the probe`	Several factors can delay the Runtime Service Health Probe when it starts, which may result in timeout errors. For the reasons, see Causes for timeout errors.

The response JSON of each health check request contains a status field.

The overall status of an API Gateway node can be green, yellow, or red.

Status	Description
`green`	Indicates that the cluster within the node is in a healthy state.
`yellow`	Indicates that API Gateway does not have adequate resources to run.
`red`	Indicates the cluster failure in the node and an outage.

The overall status of API Gateway node is assessed based on the API Data Store status, API Gateway resource status, and the Terracotta server status.

API Data Store status

Status	Description
`green`	Indicates that API Data Store is in a healthy state. When the status of API Data Store signals `green` or `yellow`, the overall status of API Gateway is `green`.
`red`	Indicates that API Data Store is not in a healthy state. When the status of API Data Store signals `red`, the overall status of API Gateway is `red`.
`yellow`	Indicates a node failure in the cluster. However, the cluster is still functioning and operational.

API Gateway resource status

Status	Description
`green`	Indicates that API Gateway resource types like memory, disk space, and service threads are available to run.
`yellow`	Indicates that API Gateway does not have adequate resources to run. When the API Gateway resource status is `yellow`, the overall status of API Gateway is `yellow`.

Terracotta Server Array status

Status	Description
`green`	Indicates that Terracotta server is in a healthy state. When the status of Terracotta server signals `green`, the overall status of API Gateway is `green`.
`red`	Indicates that Terracotta server is not in a healthy state. When the status of Terracotta server signals `red`, the overall status of API Gateway is `red`.

A sample HTTP response is as follows:


{
    "status": "green",
    "elasticsearch": {
        "cluster_name": "SAG_EventDataStore",
        "status": "yellow",
        "number_of_nodes": "1",
        "number_of_data_nodes": "1",
        "timed_out": "false",
        "active_shards": "95",
        "initializing_shards": "0",
        "unassigned_shards": "92",
        "task_max_waiting_in_queue_millis": "0",
        "port_9240": "ok",
        "response_time_ms": "526"
    },
    "is": {
        "status": "green",
        "diskspace": {
            "status": "up",
            "free": "908510568448",
            "inuse": "104799719424",
            "threshold": "101331028787",
            "total": "1013310287872"
        },
        "memory": {
            "status": "up",
            "freemem": "425073672",
            "maxmem": "954728448",
            "threshold": "92222259",
            "totalmem": "922222592"
        },
        "servicethread": {
            "status": "up",
            "avail": "72",
            "inuse": "3",
            "max": "75",
            "threshold": "7"
        },
        "response_time_ms": "258"
    },
    "terracotta": {
        "status": "green",
        "nodes": "1",
        "healthy_nodes": "1",
        "response_time_ms": "22"
    }
}

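The overall engine status is green: the API Data Store status is yellow, which still results in an overall status of green, and the API Gateway resources and the Terracotta server are green.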
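The status rules in the tables above can be expressed as a small function. The sketch below is illustrative only and rests on three assumptions: the `elasticsearch`, `is`, and `terracotta` keys of the sample response correspond to the API Data Store, API Gateway resource, and Terracotta statuses; `red` takes precedence over `yellow` when both apply; and all three keys are present. The function names are hypothetical.

```python
import json


def overall_engine_status(data_store, resources, terracotta):
    """Derive the overall node status from the three component statuses."""
    # Assumption: red takes precedence over yellow when both apply.
    if data_store == "red" or terracotta == "red":
        return "red"
    # A yellow data store does not degrade the node; yellow resources do.
    if resources == "yellow":
        return "yellow"
    return "green"


def component_statuses(payload):
    """Extract (data store, resources, terracotta) statuses from the response JSON."""
    return (
        payload["elasticsearch"]["status"],
        payload["is"]["status"],
        payload["terracotta"]["status"],
    )


# Trimmed version of the sample response, keeping only the status fields.
sample = json.loads("""
{
  "status": "green",
  "elasticsearch": {"status": "yellow"},
  "is": {"status": "green"},
  "terracotta": {"status": "green"}
}
""")

derived = overall_engine_status(*component_statuses(sample))
print(derived, derived == sample["status"])
```

For the sample response the derived value matches the reported top-level `status`, which is the value a monitoring client would normally read directly.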

Administration Service Health Probe at Node-Level

To check the availability and health status of the API Gateway administration service (UI, Dashboards) on a particular API Gateway node, use the following REST endpoint:

GET /rest/apigateway/health/admin

The following table shows the response code and the description.

Response	Description
`200 OK`	Administration service health check is successful.
`500 Internal server error`	Administration service health check failed and denotes a problem. The response JSON indicates the problem.
`timeout` or `no response as the request did not reach the probe`	Several factors can delay the Administration Service Health Probe when you initiate it, which may result in timeout errors. For the reasons, see Causes for timeout errors.

The overall Administration Service Health Probe status can be green or red based on the API Gateway administration service's status and Kibana's status.

Kibana status

Status	Description
`green`	Indicates that Kibana's port is accessible. When the status signals `green`, the overall status of Administration Service Health Probe is `green`.
`red`	Indicates that either Kibana's port is inaccessible or Kibana's communication with API Data Store is not established. When the status signals `red`, the overall status of Administration Service Health Probe is `red`.

API Gateway administration service status

Status	Description
`green`	Indicates that API Gateway administration service is available. When the status signals `green`, the overall status of Administration Service Health Probe is `green`.
`red`	Indicates that API Gateway administration service is not available. When the status signals `red`, the overall status of Administration Service Health Probe is `red`.

A sample HTTP response is as follows:


{
    "status": "green",
    "ui": {
        "status": "green",
        "response_time_ms": "40"
    },
    "kibana": {
        "status": {
            "overall": {
                "state": "green",
                "nickname": "Looking good",
                "icon": "success",
                "uiColor": "secondary"
            }
        },
        "response_time_ms": "36"
    }
}

The overall status is green since the API Gateway administration service and Kibana are both in a healthy state.
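The combination rule for the administration probe can be sketched in the same way. This example assumes the response layout of the sample above (`ui.status` and `kibana.status.overall.state`). The layout of a failing response is not shown in this document, so the sketch treats any unexpected shape as `red`, which is an assumption. The function names are hypothetical.

```python
def kibana_state(payload):
    """Read Kibana's overall state; an unexpected layout is treated as red (assumption)."""
    try:
        return payload["kibana"]["status"]["overall"]["state"]
    except (KeyError, TypeError):
        return "red"


def overall_admin_status(payload):
    """Green only if both the administration service (ui) and Kibana are green."""
    ui_state = payload.get("ui", {}).get("status", "red")
    if ui_state == "green" and kibana_state(payload) == "green":
        return "green"
    return "red"


# Trimmed version of the sample response shown above.
sample = {
    "status": "green",
    "ui": {"status": "green", "response_time_ms": "40"},
    "kibana": {
        "status": {"overall": {"state": "green"}},
        "response_time_ms": "36",
    },
}

print(overall_admin_status(sample))
```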

How do I collect metrics?

To check the usage of the application and system parameters, use the following metrics endpoint: GET /metrics. When the endpoint is called, API Gateway gathers metrics and returns the data in the Prometheus format.

Note: Prometheus is a non-Software AG, open-source monitoring tool that collects time-series metrics and helps in trend analysis. For more information, see https://prometheus.io/.

Prometheus metrics are exposed through the following endpoint.

[http|https]://host:port/metrics

The metrics endpoint by default is available on the following ports:

Default primary port (http): 5555
Default secure port (https): 5543
Default diagnostic port (debug port): 9999

A sample for the metrics endpoint is as follows:

http://server:5555/metrics
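The body returned by the metrics endpoint is plain text in the Prometheus exposition format: comment lines start with `#`, and each sample line holds a metric name (optionally with labels) followed by a value. The following minimal parser is a sketch that assumes sample lines carry no optional trailing timestamp. The metric names in the example text are hypothetical and are not actual API Gateway metrics.

```python
def parse_prometheus_text(text):
    """Parse Prometheus text exposition lines into (series, value) pairs."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        # Split at the last space so label values containing spaces stay intact.
        # Assumption: no optional timestamp follows the value.
        series, _, value = line.rpartition(" ")
        samples.append((series, float(value)))
    return samples


# Hypothetical metric names, used for illustration only.
example_body = """\
# HELP demo_requests_total Hypothetical counter for illustration.
# TYPE demo_requests_total counter
demo_requests_total{method="GET"} 42
demo_heap_bytes 1.5e+06
"""

for series, value in parse_prometheus_text(example_body):
    print(series, value)
```

In practice a Prometheus server scrapes this endpoint directly, so a hand-written parser like this is mainly useful for quick checks or ad hoc scripts.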

Authentication for the metrics endpoint

By default, authentication is disabled when running API Gateway as a Docker container.
For on-premise installations, the following environment variable can be set to switch off the authentication for the metrics endpoint:
```
SAG_IS_METRICS_ENDPOINT_ACL=Anonymous
```
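A client can therefore call the endpoint with or without credentials, depending on the installation. The sketch below builds the request for both cases. It assumes that HTTP Basic authentication is accepted when authentication is enabled; the function name, user name, and password are hypothetical placeholders.

```python
import base64
import urllib.request


def build_metrics_request(base_url, credentials=None):
    """Build a GET request for /metrics; credentials is an optional (user, password) pair."""
    headers = {}
    if credentials is not None:
        user, password = credentials
        # Assumption: HTTP Basic authentication is accepted when authentication is on.
        token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
        headers["Authorization"] = "Basic " + token
    return urllib.request.Request(base_url.rstrip("/") + "/metrics", headers=headers)


# Anonymous access, for example when authentication is switched off.
anonymous = build_metrics_request("http://server:5555")
# Authenticated access with placeholder credentials.
authenticated = build_metrics_request("http://server:5555", ("monitor_user", "secret"))

print(anonymous.full_url)
print(authenticated.get_header("Authorization") is not None)
```

Either request object can then be passed to `urllib.request.urlopen` to fetch the metrics text.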

The endpoint also exposes the Integration Server Prometheus metrics. For more details on the Integration Server Prometheus metrics, see Developing Microservices with webMethods Microservices Runtime.

Exposing API Gateway Prometheus Metrics over a dedicated port

The metrics endpoint can be made available on a custom port. After creating the port, add the following service to the port's allow list:

wm.server.query:getPrometheusStats

Similarly, the metrics endpoint can be removed from the default ports (5555, 5543, or 9999) by removing the service from the allow or deny lists of those ports.