Health Monitor Plugin

The Health Monitor plugin adds an HTTP REST endpoint to the URL of the session to which the realm server is connected. This allows clients to query the current state of the realm server. The endpoint defines the "liveness" of the server. The plugin returns the result of the health checks that run on the server at periodic intervals.

Adding the plugin to the realm server

The plugin can be added to a realm server using either of the following methods:

- Using the Add Plugin feature in the Comms > Interfaces > Plugins dialog of the Enterprise Manager. Specify the value of the mount point for the URL endpoint in the URL Path field of the Add Plugin dialog.

  See the section Plugins for details.
- Using the command line tool AddHealthMonitorPlugin. Specify the value of the mount point for the URL endpoint in the -mountpath argument of the tool.

  For details of running this command line tool, see the section Syntax: Miscellaneous Tools.

Endpoints

Endpoints do not support authentication. To ensure secure access to an endpoint, you must configure it over an HTTPS interface. If the requested endpoint is not valid, the server returns a "404" error code with a "Not Found" response status.

Following is a list of the available endpoints and their function:

HealthMonitor
To access the HealthMonitor endpoint, use either of the following URL formats: <mount point>/ or <mount point>/health.

For example, if your server mount point is "monitor", the URL is: http://localhost:9000/monitor/ or http://localhost:9000/monitor/health.

If the server mount point is blank, the URL is: http://localhost:9000/ or http://localhost:9000/health.
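The URL rules above can be sketched as a small helper. This is only an illustration: the function name `endpoint_url` and its parameters are hypothetical and not part of the product; the host and port are the ones used in the examples above.

```python
def endpoint_url(base, mount_point="", endpoint=""):
    """Join a base URL, an optional mount point and an optional endpoint name.

    Hypothetical helper for illustration only; it mirrors the URL
    formats shown in the examples above.
    """
    # Keep only the non-empty path segments, without surrounding slashes.
    segments = [s.strip("/") for s in (mount_point, endpoint) if s.strip("/")]
    url = base.rstrip("/") + "/" + "/".join(segments)
    # A bare mount point is written with a trailing slash, as in the examples.
    if segments and not endpoint.strip("/"):
        url += "/"
    return url
```

With this sketch, `endpoint_url("http://localhost:9000", "monitor", "health")` yields the second URL form shown above, and leaving `mount_point` empty yields the blank mount point forms.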

The realm server runs four different tasks at regular intervals on the server to monitor the health status of the server:
- Memory monitor Task (MemoryHealthMonitor):
  This task monitors the memory status of the server and produces an alert or error as soon as any memory-related issue is found. The task checks heap and direct memory usage; if usage exceeds a threshold of 95%, this is considered an error, which is reported and logged.
  
  The server is not considered to be unhealthy when the first such error occurs; instead a server is considered unhealthy only if the memory monitor task returns an error 3 times consecutively.
- Stalled-Tasks Monitor Task (StalledTasksMonitor):
  If any thread pool has more than 5 stalled tasks, this is considered an error, but the status is only reported as unhealthy if the error occurs 5 times consecutively.
- Cluster state monitor Task (ClusterStateMonitor):
  If a server is configured to be part of a cluster, and the last time that the server successfully joined the cluster is more than 600000 milliseconds (10 minutes) ago, then the server is considered to be unhealthy.
- Server round trip Task (ServerRoundTripMonitor):
  The server round trip task checks the processing time in a cluster. nClusterRoundTripEvent events are synchronous events that measure this processing time. The realm server sends these events into the cluster and records the time it takes to complete the processing in the cluster. If an event takes more than 30 seconds to complete processing and be acknowledged, this is considered to be an error. Not receiving an acknowledgement for the event is also considered to be an error. If 5 such errors occur consecutively, the server is considered to be unhealthy.
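
Several of the tasks above report the server as unhealthy only after a number of consecutive errors. The following is a minimal sketch of that pattern, not the server's actual implementation. The class and function names are hypothetical, and it assumes that a successful check resets the error count and that "exceeds" means strictly greater than the threshold.

```python
class ConsecutiveErrorTracker:
    """Counts consecutive failed checks (hypothetical illustration)."""

    def __init__(self, limit):
        self.limit = limit   # e.g. 3 for the memory monitor, 5 for stalled tasks
        self.streak = 0      # number of failures in a row so far

    def record(self, check_failed):
        """Record one check result; return True if the server counts as unhealthy."""
        # Assumption: any successful check resets the streak to zero.
        self.streak = self.streak + 1 if check_failed else 0
        return self.streak >= self.limit


def memory_check_failed(used_bytes, max_bytes, threshold=0.95):
    """Return True if memory usage exceeds the threshold (95% by default)."""
    return used_bytes / max_bytes > threshold
```

For example, a tracker created with `ConsecutiveErrorTracker(3)` reports unhealthy only on the third failed check in a row, which matches the description of the memory monitor task.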
Server Responses

If the server is fully operational, and is an active member of a cluster, the query returns a response "OK" of the following form:
```
{"ServerStatus":"OK","ServerStatusDetails":"{}"}
```
Even when the return code is "OK", the response can contain additional information (in JSON format), such as useful statistics, or as an indication that the server is approaching certain limits.

If the server is not fully operational, the query returns a status "ERROR" with an appropriate description of the problem, for example:
```
{"ServerStatus":"ERROR","ServerStatusDetails":
  {"MemoryHealthMonitor":
    "Max threshold of used Heap memory is exceeded, Heap memory used - 338 MB"
  }
}
```
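
A client might interpret these responses along the following lines. This is a sketch under assumptions: the function name `parse_health` is hypothetical, and because the "OK" example shows `ServerStatusDetails` as the string "{}" while the "ERROR" example shows a nested object, the code accepts both shapes.

```python
import json


def parse_health(body):
    """Return (healthy, details) from a HealthMonitor response body.

    Hypothetical helper; field names are taken from the examples above.
    """
    doc = json.loads(body)
    details = doc.get("ServerStatusDetails", {})
    # The details may arrive as a JSON string ("{}") or as an object.
    if isinstance(details, str):
        try:
            details = json.loads(details)
        except json.JSONDecodeError:
            # Keep unparseable text rather than discarding it.
            details = {"raw": details}
    return doc.get("ServerStatus") == "OK", details
```

Returning the details alongside the status lets a caller log the name of the failing monitor task, such as `MemoryHealthMonitor`, or inspect additional information returned with an "OK" status.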
IsMaster
The IsMaster endpoint determines whether the current server is the primary node in a cluster.

To access the IsMaster endpoint, use the "IsMaster" mount point. This mount point is not case-sensitive: "IsMaster", "isMaster", and "ismaster" are all valid values.

For example, if your server mount point is "monitor", the URL is: http://localhost:9000/monitor/isMaster.

Server Responses

If the server is fully operational, and is the primary node of the cluster, the query returns a response "OK (200)" of the following form:
```
{"ClusterState":"MASTER"}
```
If the server is not fully operational, or is not the primary node of the cluster, the query returns a status "Service Unavailable (503)" with an appropriate description of the problem, for example:
- The server is part of an active cluster as a secondary node:
```
{"ClusterState":"SLAVE"}
```
- The server is part of a cluster which is not active:
```
{"ClusterState":"OFFLINE"}
```
- The server is part of an active cluster, but is currently reconnecting to it:
```
{"ClusterState":"RECOVERY"}
```
- The server is not part of a cluster:
```
{"ClusterState":"NON_CLUSTERED"}
```
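
A caller such as a health-check script might combine the HTTP status code and the response body as sketched below. The function name `is_primary` and the `STATE_MEANINGS` table are hypothetical; the state names and their meanings are taken from the responses listed above.

```python
import json

# Meanings of the ClusterState values, as described for this endpoint.
STATE_MEANINGS = {
    "MASTER": "primary node of the cluster",
    "SLAVE": "secondary node of an active cluster",
    "OFFLINE": "part of a cluster which is not active",
    "RECOVERY": "reconnecting to an active cluster",
    "NON_CLUSTERED": "not part of a cluster",
}


def is_primary(http_status, body):
    """Return True only for a 200 response that reports the MASTER state."""
    state = json.loads(body).get("ClusterState")
    return http_status == 200 and state == "MASTER"
```

Checking both the status code and the body is a deliberately cautious choice in this sketch; according to the description above, the status code alone already distinguishes the primary node from the other cases.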
GetClusterState
The GetClusterState endpoint determines the current cluster state of a given server.

To access the GetClusterState endpoint, use the "GetClusterState" mount point. This mount point is not case-sensitive: "GetClusterState", "getClusterState", and "getclusterstate" are all valid values.

For example, if your server mount point is "monitor", the URL is: http://localhost:9000/monitor/getClusterState.

Server Responses

Regardless of the cluster state of the server, the query returns the response "OK (200)". The response body reports the cluster state for the same scenarios described for the IsMaster endpoint.
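
As an illustration, the cluster state could be queried from a script using only the Python standard library. This is a sketch, not a supported client: the function names are hypothetical, and the code assumes the response body is the JSON form shown for the IsMaster endpoint.

```python
import json
import urllib.error
import urllib.request


def parse_cluster_state(raw):
    """Extract the ClusterState value from a response body, or None if absent."""
    try:
        doc = json.loads(raw)
    except ValueError:
        # Not JSON, for example the body of a 404 response.
        return None
    return doc.get("ClusterState") if isinstance(doc, dict) else None


def fetch_cluster_state(url, timeout=5.0):
    """Return (http_status, cluster_state) for a cluster state URL."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status, parse_cluster_state(resp.read())
    except urllib.error.HTTPError as err:
        # urllib raises for 4xx/5xx responses; the body is still readable.
        return err.code, parse_cluster_state(err.read())
```

A call such as `fetch_cluster_state("http://localhost:9000/monitor/getClusterState")` would return the status code together with the reported state. Because the endpoints do not support authentication, an HTTPS URL is preferable outside of local testing.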