Using the health monitor

You configure DR health monitoring by setting variables in the vault file. Health monitoring is performed by a Cron job script, /usr/bin/resDRstatus, which runs every minute. The Cron job file is resilient_dr-monitor. It is created and set up on the master appliance when you run the enable.dr action to enable the DR solution. If you disable DR, the Cron job file is removed. The default DR tag in Syslog is resilient-dr.

The DR system involves database replication and file system synchronization. The health monitor checks both the postgres database replication and the SOAR Platform file system synchronization and logs a status message for each. For example:
Postgresql Replication Status: Running, Replication delay=56 bytes 
File Replication Status: Running (Synced)
When a check finds a problem, the message describes it. For example, for file system synchronization:
resilient-filesync service isn't running
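
If you want to process these status messages in your own tooling, the following minimal Python sketch shows one way to extract the state and replication delay. It assumes that the messages follow the two sample formats shown above; other variants might exist and are not handled. The function name parse_status and the dictionary keys are hypothetical and are not part of the product.

```python
import re

# Patterns are based only on the two sample messages in this section.
PG_STATUS = re.compile(
    r"Postgresql Replication Status: (?P<state>[^,]+)"
    r"(?:, Replication delay=(?P<delay>\d+) bytes)?"
)
FILE_STATUS = re.compile(r"File Replication Status: (?P<state>.+)")


def parse_status(line):
    """Return a dict describing a status message, or None if unrecognized."""
    line = line.strip()
    match = PG_STATUS.search(line)
    if match:
        delay = match.group("delay")
        return {
            "component": "postgres",
            "state": match.group("state").strip(),
            # The delay is optional in this sketch; None means it was absent.
            "delay_bytes": int(delay) if delay is not None else None,
        }
    match = FILE_STATUS.search(line)
    if match:
        return {"component": "filesync", "state": match.group("state").strip()}
    return None


if __name__ == "__main__":
    print(parse_status("Postgresql Replication Status: Running, Replication delay=56 bytes"))
    print(parse_status("File Replication Status: Running (Synced)"))
```

Lines that match neither pattern, such as the problem message above, return None so that the caller can decide how to handle them.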

The status message is logged with priority info to Syslog. If there are problems, messages are logged as warn or error. For example, if database replication is not running, this is logged as a warn message.

An error message is generated in the following circumstances:
  • If the number of postgres replication slots is not equal to one. The error message differs slightly depending on whether there are zero replication slots or more than one.
  • If the number of postgres replication connections is greater than one. This indicates an unwanted connection.
  • If postgres replication is running and the number of replication bytes exceeds the lag threshold in bytes for postgres.
  • If the resilient-filesync service is running and the delay is greater than the lag threshold for resilient-filesync (set as a number of seconds).
A warn message is generated in the following circumstances:
  • When the number of replication slots is greater than or equal to one and there are no replication connections, that is, no receiver is receiving a stream from the master.
  • When the number of replication slots is greater than or equal to one and the size of the retained transactions (in bytes) is greater than the replication retained threshold (in bytes).
  • When the resilient-filesync service is not running.
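
The error and warn conditions listed above can be summarized as a small decision function. The following Python sketch is illustrative only and is not the implementation of the monitor script. The class names, field names, and message texts are hypothetical, and the threshold values are assumed to be read from the vault file by some other means.

```python
from dataclasses import dataclass


@dataclass
class DrSnapshot:
    """Hypothetical container for the values that the monitor inspects."""
    replication_slots: int
    replication_connections: int
    replication_running: bool
    replication_lag_bytes: int
    retained_bytes: int
    filesync_running: bool
    filesync_delay_seconds: int


@dataclass
class Thresholds:
    """Hypothetical names; the real values are set in the vault file."""
    postgres_lag_bytes: int
    postgres_retained_bytes: int
    filesync_lag_seconds: int


def classify(s, t):
    """Return a list of (priority, reason) pairs for one monitoring pass."""
    findings = []

    # Error conditions
    if s.replication_slots == 0:
        findings.append(("error", "no replication slot"))
    elif s.replication_slots > 1:
        findings.append(("error", "more than one replication slot"))
    if s.replication_connections > 1:
        findings.append(("error", "unwanted replication connection"))
    if s.replication_running and s.replication_lag_bytes > t.postgres_lag_bytes:
        findings.append(("error", "postgres replication lag above threshold"))
    if s.filesync_running and s.filesync_delay_seconds > t.filesync_lag_seconds:
        findings.append(("error", "file synchronization delay above threshold"))

    # Warn conditions
    if s.replication_slots >= 1 and s.replication_connections == 0:
        findings.append(("warn", "no receiver connected"))
    if s.replication_slots >= 1 and s.retained_bytes > t.postgres_retained_bytes:
        findings.append(("warn", "retained transactions above threshold"))
    if not s.filesync_running:
        findings.append(("warn", "resilient-filesync service is not running"))

    return findings
```

An empty list corresponds to a healthy system, in which case only the regular info status message would be expected in Syslog.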

An info message is generated every minute, when the resDrStatus script runs. The resDrStatus script outputs the current postgres and resilient-filesync service status.

The /var/log/res-dr-status.log file contains all entries generated by the Disaster Recovery monitor.

The health monitor checks that database replication is running and generates a Syslog message if the replication delay exceeds the configured threshold, which is specified in megabytes. You set the threshold value in the vault file, as described in Step 5: Creating Ansible vault files.

The health monitor also checks the status of the resilient-filesync service and generates a Syslog message if the service is not running or if the delay is greater than the specified threshold, which you configure in the vault file as a number of seconds.

If the receiver system is not running, the master system saves the transactions that are running. When the size of the saved transactions reaches the value specified in the vault_vars_dr_monitor_postgres_retained_threshold variable in the vault file, a warn message is generated to alert you that the value has been reached.