Self-monitoring Quick Start Guide

About

Self-monitoring involves monitoring SevOne appliances through SevOne NMS, just as you would monitor any of your other devices. Monitoring your appliances helps you detect potential problems and address them immediately. It is also useful for diagnosing existing problems, especially when their source is unknown. Sometimes it is not immediately clear what kind of problem you are dealing with: it can be a hardware, software, or network-level issue. With self-monitoring, you can pinpoint the cause and resolve problems quickly, preventing downtime.

Note: Self-monitoring is a series of SevOne-created API scripts that allow monitoring of core SevOne functions such as:
Function     Description
MySQLMon     Monitors MySQL database performance. By default, both the config and data databases are monitored.
SevOneMon    Monitors SevOne internal data such as utilization, flow load, etc.
PolldMon     Monitors SevOne polling daemon performance.

You can use the metrics from these functions to create reports, and to create alerts that fire when a change in the state of the self-monitoring indicators may indicate a problem with one of the SevOne peers.

Important: Starting with SevOne NMS 6.7.0, MySQL has been replaced by MariaDB 10.6.12.
Note: In this guide, any reference to master, whether in the text, in a CLI command, or in command output, means leader. Likewise, any reference to slave means follower.

Prerequisites

  • root access to the SevOne NMS appliance using SSH
  • IP address of the SevOne NMS appliance
  • SSH client, such as PuTTY
  • SevOneStats login enabled prior to installing self-monitoring
  • The self-monitoring install script, install.sh. The script ships with the appliance and is located in /usr/local/scripts/utilities/plugins/selfmon. Please contact SevOne Support for the latest version of the script.

Prepare Appliance for Self-Monitoring

To prepare your appliance(s) for self-monitoring, execute the steps in the following sections.

Enable SevOneStats User & Change Password

By default, the SevOneStats user account is disabled, so you must enable it. Once it is enabled, change the SevOneStats password, because keeping the default password is a security risk.

Important: Do not change the SevOneStats password after self-monitoring is enabled. Changing it while self-monitoring is in progress can cause the self-monitoring process to stop.

Warning: The SevOneStats user must have both the Administrator and SevOneStats roles assigned; otherwise, self-monitoring will break.

Execute the following steps to enable the SevOneStats user account and change its default password.

  1. From the navigation bar, click Administration and select Access Configuration, then User Manager.

  2. In the Search text field, type SevOneStats.
  3. Select the check box for SevOneStats and click the wrench icon to display the Edit User pop-up.

  4. In the Email field, enter an email address. An email address is required to save the changes you make in the following steps.
  5. Under Credentials, locate the Password text field and enter a new password.
  6. In the Confirm text field, reenter the password.
  7. At the bottom of the pop-up, select the User Enabled check box to enable the SevOneStats user account.
  8. If the Force password change on next login check box is selected, clear it. This ensures that a password change is not required at the next login.
  9. Select the Custom Inactive Timeout check box to allow the user to stay logged on during periods of inactivity for the number of minutes you enter in the Custom Inactive Timeout field. This setting overrides the Inactivity Timeout setting on the Cluster Manager > Cluster Settings tab. Leave the check box cleared to have the user logged off after the inactivity timeout set in the Cluster Manager. The user must log out and then log back on for this setting to take effect.
  10. Select the Custom Hard Timeout Setting check box to apply the customized Hard Timeout setting defined on the Cluster Manager > Cluster Settings tab > Security subtab to the user you are adding or editing.
  11. To customize the hard timeout value for the user you are adding or editing, select the Custom Hard Timeout Value check box, then enter in the value field the number of minutes the user can remain logged on before SevOne NMS automatically logs them out. The default value is 15 minutes, and the value can range from 5 minutes to 86400 minutes (60 days). When Custom Hard Timeout Value is enabled, this value is used for the user instead of the Hard Timeout value set on the Cluster Manager > Cluster Settings tab > Security subtab.
  12. To prevent the password from expiring, select the check box for The password for this user will never expire.
  13. Click Save.
    Note: Verify credentials / Check permissions

    Execute the following commands to verify the credentials.

    
    $ podman --url=unix:/run/podman_sevone/podman.sock exec -it nms-nms-nms /bin/bash
    
    $ /usr/local/scripts/utilities/plugins/selfmon/testApiUser.php -u SevOneStats -p <password>
    

    If the credentials are correct, it will return 1.

    The self-monitoring install process verifies that the SevOneStats user has the following permissions.

    1. Can view reports
    2. Can view alerts
    3. Can create devices

    If SevOneStats has more or fewer permissions than these, the installer stops and warns about improper permission settings. To verify or modify the permissions, from the navigation bar, click Administration and select Access Configuration > User Manager. Search for the SevOneStats user and set the permissions.
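The credential check above can also be scripted so that it runs without an interactive shell in the container. This is a minimal sketch under stated assumptions: `selfmon_exec` and `verify_sevonestats` are hypothetical helper names, the container name and socket path are copied from the commands above, and it assumes that "return 1" means testApiUser.php prints 1 when the credentials are correct. Like the original command, it passes the password on the command line, where other local users may see it in the process list.

```sh
#!/bin/sh
# Hypothetical helper: run a command inside the nms-nms-nms container.
# Unlike the interactive example above, it omits -it because no TTY is needed.
selfmon_exec() {
    podman --url=unix:/run/podman_sevone/podman.sock exec nms-nms-nms "$@"
}

# Hypothetical check: succeeds (exit 0) only if testApiUser.php prints 1.
# Assumption: "return 1" in this guide means the script prints 1.
verify_sevonestats() {
    result=$(selfmon_exec /usr/local/scripts/utilities/plugins/selfmon/testApiUser.php \
        -u SevOneStats -p "$1")
    # Strip surrounding whitespace and newlines before comparing.
    [ "$(printf '%s' "$result" | tr -d '[:space:]')" = "1" ]
}

# Usage (on the appliance):
#   verify_sevonestats 'new-password' && echo "credentials OK"
```

Running the command directly through `podman exec`, rather than opening a bash session first, makes the check usable from other scripts.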

Add SevOne Appliance to Device Manager

To monitor your SevOne appliance or any other device, you need to add it to the Device Manager.

  1. Choose the appliance that will monitor your appliance; you will work from that appliance. For example, if you want Appliance A to be monitored and have decided that Appliance B will do the monitoring, you will work from Appliance B.
  2. Log into SevOne NMS on the monitoring appliance (Appliance B in this example).
  3. From the navigation bar, click Devices and select Device Manager.

  4. On the right, under Devices, click Add Device.

  5. At the top of the New Device page, in the Name field, enter the name of the appliance that you want to monitor.
  6. In the Alternate Name field, enter an alternate name for the appliance. Users can search for the appliance by this name.
  7. In the Description field, enter a description of the appliance. You can use this to provide additional information about the appliance, such as location, etc.
  8. In the IP Address field, enter the IP address of the appliance.
  9. The Allow Deletion check box appears only to the admin user and is selected by default. When selected, it allows users to delete the device. To prevent users from deleting the appliance as a device, clear the check box.
  10. Below that, click the Device Groups drop-down and select All Device Groups.
  11. Configure other settings as needed. For additional details, please refer to SevOne NMS User Guide > sections New Device / Edit Device for adding / editing devices respectively.
    Important: If self-monitoring is installed before you manually add devices in SevOne NMS, those devices are added automatically as part of the self-monitoring installation.
  12. Click Save As New.
  13. A message at the top of the page informs you that the device has been queued for discovery.

Enable Self-monitoring on Appliance

Execute the following steps to enable self-monitoring on the appliance you want to monitor.

Important: Do not change the SevOneStats password once self-monitoring is enabled, as doing so can cause the self-monitoring process to stop.
  1. Using ssh, log into the appliance that you want to enable self-monitoring on.
  2. Run the install.sh script and follow the prompts.
    Note: Prior to installing self-monitoring, ensure that the appliance is in its final state, peered into the correct cluster, and has an IP address assigned to it.
    
    $ podman --url=unix:/run/podman_sevone/podman.sock exec -it nms-nms-nms /bin/bash
    
    $ /usr/local/scripts/utilities/plugins/selfmon/install.sh
    1. You will see a couple of warnings followed by the text Press Ctrl-C to abort or any key to continue... Hit Enter to proceed.
    2. You will see one more message, Press Ctrl-C to abort or any key if you are sure... Hit Enter again.
    3. You will be prompted to Enter the password for the API user SevOneStats.
      Note: You will now see some information about your appliance, such as the number of peers, whether the appliance is part of an HSA pair, and the IP address used for it. After this, the installation process starts. You will see a long list of information, especially about objects and indicator types. Once it is done, you will receive confirmation that the installation is complete.
  3. Validate that self-monitoring has been installed by checking /etc/cron.d/selfmon for lines similar to the following. The --device-id value in your entries may differ.
    
    $ grep selfmon /etc/cron.d/selfmon
    
    */5 * * * * root /usr/local/scripts/utilities/plugins/selfmon/SevOneMon/sevone.deferreddata.poll.sevone --api 127.0.0.1 --device-id "1" --object 'SevOne Statistics' >> /var/log/SevOne/SevOneMon.log 2>&1
    */5 * * * * root /usr/local/scripts/utilities/plugins/selfmon/MySQLMon/sevone.deferreddata.poll.mysql --api 127.0.0.1 --device-id "1" --database 127.0.0.1:3307 --object 'MySQL Config Database' --database-profile "config" >> /var/log/SevOne/MySQLMon.log 2>&1
    */5 * * * * root /usr/local/scripts/utilities/plugins/selfmon/MySQLMon/sevone.deferreddata.poll.mysql --api 127.0.0.1 --device-id "1" --database 127.0.0.1:3306 --object 'MySQL Data Database' --database-profile "data" >> /var/log/SevOne/MySQLMon.log 2>&1
    */5 * * * * root php /usr/local/scripts/utilities/plugins/selfmon/PolldMon/deferreddata.polld.php --api 127.0.0.1 --device-id "1" --polld-object 'SevOne-polld performance' --highpolld-object 'SevOne-highpolld performance' >> /var/log/SevOne/PolldMon.log 2>&1
    */5 * * * * root php /usr/local/scripts/utilities/plugins/selfmon/RedisMon/sevone.deferreddata.poll.redis --api 127.0.0.1 --device-id "1" --object 'Redis Instance' >> /var/log/SevOne/RedisMon.log 2>&1
    
    Important: As a best practice, a self-monitored appliance that has been added as a device should not be polled by itself; that is, the appliance on which self-monitoring is installed should not be the polling peer for that device. For additional details, please refer to SevOne NMS System Administration Guide > section Device Mover.
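Instead of reading the grep output by eye, a small script can confirm that each of the five monitors shown in the cron entries is present. This is a sketch: `check_selfmon_cron` is a hypothetical name, and it only looks for the script paths and database profiles that appear in the example lines, not for a specific --device-id, because that value may differ on your appliance.

```sh
#!/bin/sh
# Hypothetical check: report any self-monitoring cron entry that is missing.
# $1: cron file to inspect (defaults to /etc/cron.d/selfmon, as in the grep above).
check_selfmon_cron() {
    cron_file=${1:-/etc/cron.d/selfmon}
    missing=0
    for pattern in \
        'SevOneMon/sevone.deferreddata.poll.sevone' \
        '--database-profile "config"' \
        '--database-profile "data"' \
        'PolldMon/deferreddata.polld.php' \
        'RedisMon/sevone.deferreddata.poll.redis'
    do
        # -F: match a fixed string; -- keeps patterns starting with "--" from being read as options.
        if ! grep -qF -- "$pattern" "$cron_file"; then
            echo "missing: $pattern"
            missing=1
        fi
    done
    return "$missing"
}

# Usage: check_selfmon_cron && echo "all self-monitoring entries present"
```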

Create Object Rules

By creating object rules, you can make sure you are monitoring the objects you want to monitor on your SevOne appliance. You can also use object rules to exclude the objects that you do not want to monitor. Please refer to SevOne NMS System Administration Guide > section Object Rules for details on creating object rules.

What to Monitor

This section describes what to monitor on components such as CPU, disk, memory, processes, and interfaces, including the ideal ranges, signs of trouble, and some of the available indicators.

CPU

The main focus is on the total aggregate CPU usage. Looking at the sum of the CPU cores rather than at individual cores provides an overview of the CPU activity. If there is a problem with the total aggregate usage, it may be a good idea to start looking at individual cores to determine whether it is related to specific cores or all cores. This helps in determining whether the problem is related to a particular process or to the entire system.

Ideally, CPU usage will be in the following ranges.

  • Idle time >= 50% - most of the time, the CPU should not be doing much.
  • Waiting time <= 10% - if the waiting time is consistently above 10%, the system is spending too much time waiting and must be investigated.
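To cross-check these ranges directly on an appliance, you can sample the aggregate `cpu` line of the Linux /proc/stat file twice and compare the differences. The sketch below is only an illustration and does not show how SevOne NMS computes its CPU indicators. `cpu_idle_wait` is a hypothetical name. It treats "waiting time" as the kernel's iowait counter, which is an assumption, and it applies the 50% idle and 10% wait limits listed above.

```sh
#!/bin/sh
# Hypothetical helper: aggregate idle and I/O-wait percentages between two
# samples of the "cpu" line from /proc/stat.
# Fields after "cpu": user nice system idle iowait irq softirq steal ...
cpu_idle_wait() {
    awk -v a="$1" -v b="$2" 'BEGIN {
        split(a, x); split(b, y)
        # Sum user..steal (fields 2-9); guest time is already counted in user.
        for (i = 2; i <= 9; i++) total += y[i] - x[i]
        if (total <= 0) exit 1
        idle = 100 * (y[5] - x[5]) / total
        wait = 100 * (y[6] - x[6]) / total
        status = (idle >= 50 && wait <= 10) ? "OK" : "CHECK"
        printf "idle=%.1f wait=%.1f %s\n", idle, wait, status
    }'
}

# Usage:
#   s1=$(grep '^cpu ' /proc/stat); sleep 5; s2=$(grep '^cpu ' /proc/stat)
#   cpu_idle_wait "$s1" "$s2"
```

A single reading can be misleading, so run the check repeatedly. The guidance above is about values that are *consistently* out of range.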

Disk

When looking at disks, the two main areas of focus are free disk space and input/output (I/O). The typical SevOne appliance has three main disk components.

  • sda
    • / - contains the entire operating system, including libraries, executables, etc.
    • /index - used by the database for indexing activities. It allows read/write requests to be distributed across another disk to improve performance.
  • sdb
    • /data - contains large amounts of data (covering long time spans), flow data, etc.
  • fioa
    • /ioDrive - optional Fusion-io SSD, included with higher-capacity appliances.

The following are the ideal amounts of free disk space.

  • /, /ioDrive >= 20% free
  • /data >= 10% free - if this is too full, the database can freeze up.
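The free-space minimums above can be checked against `df -P` output, whose fifth column (Capacity) is the percentage of the file system in use. This is a sketch: `check_disk_free` is a hypothetical name, free space is approximated as 100 minus that used percentage, and only the mount points listed above are checked.

```sh
#!/bin/sh
# Hypothetical check: read `df -P` output on stdin and compare free space on
# /, /ioDrive (minimum 20% free) and /data (minimum 10% free).
check_disk_free() {
    awk 'NR > 1 {
        mount = $6
        used = $5; sub(/%/, "", used)       # Capacity column, e.g. "85%" -> 85
        if (mount == "/" || mount == "/ioDrive") min = 20
        else if (mount == "/data") min = 10
        else next                           # not one of the mounts listed above
        free = 100 - used
        printf "%s free=%d%% min=%d%% %s\n", mount, free, min, (free >= min ? "OK" : "LOW")
    }'
}

# Usage: df -P | check_disk_free
```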

SNMP reports I/O statistics for entire disks as well as the individual partitions of each disk. For self-monitoring purposes, the entire disk is of interest. The following are three I/O states.

  • I/O is low - there is not much activity at the moment, so there is nothing to worry about.
  • I/O is up and down - some reads/writes are in progress; this is normal work and no cause for concern.
  • I/O is high - if it is consistently high, the disk is constantly reading and writing. This is a red flag and requires a closer look.

    Note: One possible cause of high I/O is hot standby synchronization, which involves copying lots of data from one appliance to another. However, it is always good to check into the exact cause of high I/O.
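One way to tell whether I/O is consistently high is to sample a whole disk's line in /proc/diskstats twice. You can then compute how much of the interval the disk spent doing I/O, which is the quantity `iostat` reports as %util. This is a hedged sketch, not a SevOne indicator. `disk_util` is a hypothetical name. It relies on the Linux field layout: field 3 is the device name, field 6 is sectors read, field 10 is sectors written, and field 13 is milliseconds spent doing I/O. In line with the advice above, it is meant for whole disks such as sda or sdb, not partitions.

```sh
#!/bin/sh
# Hypothetical helper: busy percentage and sector counts for one disk,
# from two /proc/diskstats lines ($1, $2) taken $3 seconds apart.
disk_util() {
    awk -v a="$1" -v b="$2" -v s="$3" 'BEGIN {
        split(a, x); split(b, y)
        if (s <= 0) exit 1
        # Field 13: milliseconds spent doing I/O; compare with the interval in ms.
        util = 100 * (y[13] - x[13]) / (s * 1000)
        if (util > 100) util = 100          # clamp rounding/timing noise
        printf "%s util=%.1f%% read_sectors=%d write_sectors=%d\n", \
            y[3], util, y[6] - x[6], y[10] - x[10]
    }'
}

# Usage:
#   s1=$(awk '$3 == "sda"' /proc/diskstats); sleep 5; s2=$(awk '$3 == "sda"' /proc/diskstats)
#   disk_util "$s1" "$s2" 5
```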

Memory

High memory usage is normal on SevOne appliances. Depending on how much RAM an appliance has, memory usage may be close to 100%. On appliances with more RAM, memory usage will still be high (due to Linux kernel caching) but not close to 100%. For example, if you write to a file and there is free space in RAM, the kernel keeps the file's contents in RAM. When you then read from or write to the file, you are reading from or writing to memory, which makes the operation fast. The kernel writes the changes back to the disk in the background.

SevOne appliances ship with a lot of memory, so there is almost never a need to swap. If swapping does occur, swap usage should not exceed 1 GB, and it is important to find out why memory is being swapped. The possible causes include:

  • Memory leak
  • A poorly designed script
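You can check the 1 GB swap limit using /proc/meminfo, which reports SwapTotal and SwapFree in kB. In this sketch, `check_swap` is a hypothetical name, and 1 GB is taken as 1048576 kB, which is an assumption. The function only flags swap use and does not identify the cause.

```sh
#!/bin/sh
# Hypothetical check: read /proc/meminfo-format text on stdin and flag
# swap use above 1 GB (taken here as 1048576 kB). Exits 1 when over the limit.
check_swap() {
    awk '/^SwapTotal:/ { total = $2 }
         /^SwapFree:/  { free = $2 }
         END {
             used = total - free
             over = (used > 1048576)
             printf "swap_used_kB=%d %s\n", used, (over ? "OVER_1GB" : "OK")
             exit over
         }'
}

# Usage: check_swap < /proc/meminfo
```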

Used Memory

SNMP statistics can be used to determine the amount of used memory. Along with the Used Memory indicator, you may also include the Cached and Buffers indicators to get a more accurate picture of memory usage, since some of the memory is used for caching.

[Figure: Used Memory]

[Figure: Used Memory + Buffers + Cached]
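The two views in the figures can be approximated locally from /proc/meminfo. In this sketch, `mem_used` is a hypothetical name. It computes "used" as MemTotal - MemFree - Buffers - Cached, and "used + buffers + cached" as MemTotal - MemFree. These formulas are assumptions about how the indicators relate, not SevOne's documented calculation.

```sh
#!/bin/sh
# Hypothetical helper: read /proc/meminfo-format text on stdin and print
# memory in use with and without buffers and page cache (values in kB).
mem_used() {
    awk '/^MemTotal:/ { total = $2 }
         /^MemFree:/  { free = $2 }
         /^Buffers:/  { buffers = $2 }
         /^Cached:/   { cached = $2 }   # anchored, so SwapCached is not matched
         END {
             printf "used_kB=%d used_buffers_cached_kB=%d\n",
                    total - free - buffers - cached, total - free
         }'
}

# Usage: mem_used < /proc/meminfo
```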

Processes

All processes with SevOne in their name must be monitored. You can monitor processes using the Process Poller and SNMP.

Process Poller

The Process Poller provides statistics for processes related to the following.

  • Apache
  • Bash
  • CRON scheduler
  • MySQL
  • PHP
  • xStats
  • etc.

Please refer to Process Poller monitoring information in SevOne NMS for a list of processes that are identified and monitored for the following information.

  • Availability
  • CPU Time
  • Instances of the process
  • System memory used - this refers to actual memory and does not include virtual memory.
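For a quick local view of the "instances of the process" statistic, you can count running processes whose program name contains SevOne from `ps` output. This is a sketch with a hypothetical function name, `count_sevone_procs`, and it does not replace the Process Poller. It reads `ps -eo args=` rather than the comm column because Linux truncates command names to 15 characters, so SevOne-dispatchd would appear as SevOne-dispatch.

```sh
#!/bin/sh
# Hypothetical helper: read `ps -eo args=` output on stdin and count the
# running instances of each program whose name contains "SevOne".
count_sevone_procs() {
    awk '{
        n = split($1, part, "/")        # strip the directory from the program path
        name = (n > 0) ? part[n] : ""
        if (name ~ /SevOne/) count[name]++
    }
    END { for (p in count) print p, count[p] }' | sort
}

# Usage: ps -eo args= | count_sevone_procs
```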

SNMP

All major SevOne NMS processes connect to the database and issue queries. Every SevOne daemon exports SNMP information about its database use. You can monitor the following daemons with SNMP.

  • SevOne-datad
  • SevOne-dispatchd
  • SevOne-polld
  • SevOne-requestd - provides percentage of requestd availability for each peer. An alert is raised when the availability is < 100%.
  • SevOne-trapd
  • and others...
Note: The exact list of processes depends on the type of appliance being monitored, for example, PAS, HSA, or DNC.

SNMP indicators provide several database statistics for processes. The following are a few to keep a watch on.

  • Query counts - any sudden changes in query counts may indicate a problem.
  • Query errors - query errors may be the result of a schema or a code issue.
  • Number of connections.
  • Number of reconnections.
  • Number of traps received, processed, etc. (applies to SevOne-trapd). Make sure that the number of traps received is not much different from the number of traps processed; the two numbers should be close to each other (within 100, for example). If the difference between traps received and traps processed is large, you may be receiving more traps than can be handled.
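The received-versus-processed comparison can be expressed as a simple check. In this sketch, `check_trap_backlog` is a hypothetical name. Its inputs are the SevOne-trapd received and processed trap counts read at the same time; how you collect them is left open. The default tolerance of 100 is the example value from the text.

```sh
#!/bin/sh
# Hypothetical check: compare traps received ($1) with traps processed ($2).
# $3 is the allowed difference; it defaults to 100, the example value above.
check_trap_backlog() {
    diff=$(( $1 - $2 ))
    if [ "$diff" -lt 0 ]; then
        diff=$(( 0 - $diff ))
    fi
    if [ "$diff" -gt "${3:-100}" ]; then
        echo "BACKLOG diff=$diff"
        return 1
    fi
    echo "OK diff=$diff"
}

# Usage: check_trap_backlog 5000 4950   # prints "OK diff=50"
```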

Hot Standby Appliance

A hot standby appliance (HSA) consists of an active peer and a passive peer. The active peer pulls configuration data from the Cluster Leader (in the case of a cluster) and polls data from objects and indicators on the devices that are assigned to it. The passive peer maintains a redundant copy of all configuration data and poll data. If you have an HSA, make sure that:

  • MySQL replication is working. If Availability drops to 0%, then there may be an issue with the MySQL daemon, mysqld.
  • the poller is running on the active appliance and not the passive appliance.
  • there is a lot of database activity on the active appliance but not the passive appliance.

For HSA monitoring, SevOne recommends using the Deferred Data plugin. Select the SevOne Statistics object (for example, when using Instant Graphs) and the following indicators:

  • Is master - indicates whether a peer is the leader.
  • Is primary - indicates whether a peer is the primary.

If the passive peer of an HSA is reporting an Is master value of 1, this indicates that a switchover has occurred. If both the active peer and the passive peer are reporting an Is master value of 1, then a split-brain condition exists.
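The rules in the previous paragraph can be written as a small decision function. In this sketch, `hsa_state` is a hypothetical name. Its arguments are the Is master values (0 or 1) reported by the peer that was originally active and by the peer that was originally passive. The "no-leader" result, for the case where neither peer reports 1, is an addition not covered by the text above; treat it as a prompt to investigate.

```sh
#!/bin/sh
# Hypothetical helper: classify an HSA pair from the "Is master" values of
# the originally active peer ($1) and the originally passive peer ($2).
hsa_state() {
    if [ "$1" = 1 ] && [ "$2" = 1 ]; then
        echo "split-brain"      # both peers report Is master = 1
    elif [ "$2" = 1 ]; then
        echo "switchover"       # the passive peer reports Is master = 1
    elif [ "$1" = 1 ]; then
        echo "normal"
    else
        echo "no-leader"        # not covered by the guide; investigate
    fi
}
```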

Interface

Because SevOne NMS monitors network devices, your SevOne appliance requires some bandwidth, but not much. The normal range for a single appliance is between 1% and 3% bandwidth utilization. For an HSA, expect two to four times as much bandwidth to be used. If bandwidth usage is consistently high, this could indicate a problem, and you will need to investigate its cause.
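To compare an interface against the 1% to 3% range, you can convert the HC In Octets and HC Out Octets counters into a utilization percentage. Multiply the octets by 8 to get bits, then divide by the interval length and the link speed. In this sketch, `link_util` is a hypothetical name, the inputs are counter differences between two samples, and the link speed is assumed to be in bits per second.

```sh
#!/bin/sh
# Hypothetical helper: interface utilization from octet-counter deltas.
# $1: HC In Octets delta, $2: HC Out Octets delta,
# $3: seconds between the two samples, $4: link speed in bits per second.
link_util() {
    awk -v in_oct="$1" -v out_oct="$2" -v secs="$3" -v bps="$4" 'BEGIN {
        if (secs <= 0 || bps <= 0) exit 1
        capacity = secs * bps                 # bits the link could carry in the interval
        printf "in=%.2f%% out=%.2f%%\n", \
            100 * in_oct * 8 / capacity, 100 * out_oct * 8 / capacity
    }'
}

# Usage: link_util 1125000000 375000000 300 1000000000   # 1 Gbps link, 5-minute interval
```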

With an HSA, you will see peaks in traffic every two hours. This happens because data is being transferred from short-term storage to long-term storage, which results in additional MySQL replication.

The graph below displays bandwidth information using the indicators HC In Octets and HC Out Octets.


[Figure: Interface bandwidth using HC In Octets and HC Out Octets]

Ready for Self-monitoring

Create Policy

Policies allow you to receive alerts when the conditions specified in the policy are triggered. Policies let you define thresholds for device groups and object groups. Please refer to SevOne NMS User Guide > section Policy Browser for details on creating / editing a policy.

The Recommended Policies & Thresholds section lists the policies recommended for SevOne self-monitoring. Create a policy.

Create Threshold

Policies apply to entire device groups or object groups. Thresholds can be used to receive alerts when conditions are triggered for a specific device. Please refer to SevOne NMS User Guide > section Threshold Browser for details on creating / editing a threshold.

The Recommended Policies & Thresholds section lists the recommended thresholds. Create a threshold.

Alerts

To access the Alerts page, from the navigation bar, click Events and select Alerts. The Alerts page allows you to look at the current, active alerts in the system. These include threshold violations, trap notifications, and web site errors defined in the Policy Browser or the Threshold Browser.

Note: The Alerts page allows you to add filters to narrow the displayed results. After adding filters, click Apply Filter to apply them.

Filters are optional.

Please refer to SevOne NMS User Guide > section Alerts for details.

Instant Graphs

To access the Instant Graphs page, from the navigation bar, click Reports and select Instant Graphs. You can create statistical graphs for the objects and indicators on devices. Instant graphs are quick and easy to set up and allow you to look at potential problem areas immediately.

Uninstall Self-monitoring

There are a handful of situations in which it may become necessary to uninstall the self-monitoring components. Removal and re-installation is the most efficient way to update self-monitoring after an IP address change, a SevOneStats user password change, etc. Back up the root user's crontab before executing the following commands, because the uninstall script modifies the crontab.

  1. Back up the root user's crontab.
    $ crontab -l > /root/root_crontab_backup-`date +%s`.txt
  2. Change directory to /usr/local/scripts/utilities/plugins/selfmon.
    $ cd /usr/local/scripts/utilities/plugins/selfmon
  3. Run the uninstall.sh script and follow the prompts, or run it in a single line as shown below, which answers the prompts automatically.
    $ (echo -e '\n') | /usr/local/scripts/utilities/plugins/selfmon/uninstall.sh
  4. Validate that the root user's crontab no longer contains lines such as the following.
    
    */5 * * * * /usr/local/scripts/utilities/plugins/selfmon/SevOneMon/sevone.deferreddata.poll.sevone --api 127.0.0.1 --device-id "2" --object 'SevOne Statistics' >> /var/SevOne/SevOneMon.log 2>&1
    */5 * * * * /usr/local/scripts/utilities/plugins/selfmon/MySQLMon/sevone.deferreddata.poll.mysql --api 127.0.0.1 --device-id "2" --database 127.0.0.1:3307 --object 'MySQL Config Database' --database-profile "config" >> /var/SevOne/MySQLMon.log 2>&1
    */5 * * * * /usr/local/scripts/utilities/plugins/selfmon/MySQLMon/sevone.deferreddata.poll.mysql --api 127.0.0.1 --device-id "2" --database 127.0.0.1:3306 --object 'MySQL Data Database' --database-profile "data" >> /var/SevOne/MySQLMon.log 2>&1
    */5 * * * * php /usr/local/scripts/utilities/plugins/selfmon/PolldMon/deferreddata.polld.php --api 127.0.0.1 --device-id "2" --polld-object 'SevOne-polld performance' --highpolld-object 'SevOne-highpolld performance' >> /var/SevOne/PolldMon.log 2>&1
    */5 * * * * php /usr/local/scripts/utilities/plugins/selfmon/RedisMon/sevone.deferreddata.poll.redis --api 127.0.0.1 --device-id "2" --object 'Redis Instance' >> /var/SevOne/RedisMon.log 2>&1
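You can script the final validation step against the current crontab. In this sketch, `assert_selfmon_removed` is a hypothetical name. It reads crontab text on standard input and fails if any line still references the plugins/selfmon directory that all of the entries above use.

```sh
#!/bin/sh
# Hypothetical check: read crontab text on stdin and fail if any
# self-monitoring entry (a path under plugins/selfmon/) is still present.
assert_selfmon_removed() {
    if grep -q 'plugins/selfmon/'; then
        echo "self-monitoring entries still present"
        return 1
    fi
    echo "no self-monitoring entries found"
}

# Usage: crontab -l | assert_selfmon_removed
```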

Troubleshooting

Changed SevOneStats password after enabling Self-monitoring. Now, it is not working

SevOne strongly advises against changing the password for the API user SevOneStats after self-monitoring is enabled. However, if you have changed the password while self-monitoring is in progress and are experiencing problems, please contact the SevOne Support team for assistance.