Self-monitoring Quick Start Guide
About
Self-monitoring involves monitoring SevOne appliances through SevOne NMS, just as you would monitor any of your other devices. Monitoring your appliances helps you detect potential problems and address them immediately. It is also useful for diagnosing existing problems, especially when you do not know the source. Sometimes it is not immediately clear what kind of problem you are dealing with; it can be a hardware, software, or network-level issue. With self-monitoring, you can pinpoint the cause and resolve problems quickly, preventing downtime.
Function | Description |
---|---|
MySQLMon | Monitors MySQL database performance. By default, both config and data databases are configured for this. |
SevOneMon | Monitors SevOne internal data such as utilization, flow load, etc. |
PolldMon | Monitors SevOne polling daemon performance. |
The metrics from these functions can be used to create reports and alerts when the state of the self-monitoring indicators changes and may indicate a problem with one of the SevOne peers.
Note: In this guide, any reference to master, including in a CLI command or in its output, means leader. Likewise, any reference to slave means follower.
Prerequisites
- root access to SevOne NMS appliance using ssh
- IP address of SevOne NMS appliance
- SSH client, such as PuTTY
- Ensure that SevOneStats login is enabled prior to installing self-monitoring
- The self-monitoring install script, install.sh. The copy that ships with the appliance is located in /usr/local/scripts/utilities/plugins/selfmon. Please contact SevOne Support for the latest version of the script.
Prepare Appliance for Self-Monitoring
To prepare your appliance(s) for self-monitoring, execute the steps in the following sections.
Enable SevOneStats User & Change Password
By default, the SevOneStats user account is disabled; enable it. Once enabled, change the SevOneStats password, as leaving the default password in place is a security risk.
Do not change the SevOneStats password after self-monitoring is enabled. Doing so can cause the self-monitoring process to stop.
Execute the following steps to enable SevOneStats user account and to change its default password.
- From the navigation bar, click Administration and select Access Configuration, then User Manager.
- In the Search text field, type SevOneStats.
- Select the check box for SevOneStats and click to display the Edit User pop-up.
- In the Email field, enter an email address. An email address is required here in order to save the changes you will be making in the next steps.
- Under Credentials, locate the Password text field and enter a new password.
- In the Confirm text field, reenter the password.
- At the bottom of the pop-up, select the User Enabled check box to enable the SevOneStats user account.
- If the Force password change on next login check box is selected, clear it. This is important because it will ensure that a password change is not required on the next login.
- Select the Custom Inactive Timeout check box to enable the user to stay logged on during periods of inactivity for the number of minutes you enter in the Custom Inactive Timeout field. This setting overrides the Inactivity Timeout setting you enter on the Cluster Manager > Cluster Settings tab. Leave clear to have the user log off after the amount of time you enter on the Cluster Manager. The user must log out and then log back on for this setting to take effect.
- Select the Custom Hard Timeout Setting check box to enable and use customized Hard Timeout setting (for the user you are adding or editing) you define on the Cluster Manager > Cluster Settings tab > Security subtab.
- To customize the hard timeout value for a user, select the Custom Hard Timeout Value check box. You can then enter, in the value field, the number of minutes the user can remain logged in before SevOne NMS automatically logs them out of the application. The default value is 15 minutes. The value can range from 5 minutes to 86400 minutes (60 days). When Custom Hard Timeout Value is enabled, the timeout set in its value field is used for that user instead of the Hard Timeout value set on the Cluster Manager > Cluster Settings tab > Security subtab.
- To prevent the password from expiring, select the check box for The password for this user will never expire.
- Click Save.
Note: Verify credentials / Check permissions
Execute the following commands to verify the credentials.
$ podman --url=unix:/run/podman_sevone/podman.sock exec -it nms-nms-nms /bin/bash
$ /usr/local/scripts/utilities/plugins/selfmon/testApiUser.php -u SevOneStats -p <password>
If the credentials are correct, it will return 1.
The Self-monitoring install process verifies that SevOneStats ID has the following permissions.
- Can view reports
- Can view alerts
- Can create devices
If SevOneStats has more or fewer permissions set, the installer will stop and warn about improper permission settings. To verify or modify the permissions, from the navigation bar, click Administration and select Access Configuration > User Manager. Search for the SevOneStats user and set the permissions.
Add SevOne Appliance to Device Manager
To monitor your SevOne appliance or any other device, you need to add it to Device Manager.
- Select the appliance that will do the monitoring. You will be working from that appliance. For example, if you want Appliance A to be monitored and you have decided that Appliance B will do the monitoring, you need to work from Appliance B.
- Log into SevOne NMS on the appliance that will do the monitoring, i.e., Appliance B.
- From the navigation bar, click Devices and select Device Manager.
- On the right, under Devices, click Add Device.
- At the top of the New Device page, in the Name field, enter the name of the appliance that you want to monitor.
- In the Alternate Name field, enter an alternate name for the appliance. Users can search for the appliance by this name.
- In the Description field, enter a description of the appliance. You can use this to provide additional information about the appliance, such as location, etc.
- In the IP Address field, enter the IP address of the appliance.
- The Allow Deletion check box appears to the admin user only and is selected by default. When selected, it enables users to delete the device. If you would like to prevent users from deleting the appliance as a device, clear the check box.
- Below that, click the Device Groups drop-down and select All Device Groups.
- Configure other settings as needed. For additional details, please refer to SevOne NMS User Guide > sections New Device / Edit Device for adding / editing devices respectively.
Important: If Self-monitoring is installed prior to manually adding devices in SevOne NMS, those devices will get added automatically as a result of installing self-monitoring.
- Click Save As New.
- You will see a message at the top of the page to inform you that the device is being queued for discovery.
Enable Self-monitoring on Appliance
Execute the following steps to enable self-monitoring on the appliance you want to monitor.
- Using ssh, log into the appliance that you want to enable self-monitoring on.
- Run the install.sh script and follow the prompts.
Note: Prior to installing self-monitoring, ensure that the appliance is in its final state, peered into the correct cluster, and has an IP address assigned to it.
$ podman --url=unix:/run/podman_sevone/podman.sock exec -it nms-nms-nms /bin/bash
$ /usr/local/scripts/utilities/plugins/selfmon/install.sh
- You will see a couple of warnings followed by the text Press Ctrl-C to abort or any key to continue... Hit Enter to proceed.
- You will see one more message, Press Ctrl-C to abort or any key if you are sure... Hit Enter again.
- You will be prompted to Enter the password for the API user SevOneStats.
Note: Now you will see some information about your appliance such as, number of peers, whether the appliance is part of an HSA pair, and the IP address used for it. After this, the installation process starts. You will see a long list of information, especially about objects and indicator types. Once done, you will receive confirmation that the installation is complete.
- Validate self-monitoring has been installed by checking the root user's crontab for the following lines.
$ grep selfmon /etc/cron.d/selfmon
*/5 * * * * root /usr/local/scripts/utilities/plugins/selfmon/SevOneMon/sevone.deferreddata.poll.sevone --api 127.0.0.1 --device-id "1" --object 'SevOne Statistics' >> /var/log/SevOne/SevOneMon.log 2>&1
*/5 * * * * root /usr/local/scripts/utilities/plugins/selfmon/MySQLMon/sevone.deferreddata.poll.mysql --api 127.0.0.1 --device-id "1" --database 127.0.0.1:3307 --object 'MySQL Config Database' --database-profile "config" >> /var/log/SevOne/MySQLMon.log 2>&1
*/5 * * * * root /usr/local/scripts/utilities/plugins/selfmon/MySQLMon/sevone.deferreddata.poll.mysql --api 127.0.0.1 --device-id "1" --database 127.0.0.1:3306 --object 'MySQL Data Database' --database-profile "data" >> /var/log/SevOne/MySQLMon.log 2>&1
*/5 * * * * root php /usr/local/scripts/utilities/plugins/selfmon/PolldMon/deferreddata.polld.php --api 127.0.0.1 --device-id "1" --polld-object 'SevOne-polld performance' --highpolld-object 'SevOne-highpolld performance' >> /var/log/SevOne/PolldMon.log 2>&1
*/5 * * * * root php /usr/local/scripts/utilities/plugins/selfmon/RedisMon/sevone.deferreddata.poll.redis --api 127.0.0.1 --device-id "1" --object 'Redis Instance' >> /var/log/SevOne/RedisMon.log 2>&1
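The crontab check can also be scripted. The following is a minimal Python sketch, not part of the SevOne tooling: it scans the text of the cron file for the plugin directory names that appear in the sample output above (SevOneMon, MySQLMon, PolldMon, RedisMon) and reports any that are missing. The function names and the expected set are assumptions made for illustration; the entries on your appliance may differ by release.

```python
import re

# Plugin directories seen in the sample crontab output above (assumed complete).
EXPECTED_MONITORS = {"SevOneMon", "MySQLMon", "PolldMon", "RedisMon"}

# Captures the directory that follows ".../plugins/selfmon/" in a cron entry.
_PLUGIN_RE = re.compile(r"/plugins/selfmon/([A-Za-z0-9]+)/")


def monitors_in_crontab(crontab_text):
    """Return the set of self-monitoring plugin names scheduled in the text."""
    found = set()
    for line in crontab_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        match = _PLUGIN_RE.search(line)
        if match:
            found.add(match.group(1))
    return found


def missing_monitors(crontab_text):
    """Return expected plugin names that have no cron entry."""
    return EXPECTED_MONITORS - monitors_in_crontab(crontab_text)
```

To use it, read the contents of /etc/cron.d/selfmon into a string and pass it to missing_monitors; an empty set means every plugin from the sample output has an entry.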
Important: It is best practice that a self-monitored appliance added as a device is not polled by itself; i.e., the appliance for which the self-monitoring is installed should not be the polling peer for the configured device. For additional details, please refer to SevOne NMS System Administration Guide > section Device Mover.
Create Object Rules
By creating object rules, you can make sure you are monitoring the objects you want to monitor on your SevOne appliance. You can also use object rules to exclude the objects that you do not want to monitor. Please refer to SevOne NMS System Administration Guide > section Object Rules for details on creating object rules.
What to Monitor
This section covers the components to monitor, such as CPU, disk, and memory, along with their ideal ranges, signs of trouble, and some of the available indicators.
CPU
The main focus is on the total aggregate CPU usage. Looking at the sum of the CPU cores rather than at individual cores provides an overview of the CPU activity. If there is a problem with the total aggregate usage, it may be a good idea to start looking at individual cores to determine whether it is related to specific cores or all cores. This helps in determining whether the problem is related to a particular process or to the entire system.
Ideally, CPU usage will be in the following ranges.
- Idle time >= 50% - most of the time, the CPU should not be doing a whole lot.
- Waiting time <= 10% - if the waiting time is consistently 10% or higher, the system is spending too much time waiting and must be looked into.
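As an illustration of the ranges above, here is a small Python sketch that compares aggregate CPU idle and wait percentages against them. The function name and the idea of passing in pre-computed percentages are assumptions; how you obtain the values (for example, from a report export) is up to you.

```python
def cpu_warnings(idle_pct, wait_pct):
    """Return a list of warnings for aggregate CPU figures outside the ideal ranges."""
    warnings = []
    if idle_pct < 50.0:
        # Ideal: idle time >= 50%
        warnings.append("idle time below 50%")
    if wait_pct > 10.0:
        # Ideal: waiting time <= 10%
        warnings.append("waiting time above 10%")
    return warnings
```

A single reading outside the range is not necessarily a problem; the guidance above is about values that stay out of range consistently.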
Disk
When looking at disks, the two main areas of focus are free disk space and input/output (I/O). The typical SevOne appliance has three main disk components.
- sda
  - / - contains the entire operating system, including libraries, executables, etc.
  - /index - is used by the database for indexing activities. Allows distribution of read/write requests across another disk to improve performance.
- sdb
  - /data - contains large amounts of data (covering long time spans), flow data, etc.
- fioa
  - /ioDrive - optional Fusion-io SSD, included with higher-capacity appliances.
The following are the ideal amounts of free disk space.
- /, /ioDrive >= 20% free
- /data >= 10% free - if this is too full, the database can freeze up.
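The free-space guidelines above can be expressed as a simple check. This Python sketch is illustrative only: the helper names are hypothetical, and the thresholds are copied from the list above. It uses the standard library call shutil.disk_usage to measure a locally mounted path.

```python
import shutil

# Minimum free space per mount point, taken from the guidelines above.
MIN_FREE_PCT = {"/": 20.0, "/ioDrive": 20.0, "/data": 10.0}


def free_pct(total_bytes, free_bytes):
    """Return free space as a percentage of total capacity."""
    return 100.0 * free_bytes / total_bytes


def low_partitions(free_by_mount):
    """Return the mount points whose free percentage is below the guideline."""
    return sorted(
        mount
        for mount, pct in free_by_mount.items()
        if mount in MIN_FREE_PCT and pct < MIN_FREE_PCT[mount]
    )


def local_free_pct(mount):
    """Measure free space on a locally mounted path (hypothetical helper)."""
    usage = shutil.disk_usage(mount)
    return free_pct(usage.total, usage.free)
```

For example, low_partitions({"/": local_free_pct("/")}) returns ["/"] when the root file system has less than 20% free.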
SNMP reports I/O statistics for entire disks as well as the individual partitions of each disk. For self-monitoring purposes, the entire disk is of interest. The following are three I/O states.
- I/O is low - this means there is not much activity at the moment - so, nothing to worry about.
- I/O is up and down - this means that there is some read/write in progress - this is usual work and no cause for concern.
- I/O is high - if it is consistently high, this means that the disk is constantly reading and writing. This is a red flag and requires a closer look.
Note: One possible cause of high I/O is hot standby synchronization, which involves copying lots of data from one appliance to another. However, it is always good to check into the exact cause of high I/O.
Memory
SevOne appliances typically show high memory usage. Depending on how much RAM an appliance has, usage may be close to 100%. For appliances with more RAM, memory usage will be high (due to Linux kernel caching) but not close to 100%. For example, if you write to a file and there is free space in RAM, the kernel keeps the file contents in RAM. So, when you read from or write to the file, you are working in memory, which makes the process fast. The kernel then writes the cached data to the disk.
SevOne appliances ship with a lot of memory, so there is almost never a need to swap. If swap is used, it should not exceed 1 GB. When swapping does occur, it is important to find out why. Possible causes include:
- Memory leak
- A poorly designed script
Used Memory
SNMP statistics can be used to determine the amount of used memory. Along with the Used Memory indicator, you may also include the Cached and Buffers indicators to get a more accurate picture of memory usage, as some of the used memory is cache.
Graph: Used Memory
Graph: Used Memory + Buffers + Cached
Processes
All processes with SevOne as part of their name must be monitored. Process monitoring can be done using Process Poller and SNMP.
Process Poller
The Process Poller provides statistics for processes related to the following.
- Apache
- Bash
- CRON scheduler
- MySQL
- PHP
- xStats
- etc.
Please refer to Process Poller monitoring information in SevOne NMS for a list of processes that are identified and monitored for the following information.
- Availability
- CPU Time
- Instances of the process
- System memory used - this refers to actual memory and does not include virtual memory.
SNMP
All major SevOne NMS processes connect to the database and issue queries. Every SevOne daemon exports SNMP information about its database use. You can look at the following daemons with SNMP.
- SevOne-datad
- SevOne-dispatchd
- SevOne-polld
- SevOne-requestd - provides percentage of requestd availability for each peer. An alert is raised when the availability is < 100%.
- SevOne-trapd
- and others...
SNMP indicators provide several database statistics for processes. The following are a few to keep a watch on.
- Query counts - any sudden changes in query counts may indicate a problem.
- Query errors - query errors may be the result of a schema or a code issue.
- Number of connections.
- Number of reconnections.
- Number of traps received, processed, etc. (applies to SevOne-trapd). Make sure that the number of traps received is not much different from the number of traps processed. The numbers should be close to each other (within 100, for example). If the difference between traps received and traps processed is large, you may be receiving more traps than can be handled.
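The trap comparison in the last item can be sketched as a one-line check. This Python example is an illustration under stated assumptions: the function name is hypothetical, and the default tolerance of 100 is the example figure from the text above, not a fixed product limit.

```python
def trap_backlog(received, processed, tolerance=100):
    """Return True when traps received exceed traps processed by more than the tolerance."""
    return (received - processed) > tolerance
```

For instance, trap_backlog(5000, 4950) is False, while trap_backlog(5000, 4000) is True and suggests that traps are arriving faster than they can be handled.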
Hot Standby Appliance
A hot standby appliance (HSA) consists of an active peer and a passive peer. The active peer pulls configuration data from the Cluster Leader (in the case of a cluster) and polls data from objects and indicators on the devices that are assigned to it. The passive peer maintains a redundant copy of all configuration data and poll data. If you have an HSA, make sure that:
- MySQL replication is working. If Availability drops to 0%, then there may be an issue with the MySQL daemon, mysqld.
- the poller is running on the active appliance and not the passive appliance.
- there is a lot of database activity on the active appliance but not the passive appliance.
For HSA monitoring, SevOne recommends using the Deferred Data plugin. Select the SevOne Statistics object (for example, when using Instant Graphs) and the following indicators:
- Is master - indicates whether a peer is the leader.
- Is primary - indicates whether a peer is the primary.
If the passive peer of an HSA is reporting an Is master value of 1, this indicates that a switchover has occurred. If both the active peer and the passive peer are reporting an Is master value of 1, then a split-brain condition exists.
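The interpretation of the Is master indicator described above can be written as a small decision function. This Python sketch is illustrative: the function name and the returned labels are assumptions, and the case where neither peer reports 1 is not covered by the text above, so it is labeled as undetermined.

```python
def hsa_state(active_is_master, passive_is_master):
    """Classify an HSA pair from the Is master values of its two peers."""
    if active_is_master == 1 and passive_is_master == 1:
        return "split-brain"   # both peers report Is master = 1
    if passive_is_master == 1:
        return "switchover"    # only the passive peer reports Is master = 1
    if active_is_master == 1:
        return "normal"        # only the active peer reports Is master = 1
    return "undetermined"      # assumption: this case is not described in this guide
```

The split-brain check comes first because it is a special case of the passive peer reporting 1.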
Interface
Because SevOne NMS monitors network devices, your SevOne appliance requires some bandwidth but not much. The normal range for a single appliance falls somewhere between 1 and 3%. For an HSA, you will see anywhere from two to four times as much bandwidth used. If the amount of bandwidth used is consistently high, this could indicate a problem, and you will need to look into what is causing the high bandwidth.
With an HSA, you will see peaks in traffic every two hours. This happens because data is being transferred from short-term storage to long-term storage, which results in additional MySQL replication.
The graph below displays bandwidth information using the indicators HC In Octets and HC Out Octets.
Ready for Self-monitoring
Create Policy
Policies allow you to receive alerts when condition(s) specified in the policy are triggered. From policies, you may define thresholds for the device and object groups. Please refer to SevOne NMS User Guide > section Policy Browser for details on creating / editing a policy.
In section Recommended Policies & Thresholds, you will find policies recommended for SevOne self-monitoring. Create a policy.
Create Threshold
Policies apply to entire device groups or object groups. Thresholds can be used to receive alerts when conditions are triggered for a specific device. Please refer to SevOne NMS User Guide > section Threshold Browser for details on creating / editing a threshold.
In section Recommended Policies & Thresholds, you will find thresholds recommended. Create a threshold.
Alerts
To access the Alerts page from the navigation bar, click Events and select Alerts. The Alerts page allows you to look at the current, active alerts in the system. These include the threshold violations, trap notifications, and web site errors defined in the Policy Browser or the Threshold Browser.
Filters are optional.
Please refer to SevOne NMS User Guide > section Alerts for details.
Instant Graphs
To access the Instant Graphs page from the navigation bar, click Reports and select Instant Graphs. You may create statistical graphs for the objects and indicators on devices. Instant graphs are quick and easy to set up and allow you to immediately look at potential problem areas.
Recommended Policies for Self-monitoring
This section includes the following four tables, which break down groups of policies by the component they apply to.
- SYSTEM POLICIES
- CORE PROCESS POLICIES
- CONFIGURATION POLICIES
- XSTATS POLICIES
Each table contains a list of SevOne recommended self-monitoring policies and includes the following information for each policy.
- Policy - name and description of the policy.
- Applies to - whether the policy applies to the PAS, HSA, DNC, or any combination of these.
- Plugin - plugin used when creating the policy.
- Alert Condition - specific information for setting up the policy trigger condition(s). This includes object type, severity level and the condition specifications.
- Clear Condition - specific information for setting up the policy clear condition(s). This includes the condition specifications.
- Suggested Remediation Steps - recommended steps to take in case a condition is violated.
- Details - any additional information.
How to use the table
The Plugin, Alert Condition, and Clear Condition columns provide the specific information required for creating a policy.
Example
Policy: General Settings
Policy: Trigger Conditions
Type drop-down is set to Static by default. For most of the conditions in the table, you will not need to change the default. There are a few conditions in the table where Type is set to Baseline Delta, Baseline Percentage, etc., for example. These exceptions are noted in the Alert Condition column. Unless otherwise noted, the Type will be Static.
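To make the difference between a Static condition and a Baseline Percentage condition concrete, here is a minimal Python sketch of how a condition such as "Average CPU Time > 150% of baseline" can be read. It is only an interpretation for illustration: the function names are hypothetical, and how SevOne NMS actually computes the baseline is not described in this guide.

```python
def exceeds_static(average, threshold):
    """Static type: compare the averaged value with a fixed threshold."""
    return average > threshold


def exceeds_baseline_pct(average, baseline, percent=150.0):
    """Baseline Percentage type: compare the averaged value with a percentage of the baseline."""
    return average > baseline * percent / 100.0
```

With a baseline of 100, a condition of 150% of baseline triggers only when the averaged value rises above 150.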
Policy: Clear Conditions
Recommended Policies & Thresholds
The following tables contain recommended policies for SevOne self-monitoring and are not automatically available with Self-monitoring installation.
Example
S1_SELFMON_Disk_UtilizationRed
Note: This is a Threshold rather than a Policy.
In the tables below, each policy that begins with '_' contains the prefix S1_SELFMON. For example,
- _memory_SwapRed must be read as S1_SELFMON_memory_SwapRed
- _memory_SwapYellow must be read as S1_SELFMON_memory_SwapYellow
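The naming rule above is easy to apply programmatically, for example when generating policy names from the tables. The following Python sketch is illustrative only; the function name is an assumption.

```python
PREFIX = "S1_SELFMON"


def full_policy_name(name):
    """Expand a table entry such as '_memory_SwapRed' to its full policy name."""
    return PREFIX + name if name.startswith("_") else name
```

For example, full_policy_name("_Disk_Reads") gives "S1_SELFMON_Disk_Reads", and a name that already carries the prefix is returned unchanged.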
SYSTEM POLICIES
Policy (prefix: S1_SELFMON) | Applies To | Plugin | Alert Condition | Clear Condition | Suggested Remediation Steps | Details |
---|---|---|---|---|---|---|
_memory_SwapRed | PAS, HSA, DNC | SNMPPoller | Memory (Linux (Net-SNMP)); Severity: Alert; Average Available Swap Memory <= 20% over 15 minutes | Average Available Swap Memory > 20% over 15 minutes | See Details before executing the following command. killall -9 SevOne-requestd; podman exec -it nms-nms-nms /bin/bash; SevOne-act restart-mysql; | WARNING: This command may cause the loss of one poll cycle, and the past two hours of data may be briefly unavailable in the GUI (it will come back once the command finishes). This command will flush out any processes hogging up memory and will relieve the swap memory. |
_memory_SwapYellow | PAS, HSA, DNC | SNMPPoller | Memory (Linux (Net-SNMP)); Severity: Warning; Average Available Swap Memory <= 50% over 15 minutes | Average Available Swap Memory > 50% over 15 minutes | Execute the following command: killall -9 SevOne-requestd; | Consider running the command for SevOne_SwapRed (previous row). Otherwise, keep tabs on swap usage. |
_Disk_UtilizationYellow Note: This is a Threshold rather than a Policy. | NMS (PAS, HSA) | SNMPPoller | / [specified partition]; Severity: Warning; Average Used Disk Space >= 80% over 15 minutes. Note: For [specified partition], select the partition to monitor. Create a separate threshold for each partition that you plan to monitor. | Average Used Disk Space < 80% over 15 minutes | N/A | Review settings in Administration > Cluster Manager > Cluster Settings > Storage. Specify lower number for Data Retention and Maximum Disk Utilization if possible. |
_Disk_UtilizationRed Note: This is a Threshold rather than a Policy. | NMS (PAS, HSA) | SNMPPoller | / [specified partition]; Severity: Alert; Average Used Disk Space >= 90% over 15 minutes. Note: For [specified partition], select the partition to monitor. Create a separate threshold for each partition that you plan to monitor. | Average Used Disk Space < 90% over 15 minutes | Execute the following command: trim-longterm --emergency-purge | Use this policy for sd* objects, such as sda1, sdb, etc. This may indicate that the system is operating with a workload that is above or below its rated capacity. In case of system degradation, contact SevOne Support. |
_Disk_Reads | NMS (PAS, HSA) | SNMPPoller | Disk IO (Linux (Net-SNMP)); Severity: Warning; Note: For Type, specify Baseline Percentage. Average Number of reads > 150% of baseline over 50 minutes | Average Number of reads < 150% of baseline over 50 minutes | N/A | Use this policy for sd* objects, such as sda1, sdb, etc. This may indicate that the system is operating with a workload that is above or below its rated capacity. In case of system degradation, contact SevOne Support. |
_Disk_Writes | NMS (PAS, HSA) | SNMP Poller | Disk IO (Linux (Net-SNMP)); Severity: Warning; Note: For Type, specify Baseline Percentage. Average Number of writes > 150% of baseline over 50 minutes | Average Number of writes < 150% of baseline over 50 minutes | N/A | Use this policy for sd* objects, such as sda1, sdb, etc. This may indicate that the system is operating with a workload that is above or below its rated capacity. In case of system degradation, contact SevOne Support. |
_HSA_FAILOVER Important: For this policy, we recommend creating a device group for the HSAs and applying the policy to that device group. | NMS (PAS, HSA) | Deferred Data | SevOne Appliance; Severity: Alert; Average Is master = 1 over 5 minutes | Average Is master = 0 over 5 minutes | | |
_Memory_Available | NMS (PAS, HSA) | SNMPPoller | Physical Memory - Memory (Linux (Net-SNMP)); Severity: Alert; Indicator: Available Memory. With a trigger for Available Memory < 5% over 15 minutes and with Average Aggregation. | Available Memory > 5% over the past 15 minutes with Minimum Aggregation. This will ensure that the alert is only cleared when the condition has remediated. | | |
_Ethernet_Traffic | NMS (PAS, HSA) | SNMPPoller | Interface; Severity: Warning; Note: For Type, specify Baseline Percentage. Rule 1: Average HC In Octets > 150% of baseline over 15 minutes OR Rule 2: Average HC Out Octets > 150% of baseline over 15 minutes | Average HC In Octets < 150% of baseline over 15 minutes AND Average HC Out Octets < 150% of baseline over 15 minutes | Consider moving devices to a different appliance. Otherwise, contact SevOne Support. | |
_Ethernet_Errors | NMS (PAS, HSA) | SNMPPoller | Interface; Severity: Warning; Note: For Type, specify Baseline Percentage. Rule 1: Average In Errors > 120% of baseline over 15 minutes OR Rule 2: Average Out Errors > 120% of baseline over 15 minutes OR Rule 3: Average In Discards > 120% of baseline over 15 minutes OR Rule 4: Average Out Discards > 120% of baseline over 15 minutes | Average In Errors < 120% of baseline over 15 minutes AND Average Out Errors < 120% of baseline over 15 minutes AND Average In Discards < 120% of baseline over 15 minutes AND Average Out Discards < 120% of baseline over 15 minutes | | |
_ALL_iDRAC_ICMPReachability Note: For this policy to work, you will need to add iDRACs to the Device Manager as separate devices, for example: Device 1 - SevOne (10.10.10.1), Device 2 - SevOne-idrac (10.10.10.2) | NMS (PAS, HSA) | ICMPPoller | Ping Data; Severity: Emergency; Average Availability < 95% over 15 minutes | Average Availability >= 95% over 15 minutes | Check network connectivity to the appliance. If the network is okay, contact SevOne Support. | |
_ALL_iDRAC_SNMPReachability | NMS (PAS, HSA) | SNMPPoller | SNMP Availability; Severity: Alert; Average Availability < 95% over 15 minutes | Average Availability >= 95% over 15 minutes | If the iDRAC goes down, you will need to investigate server health, iDRAC connectivity, etc. | If iDRAC health, etc., looks good, contact SevOne Support. |
CORE PROCESS POLICIES | ||||||
---|---|---|---|---|---|---|
Policy (prefix: S1_SELFMON) |
Applies To | Plugin | Alert Condition | Clear Condition | Suggested Remediation Steps | Details |
_requestd_Availability |
PAS HSA DNC |
Process Poller | ProcessSubtype: SevOne-requestdSeverity: AlertAverage Availability < 100% over 15 minutes | Average Availability = 100% over 15 minutes |
Execute the following commands:
|
If the command fails, check port 60007 (TCP). |
_requestd_CPUTime |
PAS HSA DNC |
Process Poller |
Process Subtype: SevOne-requestdSeverity: AlertAverage CPU Time > 1000 Milliseconds over 15 minutes |
Average CPU Time < 1000 Milliseconds over 15 minutes | ||
_requestd_Memory |
PAS HSA DNC |
Process Poller |
Process Subtype: SevOne-requestdSeverity: ErrorAverage System memory used > 5 GB over 15 minutes |
Average System memory used < 5 GB over 15 minutes | N/A | Check network connectivity to the appliance. If the network is okay, contact SevOne Support. |
_apache2_Availability |
PAS HSA DNC |
Process Poller | ProcessSubtype: apache2Severity: AlertAverage Availability < 100% over 15 minutes | Average Availability = 100% over 15 minutes |
N/A |
If the command fails, check firewall access on ports 80 and 443 (TCP). |
_apache2_CPUTime |
PAS HSA DNC |
Process Poller |
ProcessSubtype: apache2Severity: AlertNote: For Type, specify Baseline Percentage. Average CPU Time > 150% of baseline over 15 minutes |
Average CPU Time < 150% of baseline over 15 minutes | ||
_apache2_Memory |
PAS HSA DNC |
Process Poller |
ProcessSubtype: apache2Severity: Alert Note: For Type, specify Baseline Percentage.Average System memory used > 150% of baseline over 15 minutes |
Average System memory used < 150% of baseline over 15 minutes | ||
_apache2_ThreadCount |
PAS HSA DNC |
Process Poller |
ProcessSubtype: apache2Severity: Alert Note: For Type, specify Baseline Percentage.Average Instances of the process > 150% of baseline over 15 minutes |
Average System memory used < 150% of baseline over 15 minutes | ||
_polld_Availability |
PAS HSA DNC |
Process Poller | ProcessSubtype: SevOne-polldSeverity: AlertAverage Availability < 100% over 15 minutes | Average Availability = 100% over 15 minutes |
Execute the following commands:
|
If the command fails, contact SevOne Support. |
_polld_CPUTime |
PAS HSA DNC |
Process Poller |
Process Subtype: SevOne-polldSeverity: Alert Note: For Type, specify Baseline Percentage. Average CPU Time > 150% of baseline over 15 minutes |
Average CPU Time < 150% of baseline over 15 minutes | ||
_polld_Memory |
PAS HSA DNC |
Process Poller |
Process Subtype: SevOne-polldSeverity: Alert Note: For Type, specify Baseline Percentage. Average System memory used > 150% of baseline over 15 minutes |
Average System memory used < 150% of baseline over 15 minutes | ||
_polld_ThreadCount |
PAS HSA DNC |
Process Poller |
ProcessSubtype: SevOne-polldSeverity: Alert Note: For Type, specify Baseline Percentage. Average Instances of the process > 150% of baseline over 15 minutes OR Average Instances of the process< 50% of baseline over 15 minutes |
Average Instances of the process < 150% of baseline over 15 minutes
AND
Average Instances of the process> 50% of baseline over 15 minutes |
||
_mysqlData_Availability Note: The object subtype for this policy, mysqld, applies to both the MySQL Config
database and the MySQL Data database.
|
PAS HSA DNC |
Process Poller | ProcessSubtype: mysqldSeverity: AlertAverage Availability < 100% over 15 minutes | Average Availability = 100% over 15 minutes |
Execute the following commands:
|
If the command fails, check firewall access on ports 3306 and 3307 (TCP). |
_mysqldData_CPUTime Note: The object subtype for this policy, mysqld, applies to both the MySQL Config
database and the MySQL Data database.
|
PAS HSA DNC |
Process Poller |
ProcessSubtype: mysqldSeverity: Alert Note: For Type, specify Baseline Percentage. Average CPU Time > 150% of baseline over 15 minutes |
Average CPU Time < 150% of baseline over 15 minutes | ||
_mysqlData_Memory Note: The object subtype for this policy, mysqld, applies to both the MySQL Config
database and the MySQL Data database.
|
PAS HSA DNC |
Process Poller |
ProcessSubtype: mysqldSeverity: Alert Note: For Type, specify Baseline Percentage. Average System memory used > 150% of baseline over 15 minutes |
Average System memory used < 150% of baseline over 15 minutes | ||
_mysqlData_ThreadCount Note: The object subtype for this policy, mysqld, applies to both the MySQL Config database and the MySQL Data database. |
PAS HSA DNC |
Process Poller |
Process Subtype: mysqld Severity: Alert Note: For Type, specify Baseline Percentage. Average Instances of the process > 150% of baseline over 15 minutes OR Average Instances of the process < 50% of baseline over 15 minutes |
Average Instances of the process < 150% of baseline over 15 minutes AND Average Instances of the process > 50% of baseline over 15 minutes |
||
_sshd_Availability |
PAS HSA DNC |
Process Poller | Process Subtype: sshd Severity: Alert Average Availability < 100% over 15 minutes | Average Availability = 100% over 15 minutes |
Execute the following commands:
|
If the command fails, check firewall access on port 22 (TCP). |
_sshd_CPUTime |
PAS HSA DNC |
Process Poller |
Process Subtype: sshd Severity: Alert Note: For Type, specify Baseline Percentage. Average CPU Time > 150% of baseline over 15 minutes |
Average CPU Time < 150% of baseline over 15 minutes | ||
_sshd_Memory |
PAS HSA DNC |
Process Poller |
Process Subtype: sshd Severity: Alert Note: For Type, specify Baseline Percentage. Average System memory used > 150% of baseline over 15 minutes |
Average System memory used < 150% of baseline over 15 minutes | ||
_crond_Availability |
PAS HSA DNC |
Process Poller |
Process Subtype: crond Severity: Alert Average Availability < 100% over 15 minutes |
Average Availability = 100% over 15 minutes | ||
_crond_CPUTime |
PAS HSA DNC |
Process Poller |
Process Subtype: crond Severity: Alert Note: For Type, specify Baseline Percentage. Average CPU Time > 150% of baseline over 15 minutes |
Average CPU Time < 150% of baseline over 15 minutes | ||
_crond_Memory |
PAS HSA DNC |
Process Poller |
Process Subtype: crond Severity: Alert Note: For Type, specify Baseline Percentage. Average System memory used > 150% of baseline over 15 minutes |
Average System memory used < 150% of baseline over 15 minutes | ||
_datad_Availability |
PAS HSA DNC |
Process Poller | Process Subtype: SevOne-datad Severity: Alert Average Availability < 100% over 15 minutes | Average Availability = 100% over 15 minutes |
Execute the following commands:
|
If the command fails, contact SevOne Support. |
_datad_CPUTime |
PAS HSA DNC |
Process Poller |
Process Subtype: SevOne-datad Severity: Alert Note: For Type, specify Baseline Percentage. Average CPU Time > 150% of baseline over 15 minutes |
Average CPU Time < 150% of baseline over 15 minutes | ||
_datad_Memory |
PAS HSA DNC |
Process Poller |
Process Subtype: SevOne-datad Severity: Alert Note: For Type, specify Baseline Percentage. Average System memory used > 150% of baseline over 15 minutes |
Average System memory used < 150% of baseline over 15 minutes | ||
_masterslaved_Availability |
PAS HSA DNC |
Process Poller |
Process Subtype: SevOne-masterslaved Severity: Alert Average Availability < 100% over 15 minutes |
Average Availability = 100% over 15 minutes |
Execute the following commands:
|
If the command fails, check sshd server on port 60006 (TCP). |
_masterslaved_CPUTime |
PAS HSA DNC |
Process Poller |
Process Subtype: SevOne-masterslaved Severity: Alert Note: For Type, specify Baseline Percentage. Average CPU Time > 150% of baseline over 15 minutes |
Average CPU Time < 150% of baseline over 15 minutes | ||
_masterslaved_Memory |
PAS HSA DNC |
Process Poller |
Process Subtype: SevOne-masterslaved Severity: Alert Note: For Type, specify Baseline Percentage. Average System memory used > 150% of baseline over 15 minutes |
Average System memory used < 150% of baseline over 15 minutes | ||
_ntpd_Availability |
PAS HSA DNC |
Process Poller | Process Subtype: ntpd Severity: Alert Average Availability < 100% over 15 minutes | Average Availability = 100% over 15 minutes |
Execute the following command: systemctl start ntpd |
If the command fails, check firewall access on port 123 (UDP). |
_ntpd_CPUTime |
PAS HSA DNC |
Process Poller |
Process Subtype: ntpd Severity: Alert Note: For Type, specify Baseline Percentage. Average CPU Time > 150% of baseline over 15 minutes |
Average CPU Time < 150% of baseline over 15 minutes | ||
_ntpd_Memory |
PAS HSA DNC |
Process Poller |
Process Subtype: ntpd Severity: Alert Note: For Type, specify Baseline Percentage. Average System memory used > 150% of baseline over 15 minutes |
Average System memory used < 150% of baseline over 15 minutes | ||
_netflowd_Availability | DNC | Process Poller |
Process Subtype: SevOne-netflowd Severity: Alert Average Availability < 100% over 15 minutes |
Average Availability = 100% over 15 minutes |
Execute the following commands:
|
If the command fails, contact SevOne Support. |
_netflowd_CPUTime | DNC | Process Poller |
Process Subtype: SevOne-netflowd Severity: Alert Note: For Type, specify Baseline Percentage. Average CPU Time > 150% of baseline over 15 minutes |
Average CPU Time < 150% of baseline over 15 minutes | ||
_netflowd_Memory | DNC | Process Poller |
Process Subtype: SevOne-netflowd Severity: Alert Note: For Type, specify Baseline Percentage. Average System memory used > 150% of baseline over 15 minutes |
Average System memory used < 150% of baseline over 15 minutes | ||
Selfmond: peer is not reachable via SSH |
PAS HSA DNC |
SNMP Poller |
One or more peers are not reachable via SSH. Object Type: SevOne Process (Linux (Net-SNMP)) Severity: Emergency AllPeersReachable changed to 0 over 1 minute |
No | ||
Selfmond: updater has fallen behind |
PAS HSA DNC |
SNMP Poller |
Updater has fallen behind. Object Type: SevOne Process (Linux (Net-SNMP)) Severity: Emergency Average SecondsSinceLastUpdate > 10800 over 1 minute |
No | ||
Selfmond: master-slave both master |
PAS HSA DNC |
SNMP Poller | Both appliances think that they are leader. Object Type: SevOne Process (Linux (Net-SNMP)) Severity: Emergency Average BothMaster >= 5 minutes | No | ||
Selfmond: master-slave active appliance |
PAS HSA DNC |
SNMP Poller |
Active appliance is changed. Object Type: SevOne Process (Linux (Net-SNMP)) Severity: Emergency ActiveAppliance changed over 5 minutes |
No | ||
Selfmond: Config Db replication is too much behind |
PAS HSA DNC |
SNMP Poller |
Config Db replication is too much behind for normal operations. Object Type: SevOne Process (Linux (Net-SNMP)) Severity: Emergency Average SecondsBehindMaster > 300 over 5 minutes |
No | ||
Selfmond: Data Db replication is too much behind |
PAS HSA DNC |
SNMP Poller |
Data Db replication is too much behind for normal operations. Object Type: SevOne Process (Linux (Net-SNMP)) Severity: Emergency Average SecondsBehindMaster > 300 over 5 minutes |
No | ||
Selfmond: / mount point become read-only |
PAS HSA DNC |
SNMP Poller |
/data mount point become read-only. Object Type: SevOne Process (Linux (Net-SNMP)) Severity: Emergency Average dataMountPoint = 2 over 1 minute |
No | ||
Selfmond: /data mount point become read-only |
PAS HSA DNC |
SNMP Poller |
/data mount point become read-only. Object Type: SevOne Process (Linux (Net-SNMP)) Severity: Emergency Average dataMountPoint = 2 over 1 minute |
No | ||
Selfmond: /iodrive mount point become read-only |
PAS HSA DNC |
SNMP Poller |
/iodrive mount point become read-only. Object Type: SevOne Process (Linux (Net-SNMP)) Severity: Emergency Average iodriveMountPoint = 2 over 1 minute |
No | ||
Selfmond: no polled indicators |
PAS HSA DNC |
SNMP Poller | Object Type: SevOne Process (Linux (Net-SNMP)) Severity: Emergency Polled indicators per second equal 0 for over 5 minutes | No | ||
Selfmond: report mailer has fallen behind |
PAS HSA DNC |
SNMP Poller | Object Type: SevOne Process (Linux (Net-SNMP)) Severity: Emergency Report mailer has fallen 300 seconds behind the time a report is supposed to be sent | No | ||
Selfmond: keys expired |
PAS HSA DNC |
SNMP Poller | Object Type: SevOne Process (Linux (Net-SNMP)) Severity: Emergency Number of expired crypto keys is > 0 for more than 1 minute | No |
In the NMS console on any peer, enter the NMS container and execute the following commands.
where <USER ID> is the user with the expired crypto key |
Crypto keys are used to decrypt sensitive information obtained through the REST API and are generated in the NMS console by SevOne-act activate-crypto-permissions |
Selfmond: SevOne-netflowd has fallen behind |
PAS HSA DNC |
SNMP Poller |
Object Type: SevOne Process (Linux (Net-SNMP)) Severity: Emergency Greater than 10 minutes for over 1 minute. |
No | ||
Selfmond: SevOne-ffupdater has fallen behind |
PAS HSA DNC |
SNMP Poller | Object Type: SevOne Process (Linux (Net-SNMP)) Severity: Emergency Greater than 90 minutes for over 1 minute. | No |
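
Several remediation steps in the table above suggest checking firewall access on TCP ports (22 for sshd, 3306 and 3307 for mysqld, 60006 for SevOne-masterslaved). The following is a minimal sketch of how such a check might be scripted. The function names `selfmon_tcp_ports`, `check_tcp_port`, and `report_ports` are hypothetical and are not part of the SevOne tooling; the probe assumes bash with `/dev/tcp` support and the GNU `timeout` command. NTP uses UDP port 123 and is not covered by this TCP probe.

```shell
#!/usr/bin/env bash
# Hypothetical helper: TCP ports named in the remediation steps above.
selfmon_tcp_ports() {
  case "$1" in
    sshd)                echo "22" ;;
    mysqld)              echo "3306 3307" ;;
    SevOne-masterslaved) echo "60006" ;;
    *)                   return 1 ;;   # no TCP port listed for this process
  esac
}

# Hypothetical helper: succeed if a TCP connection to host:port opens
# within 3 seconds (assumes bash /dev/tcp and GNU timeout are available).
check_tcp_port() {
  local host="$1" port="$2"
  timeout 3 bash -c 'exec 3<>/dev/tcp/$0/$1' "$host" "$port" 2>/dev/null
}

# Probe every listed port for a process on a peer; print one line per port.
report_ports() {
  local process="$1" host="$2" port
  for port in $(selfmon_tcp_ports "$process"); do
    if check_tcp_port "$host" "$port"; then
      echo "${process} ${host}:${port} reachable"
    else
      echo "${process} ${host}:${port} NOT reachable"
    fi
  done
}
```

For example, `report_ports mysqld <peer IP>` prints one line for each of ports 3306 and 3307, where `<peer IP>` is the address of the peer being checked.
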
CONFIGURATION POLICIES | ||||||
---|---|---|---|---|---|---|
Policy (prefix: S1_SELFMON) | Applies To | Plugin | Alert Condition | Clear Condition | Suggested Remediation Steps | Details |
_Managed_Objects |
PAS HSA DNC |
Deferred Data |
SevOne Appliance Severity: Alert Note: For Type, specify Baseline Percentage. Average Total objects enabled > 20% of baseline over 60 minutes OR Average Total objects enabled < 20% of baseline over 60 minutes |
No clear condition. If the alert condition triggers, there is likely a serious problem. The alert should be cleared manually once the issue has been addressed. | ||
_Disabled_Objects |
PAS HSA DNC |
Deferred Data |
SevOne Appliance Severity: Alert Note: For Type, specify Baseline Percentage. Average Total objects disabled > 20% of baseline over 60 minutes OR Average Total objects disabled < 20% of baseline over 60 minutes |
No clear condition. If the alert condition triggers, there is likely a serious problem. The alert should be cleared manually once the issue has been addressed. |
XSTATS POLICIES | ||||||
---|---|---|---|---|---|---|
Policy (prefix: S1_SELFMON) | Applies To | Plugin | Alert Condition | Clear Condition | Suggested Remediation Steps | Details |
_bulkd_Availability (xStats Only) | PAS | Process Poller |
Process Subtype: SevOne-bulkd (process) Severity: Alert Average Availability < 100% over 15 minutes |
Average Availability = 100% over 15 minutes | ||
_bulkd_CPUTime (xStats Only) | PAS | Process Poller |
Process Subtype: SevOne-bulkd (process) Severity: Alert Note: For Type, specify Baseline Percentage. Average CPU Time > 150% of baseline over 15 minutes |
Average CPU Time < 150% of baseline over 15 minutes | ||
_bulkd_Memory (xStats Only) | PAS | Process Poller |
Process Subtype: SevOne-bulkd (process) Severity: Alert Note: For Type, specify Baseline Percentage. Average System memory used > 150% of baseline over 15 minutes |
Average System memory used < 150% of baseline over 15 minutes | ||
_bulkd_ThreadCount (xStats Only) | PAS | Process Poller |
Process Subtype: SevOne-bulkd (process) Severity: Alert Note: For Type, specify Baseline Percentage. Average Instances of the process > 150% of baseline over 15 minutes OR Average Instances of the process < 50% of baseline over 15 minutes |
Average Instances of the process < 150% of baseline over 15 minutes AND Average Instances of the process > 50% of baseline over 15 minutes |
||
_fcad_Availability (xStats Only) | PAS | Process Poller |
Process Subtype: SevOne-fcad Severity: Alert Average Availability < 75% over 15 minutes |
Average Availability > 75% over 15 minutes | ||
_fcad_CPUTime (xStats Only) | PAS | Process Poller |
Process Subtype: SevOne-fcad Severity: Alert Note: For Type, specify Baseline Percentage. Average CPU Time > 120% of baseline over 15 minutes |
Average CPU Time < 120% of baseline over 15 minutes | ||
_fcad_Memory (xStats Only) | PAS | Process Poller |
Process Subtype: SevOne-fcad Severity: Alert Note: For Type, specify Baseline Percentage. Average System memory used > 150% of baseline over 15 minutes |
Average System memory used < 150% of baseline over 15 minutes | ||
_fcad_ThreadCount (xStats Only) | PAS | Process Poller |
Process Subtype: SevOne-fcad Severity: Alert Note: For Type, specify Baseline Percentage. Average Instances of the process > 150% of baseline over 15 minutes OR Average Instances of the process < 50% of baseline over 15 minutes |
Average Instances of the process < 150% of baseline over 15 minutes AND Average Instances of the process > 50% of baseline over 15 minutes |
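
The process policies in the tables above compare a 15-minute average against a percentage of the baseline. The sketch below shows one way to read those alert and clear conditions. The function names `cpu_or_memory_state` and `thread_count_state` are hypothetical, the inputs are plain numbers supplied by the caller, and the code only illustrates the thresholds; it is not how SevOne NMS evaluates policies internally.

```shell
# Hypothetical helper for the CPU Time and Memory policies.
# Usage: cpu_or_memory_state <15-min average> <baseline> [threshold percent, default 150]
cpu_or_memory_state() {
  awk -v avg="$1" -v base="$2" -v pct="${3:-150}" 'BEGIN {
    if (avg * 100 > pct * base)      print "alert"    # above threshold: alert condition
    else if (avg * 100 < pct * base) print "clear"    # below threshold: clear condition
    else                             print "neither"  # exactly at threshold: neither holds
  }'
}

# Hypothetical helper for the ThreadCount policies:
# alert outside 50%..150% of baseline, clear strictly inside that band.
thread_count_state() {
  awk -v avg="$1" -v base="$2" 'BEGIN {
    high = (avg * 100 > 150 * base)
    low  = (avg * 100 < 50 * base)
    if (high || low)                                         print "alert"
    else if (avg * 100 < 150 * base && avg * 100 > 50 * base) print "clear"
    else                                                     print "neither"
  }'
}
```

For example, `cpu_or_memory_state 130 100 120` applies the 120% threshold used by `_fcad_CPUTime`, while omitting the third argument applies the 150% threshold used by the other CPU Time and Memory policies.
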
Uninstall Self-monitoring
There are a handful of instances when it may become necessary to uninstall the Self-monitoring components. Removal and re-installation is the most efficient way to update Self-monitoring after a change such as a new IP address or a new SevOneStats user password. Back up the root user's crontab before executing the following commands, because the uninstall script modifies the crontab.
- Back up the root user's crontab.
$ crontab -l > /root/root_crontab_backup-`date +%s`.txt
- Change directory to /usr/local/scripts/utilities/plugins/selfmon.
$ cd /usr/local/scripts/utilities/plugins/selfmon
- Run the uninstall.sh script and follow the prompts, or run the script in a single line as shown below.
$ (echo -e '\n') | /usr/local/scripts/utilities/plugins/selfmon/uninstall.sh
- Validate that the root user's crontab no longer contains the following lines.
*/5 * * * * /usr/local/scripts/utilities/plugins/selfmon/SevOneMon/sevone.deferreddata.poll.sevone --api 127.0.0.1 --device-id "2" --object 'SevOne Statistics' >> /var/SevOne/SevOneMon.log 2>&1
*/5 * * * * /usr/local/scripts/utilities/plugins/selfmon/MySQLMon/sevone.deferreddata.poll.mysql --api 127.0.0.1 --device-id "2" --database 127.0.0.1:3307 --object 'MySQL Config Database' --database-profile "config" >> /var/SevOne/MySQLMon.log 2>&1
*/5 * * * * /usr/local/scripts/utilities/plugins/selfmon/MySQLMon/sevone.deferreddata.poll.mysql --api 127.0.0.1 --device-id "2" --database 127.0.0.1:3306 --object 'MySQL Data Database' --database-profile "data" >> /var/SevOne/MySQLMon.log 2>&1
*/5 * * * * php /usr/local/scripts/utilities/plugins/selfmon/PolldMon/deferreddata.polld.php --api 127.0.0.1 --device-id "2" --polld-object 'SevOne-polld performance' --highpolld-object 'SevOne-highpolld performance' >> /var/SevOne/PolldMon.log 2>&1
*/5 * * * * php /usr/local/scripts/utilities/plugins/selfmon/RedisMon/sevone.deferreddata.poll.redis --api 127.0.0.1 --device-id "2" --object 'Redis Instance' >> /var/SevOne/RedisMon.log 2>&1
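
The final validation step can also be scripted. The sketch below assumes that every self-monitoring entry references the /usr/local/scripts/utilities/plugins/selfmon/ directory, as the crontab lines above do. The function names `selfmon_cron_entries` and `verify_selfmon_removed` are hypothetical and are not shipped with the appliance.

```shell
# Hypothetical helper: print any crontab lines (read from stdin) that still
# reference the self-monitoring plugin directory.
selfmon_cron_entries() {
  grep -F '/usr/local/scripts/utilities/plugins/selfmon/' || true
}

# Report whether the crontab text on stdin is free of self-monitoring entries.
# Returns 0 when none remain and 1 when at least one is still present.
verify_selfmon_removed() {
  local leftover
  leftover="$(selfmon_cron_entries)"
  if [ -n "$leftover" ]; then
    echo "Self-monitoring entries still present:"
    echo "$leftover"
    return 1
  fi
  echo "No self-monitoring entries found."
}
```

As the root user, the check could be run as `crontab -l | verify_selfmon_removed`.
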
Troubleshooting
Changed SevOneStats password after enabling Self-monitoring. Now, it is not working
SevOne strongly advises against changing the password for the API user SevOneStats after self-monitoring is enabled. However, if you changed the password while self-monitoring was already running and you are experiencing problems, please contact the SevOne Support team for assistance.