Daily monitoring checklist

To ensure that you are completing the daily monitoring tasks for your IBM Spectrum Protect solution, review the daily monitoring checklist.

Complete the daily monitoring tasks from the Operations Center Overview page. You can access the Overview page by opening the Operations Center and clicking Overviews.
The following figure shows the location for completing each task.
[Figure: Operations Center Overview page, with numbered callouts that mark the location for each task in the checklist.]
Tip: To run administrative commands for advanced monitoring tasks, use the Operations Center command builder. The command builder provides a type-ahead function to guide you as you enter commands. To open the command builder, go to the Operations Center Overview page. On the menu bar, hover over the settings icon and click Command Builder.

The following table lists the daily monitoring tasks and provides instructions for completing each task.

Table 1. Daily monitoring tasks
Task | Basic procedures | Advanced procedures and troubleshooting information
Watch for security notifications, which can indicate a ransomware attack. If a potential ransomware attack is detected in the IBM Spectrum Protect environment, a security notification message is displayed in the foreground of the Operations Center. For more information, click the message to open the Security Notifications page. On the Security Notifications page, you can take the following actions:
  • View notification details by client.
    Restriction: Notifications are available only for backup-archive clients and IBM Spectrum Protect for Virtual Environments clients.
  • Acknowledge a security notification by selecting it and clicking Acknowledge. When you acknowledge a security notification, a check mark is added to the Acknowledged column of the Security Notifications page for the selected client. The standard by which a notification is acknowledged is determined by your organization. A check mark might mean that you investigated the issue and determined that it is a false positive. Or it might mean that a problem exists and is being resolved.
  • Assign a security notification to an administrator by selecting the security notification and clicking Assign. To view the assignment, the administrator must sign in to the Operations Center and click Overviews > Security. If you are not certain whether the administrator regularly monitors the Security Notifications page, notify the administrator about the assignment.
  • If the notification is a false positive, select the security notification and click Reset. The security notification is deleted, along with the historical data that is used for baseline comparisons with the most recent backup operation. A new baseline is calculated from subsequent backup operations.
In the illustration of the Overview page, the number 1 corresponds to the Clients area. Determine whether clients are at risk of being unprotected due to failed or missed backup operations. To verify whether clients are at risk, in the Clients area, look for an At risk notification. To view details, click the Clients area.
Attention: If the At risk percentage is much greater than usual, it might indicate a ransomware attack. A ransomware attack can cause backup operations to fail, thus placing clients at risk. For example, if the percentage of clients at risk is normally between 5% and 10%, but the percentage increases to 40% or 50%, investigate the cause.
If you installed the client management service on a backup-archive client, you can view and analyze the client error and schedule logs by completing the following steps:
  1. In the Clients table, select the client and click Details.
  2. To diagnose an issue, click Diagnosis.
For clients that do not have the client management service installed, access the client system to review the client error logs.
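The at-risk comparison in the Attention note above (a usual range of 5% to 10% that rises to 40% or 50%) can be expressed as a simple threshold check. The following Python sketch is illustrative only: the function name `at_risk_is_unusual` and the `factor` multiplier are assumptions, not Operations Center settings, and the percentages are assumed to be read manually from the Clients area.

```python
def at_risk_is_unusual(current_pct, usual_max_pct, factor=2.0):
    """Return True when the at-risk percentage is well above the usual maximum.

    factor is an assumed multiplier; tune it to your own environment.
    """
    if current_pct < 0 or usual_max_pct < 0:
        raise ValueError("percentages must be non-negative")
    return current_pct > usual_max_pct * factor


# The usual range is 5% to 10%, so the usual maximum is 10.
print(at_risk_is_unusual(40, 10))  # 40 is compared with 10 * 2.0
print(at_risk_is_unusual(8, 10))   # 8 is compared with 10 * 2.0
```

A multiplier is used rather than a fixed percentage so that the same check can be applied to environments with different usual ranges.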
In the illustration of the Overview page, the number 2 corresponds to the Alerts area. Determine whether client-related or server-related errors require attention. To determine the severity of any reported alerts, in the Alerts area, hover over the columns. To view additional information about alerts, complete the following steps:
  1. Click the Alerts area.
  2. In the Alerts table, select an alert.
  3. In the Activity Log pane, review the messages. The pane displays related messages that were issued before and after the selected alert occurred.
In the illustration of the Overview page, the number 3 corresponds to the Servers area. Determine whether servers that are managed by the Operations Center are available to provide data protection services to clients.
  1. To verify whether servers are at risk, in the Servers area, look for an Unavailable notification.
  2. To view additional information, click the Servers area.
  3. Select a server in the Servers table and click Details.
Tip: If you detect an issue that is related to server properties, update the server properties:
  1. In the Servers table, select a server and click Details.
  2. To update server properties, click Properties.
In the illustration of the Overview page, the number 4 corresponds to the Inventory area. Determine whether sufficient space is available for the server inventory, which consists of the server database, active log, and archive log.
  1. Click the Servers area.
  2. In the Status column of the table, view the status of the server and resolve any issues:
    • Normal (check mark icon): Sufficient space is available for the server database, active log, and archive log.
    • Critical (circle with an X mark icon): Insufficient space is available for the server database, active log, or archive log. You must add space immediately, or the data protection services that are provided by the server will be interrupted.
    • Warning (triangle with an exclamation mark icon): The server database, active log, or archive log is running out of space. If this condition persists, you must add space.
    • Unavailable (icon resembling a cracked ball): Status cannot be obtained. Ensure that the server is running, and that there are no network issues. This status is also shown if the monitoring administrator ID is locked or otherwise unavailable on the server. This ID is named IBM-OC-hub_server_name.
    • Unmonitored (question mark in a diamond icon): Unmonitored servers are defined to the hub server, but are not configured for management by the Operations Center. To configure an unmonitored server, select the server, and click Monitor Spoke.
You can also look for related alerts on the Alerts page. For additional instructions about troubleshooting, see Resolving server problems.
In the illustration of the Overview page, the number 5 corresponds to the Db2 area. Verify server database backup operations. To determine when a server was most recently backed up, complete the following steps:
  1. Click the Servers area.
  2. In the Servers table, review the Last Database Backup column.
To obtain more detailed information about backup operations, complete the following steps:
  1. In the Servers table, select a row and click Details.
  2. In the DB Backup area, hover over the check marks to review information about backup operations.
If a database was not backed up recently (for example, in the last 24 hours), you can start a backup operation:
  1. On the Operations Center Overview page, click the Servers area.
  2. In the table, select a server and click Back Up.
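The 24-hour example above can be checked with a short calculation. The following Python sketch assumes that the timestamp from the Last Database Backup column has been copied by hand into a `datetime` value; the function name `backup_is_overdue` and the sample timestamps are hypothetical.

```python
from datetime import datetime, timedelta


def backup_is_overdue(last_backup, now, max_age=timedelta(hours=24)):
    """Return True when the last database backup is older than max_age."""
    return now - last_backup > max_age


# Hypothetical timestamps for illustration.
now = datetime(2024, 1, 10, 9, 0)
print(backup_is_overdue(datetime(2024, 1, 9, 20, 0), now))  # 13 hours old
print(backup_is_overdue(datetime(2024, 1, 8, 20, 0), now))  # 37 hours old
```

The 24-hour limit is only the example used in this checklist; pass a different `max_age` if your organization uses another interval.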
To determine whether the server database is configured for automatic backup operations, complete the following steps:
  1. On the menu bar, hover over the settings icon and click Command Builder.
  2. Issue the QUERY DB command:
    query db f=d
  3. In the output, review the Full Device Class Name field. If a device class is specified, the server is configured for automatic database backups.
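If the output of the QUERY DB command is captured as text (for example, copied from the command builder), the check in step 3 could be scripted. The following Python sketch assumes that each field appears on its own line in `label: value` form. That layout, the function name, and the sample device class name are illustrative assumptions rather than documented output.

```python
def automatic_backup_device_class(query_db_output):
    """Return the Full Device Class Name value, or None when it is not set."""
    for line in query_db_output.splitlines():
        label, sep, value = line.partition(":")
        if sep and label.strip() == "Full Device Class Name":
            # An empty value means that no device class is specified.
            return value.strip() or None
    return None


# Illustrative stand-in text, not captured server output.
sample = "Full Device Class Name: DBBACK_FILEDEV"
print(automatic_backup_device_class(sample))
```

A return value of `None` would suggest that automatic database backups are not configured, under the assumptions stated above.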
In the illustration of the Overview page, the number 6 corresponds to the Servers menu. Monitor other server maintenance tasks. Server maintenance tasks can include running administrative command schedules, maintenance scripts, and related commands. To search for information about processes that failed because of server issues, complete the following steps:
  1. Click Servers > Maintenance.
  2. To obtain the two-week history of a process, view the History column.
  3. To obtain more information about a scheduled process, hover over the check box that is associated with the process.
For more information about monitoring processes and resolving issues, see the Operations Center online help.
In the illustration of the Overview page, the number 7 corresponds to the Activity area. Verify that the amount of data that was recently sent to and from servers is within the expected range.
  • To obtain an overview of activity in the last 24 hours, view the Activity area.
  • To compare activity in the last 24 hours with activity in the previous 24 hours, review the figures in the Current and Previous areas.
  • If more data was sent to the server than you expected, determine which clients are backing up more data and investigate the cause. It is possible that client-side data deduplication is not working correctly.
    Attention: If the amount of backed-up data is significantly larger than usual, it might indicate a ransomware attack. When ransomware encrypts data, the system perceives the data as being changed, and the changed data is backed up. Thus, backup volumes become larger. To determine which clients are affected, click the Applications, Virtual Machines, or Systems tab.
  • If less data was sent to the server than you expected, investigate whether client backup operations are proceeding on schedule.
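The comparison between the current and previous 24-hour periods can be reduced to a percentage change. The following Python sketch is a minimal illustration: the function names, the gigabyte units, and the 25% `tolerance_pct` default are assumptions, and the figures are assumed to be read manually from the Current and Previous areas.

```python
def activity_change_pct(current_gb, previous_gb):
    """Return the percentage change from the previous to the current 24 hours."""
    if previous_gb <= 0:
        raise ValueError("previous_gb must be positive")
    return (current_gb - previous_gb) / previous_gb * 100.0


def classify_activity(current_gb, previous_gb, tolerance_pct=25.0):
    """Classify the change against an assumed tolerance band."""
    change = activity_change_pct(current_gb, previous_gb)
    if change > tolerance_pct:
        return "more than expected"
    if change < -tolerance_pct:
        return "less than expected"
    return "within expected range"


print(classify_activity(300, 200))  # change is +50%
print(classify_activity(100, 200))  # change is -50%
print(classify_activity(210, 200))  # change is +5%
```

The two outcomes outside the tolerance band correspond to the two investigation paths in the list above: unexpectedly large backups, and backup operations that might be behind schedule.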
In the illustration of the Overview page, the number 8 corresponds to the Pools area. Verify that storage pools are available to back up client data.
  1. If problems are indicated in the Storage & Data Availability area, click Pools to view the details:
    • If the Critical status (circle with an X mark icon) is displayed, insufficient space is available in the storage pool, or its access status is unavailable.
      Attention: If the status is critical, investigate the cause:
      • If the data deduplication rate for a storage pool drops significantly, it might indicate a ransomware attack. During a ransomware attack, data is encrypted and cannot be deduplicated. To verify the data deduplication rate, in the Storage Pools table, review the value in the % Savings column.
      • If a storage pool unexpectedly becomes 100% utilized, it might indicate a ransomware attack. To verify the utilization, review the value in the Capacity Used column. Hover over the values to see the percentages of used and free space.
    • If the Warning status (triangle with an exclamation mark icon) is displayed, the storage pool is running out of space, or its access status is read-only.
  2. To view the used, free, and total space for your selected storage pool, hover over the entries in the Capacity Used column.

To view the storage-pool capacity that was used over the past two weeks, select a row in the Storage Pools table and click Details.
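The two storage-pool warning signs described in the Attention note above (a significant drop in the data deduplication rate, and a pool that becomes 100% utilized) can be combined into one check. The following Python sketch is hypothetical: the function name, the 20-point `drop_points` default, and the idea of comparing against a usual savings value are assumptions, and the inputs are assumed to be read manually from the % Savings and Capacity Used columns.

```python
def pool_warning_signs(savings_pct, usual_savings_pct, used_pct, drop_points=20.0):
    """Return a list of warning signs for one storage pool."""
    signs = []
    # A large fall in % Savings compared with the usual value.
    if usual_savings_pct - savings_pct >= drop_points:
        signs.append("deduplication savings dropped")
    # The pool has no free space left.
    if used_pct >= 100.0:
        signs.append("pool fully utilized")
    return signs


print(pool_warning_signs(10.0, 60.0, 100.0))
print(pool_warning_signs(58.0, 60.0, 70.0))
```

An empty list means only that neither assumed threshold was crossed; it does not rule out other causes of a critical status.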

In the illustration of the Overview page, the number 9 corresponds to the Devices area. Verify that storage devices are available for backup operations. In the Storage & Data Availability area, in the Volumes section, under the capacity bars, review the status that is reported next to Devices. If a Critical (circle with an X mark icon) or Warning (triangle with an exclamation mark icon) status is displayed for any device, investigate the issue. To view details, click Devices. Disk devices might have a critical or warning status for the following reasons:
  • For DISK device classes, volumes might be offline or have a read-only access status. The Disk Storage column of the Disk Devices table shows the state of volumes.
  • For FILE device classes that are not shared, directories might be offline. Also, insufficient free space might be available for allocating scratch volumes. The Disk Storage column of the Disk Devices table shows the state of directories.
  • For FILE device classes that are shared, drives might be unavailable. A drive is unavailable if it is offline, if it stopped responding to the server, or if its path is offline. Other columns of the Disk Devices table show the state of the drives and paths.
In the illustration of the Overview page, the number 10 corresponds to the Replication area. Monitor node replication processes.
  1. To obtain the overall status of node replication processes, view the Replication area on the Operations Center Overview page.
  2. To view information about each replicated server pair, click the Replication area.
    Attention: If you notice an unexpected increase in the number of replication failures, it might indicate a ransomware attack. Investigate the cause of the failures.
  3. To view the amount of data that was replicated over the last two weeks and the speed of replication, select a server pair and click Details.
  4. To view replication information for a client, on the Operations Center Overview page, click Clients. View the information in the Replication Workload column.
    Attention: If you see a drastic, unexpected increase in the replication workload, it might indicate a ransomware attack. Investigate the cause of the increased workload.
For advanced monitoring, view information about running and ended node replication processes by using commands:
  1. On the Operations Center Overview page, hover over the settings icon and click Command Builder.
  2. Issue the QUERY REPLICATION command. For instructions, see QUERY REPLICATION (Query node replication processes). If the replication operation was completed successfully, the Total Files To Replicate value matches the Total Files Replicated value.
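The comparison of the two totals in step 2 can be sketched as follows. This Python example assumes that the Total Files To Replicate and Total Files Replicated values have already been copied from the command output as integers; the function name and the sample numbers are hypothetical.

```python
def replication_summary(files_to_replicate, files_replicated):
    """Compare the two totals reported for a node replication process."""
    if files_to_replicate < 0 or files_replicated < 0:
        raise ValueError("file counts must be non-negative")
    remaining = files_to_replicate - files_replicated
    return {"totals_match": remaining == 0, "remaining": remaining}


print(replication_summary(1200, 1200))
print(replication_summary(1200, 1150))
```

Matching totals are consistent with a successful replication operation, as described above. A nonzero `remaining` value is a prompt to review the related activity log messages.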
To display messages that are related to a node replication process on a source or target replication server, complete the following steps:
  1. On the Operations Center Overview page, click Servers.
  2. Select the source or target replication server and click Details:
    • To view active tasks, click Active Tasks, select the task, and verify that the Running status is displayed. For details, view the related activity logs.
    • To view completed tasks, click Completed Tasks, select the task, and ensure that the Completed status is displayed. For details, view the related activity logs.
In the illustration of the Overview page, the number 11 corresponds to the Retention Sets area. Monitor retention sets. To obtain the overall status of retention sets, view the Retention Sets area on the Operations Center Overview page:
  • The Completed field specifies the number of retention sets that were created in the server database and are tracked in the server inventory.
  • The Expired field specifies the number of retention sets whose data is expired.
  • The Deleted field specifies the number of retention sets that were deleted.
To view or modify retention rules, click Services > Retention Rules.
For more information about retention sets, click the Retention Sets area to open the Retention Sets page. To view or modify retention set properties, double-click a retention set.
For more detailed information, you can run related commands:
  1. On the Operations Center Overview page, hover over the settings icon and click Command Builder.
  2. To determine which retention set creation jobs are running, interrupted, or completed, run the QUERY JOB command. For instructions, see QUERY JOB (Query a retention set creation job).
  3. To query retention rules, run the QUERY RETRULE command. For instructions, see QUERY RETRULE (Query a retention rule).
  4. To query retention sets, run the QUERY RETSET command. For instructions, see QUERY RETSET (Query a retention set).
  5. To query retention set contents, run the QUERY RETSETCONTENTS command. For instructions, see QUERY RETSETCONTENTS (Query the contents of a retention set).
In the illustration of the Overview page, the number 12 corresponds to the Tiering Rules area. Monitor tiering rules. To obtain the overall status of tiering operations, view the Tiering Rules area on the Operations Center Overview page.
The status summary shows the most recent processing results for each tiering rule. The number of tiering rules in each of the following states is shown:
  • Normal: The number of tiering rules that ran without errors. Eligible data was successfully tiered according to the rule's specifications. The tiering process was completed within the rule's time limit.
  • Warning: The number of tiering rules that completed processing, but did not tier all eligible data. Either some files were skipped by the tiering process, the rule's time limit was reached, or the tiering process was canceled.
  • Failed (circle with an X mark icon): The number of tiering rules that did not complete processing. The server could not tier data. For example, the server might be unable to tier data because the target storage pool has insufficient space or because the server cannot access the storage pool.
  • Other states (question mark in a diamond icon): The number of tiering rules that have other states. The server on which the tiering rule is defined might be unavailable to provide the data, or might be running an earlier version of IBM Spectrum Protect that does not support status. Status might not be applicable because the tiering rule is not activated, or the tiering rule was not run.