Daily monitoring checklist

Review the checklist to ensure that you complete important daily monitoring tasks.

Complete the daily monitoring tasks from the Operations Center Overview page. You can access the Overview page by opening the Operations Center and clicking Overviews.

The following figure shows the location for completing each task.

Figure: The Overview page, with numbered callouts that mark the location for each task in the checklist.

Tip: To run administrative commands for advanced monitoring tasks, use the Operations Center command builder. The command builder provides a type-ahead function to guide you as you enter commands. To open the command builder, go to the Operations Center Overview page. On the menu bar, hover over the settings icon and click Command Builder.

The following table lists the daily monitoring tasks and provides instructions for completing each task.

Table 1. Daily monitoring tasks
Task | Basic procedures | Advanced procedures and troubleshooting information
In the illustration of the Overview page, the number 1 corresponds to the Clients area. Determine whether clients are at risk of being unprotected due to failed or missed backup operations. To verify whether clients are at risk, in the Clients area, look for an At risk notification. To view details, click the Clients area.
Attention: If the At risk percentage is much greater than usual, it might indicate a ransomware attack. A ransomware attack can cause backup operations to fail, thus placing clients at risk. For example, if the percentage of clients at risk is normally between 5% and 10%, but the percentage increases to 40% or 50%, investigate the cause.
If you installed the client management service on a backup-archive client, you can view and analyze the client error and schedule logs by completing the following steps:
  1. In the Clients table, select the client and click Details.
  2. To diagnose an issue, click Diagnosis.
For clients that do not have the client management service installed, access the client system to review the client error logs.
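The At risk comparison that is described above can be sketched as a simple threshold check. This is a minimal illustration only: the function name and the factor parameter are hypothetical and are not part of the Operations Center, and the percentages must be read from the Clients area by hand or by your own tooling.

```python
def at_risk_spike(current_pct: float, usual_max_pct: float, factor: float = 2.0) -> bool:
    """Return True when the at-risk percentage is far above the usual maximum.

    factor is a hypothetical sensitivity setting, not a product default.
    """
    if current_pct < 0 or usual_max_pct < 0 or factor <= 0:
        raise ValueError("percentages must be non-negative and factor must be positive")
    return current_pct > usual_max_pct * factor


# The usual range is 5% to 10%; today 45% of clients are at risk.
print(at_risk_spike(45.0, 10.0))  # True
print(at_risk_spike(8.0, 10.0))   # False
```

A multiplier is used here instead of a fixed limit so that the same check can be applied to environments with different normal ranges. Choose a factor that suits your own baseline.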
In the illustration of the Overview page, the number 2 corresponds to the Alerts area. Determine whether client-related or server-related errors require attention. To determine the severity of any reported alerts, in the Alerts area, hover over the columns. To view additional information about alerts, complete the following steps:
  1. Click the Alerts area.
  2. In the Alerts table, select an alert.
  3. In the Activity Log pane, review the messages. The pane displays related messages that were issued before and after the selected alert occurred.
In the illustration of the Overview page, the number 3 corresponds to the Servers area. Determine whether servers that are managed by the Operations Center are available to provide data protection services to clients.
  1. To verify whether servers are at risk, in the Servers area, look for an Unavailable notification.
  2. To view additional information, click the Servers area.
  3. Select a server in the Servers table and click Details.
Tip: If you detect an issue that is related to server properties, update the server properties:
  1. In the Servers table, select a server and click Details.
  2. To update server properties, click Properties.
In the illustration of the Overview page, the number 4 corresponds to the Inventory area. Determine whether sufficient space is available for the server inventory, which consists of the server database, active log, and archive log.
  1. Click the Servers area.
  2. In the Status column of the table, view the status of the server and resolve any issues:
    • Normal (check mark icon): Sufficient space is available for the server database, active log, and archive log.
    • Critical (circle with an X mark): Insufficient space is available for the server database, active log, or archive log. You must add space immediately, or the data protection services that are provided by the server will be interrupted.
    • Warning (triangle with an exclamation mark): The server database, active log, or archive log is running out of space. If this condition persists, you must add space.
    • Unavailable (icon that resembles a cracked ball): Status cannot be obtained. Ensure that the server is running, and that there are no network issues. This status is also shown if the monitoring administrator ID is locked or otherwise unavailable on the server. This ID is named IBM-OC-hub_server_name.
    • Unmonitored (question mark in a diamond): Unmonitored servers are defined to the hub server, but are not configured for management by the Operations Center. To configure an unmonitored server, select the server, and click Monitor Spoke.
You can also look for related alerts on the Alerts page. For additional instructions about troubleshooting, see Resolving server problems.
In the illustration of the Overview page, the number 5 corresponds to the DB2 area. Verify server database backup operations. To determine when a server was most recently backed up, complete the following steps:
  1. Click the Servers area.
  2. In the Servers table, review the Last Database Backup column.
To obtain more detailed information about backup operations, complete the following steps:
  1. In the Servers table, select a row and click Details.
  2. In the DB Backup area, hover over the check marks to review information about backup operations.
If a database was not backed up recently (for example, in the last 24 hours), you can start a backup operation:
  1. On the Operations Center Overview page, click the Servers area.
  2. In the table, select a server and click Back Up.
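The 24-hour guideline can be expressed as a small age check. The following sketch assumes that you already copied the timestamp from the Last Database Backup column into a datetime value. The function name and the max_age default are illustrative and are not part of the product.

```python
from datetime import datetime, timedelta


def backup_is_stale(last_backup: datetime, now: datetime,
                    max_age: timedelta = timedelta(hours=24)) -> bool:
    """Return True when the last database backup is older than max_age."""
    return now - last_backup > max_age


now = datetime(2024, 5, 2, 9, 0)
print(backup_is_stale(datetime(2024, 5, 1, 6, 0), now))   # True (27 hours old)
print(backup_is_stale(datetime(2024, 5, 1, 22, 0), now))  # False (11 hours old)
```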
To determine whether the server database is configured for automatic backup operations, complete the following steps:
  1. On the menu bar, hover over the settings icon and click Command Builder.
  2. Issue the QUERY DB command:
    query db f=d
  3. In the output, review the Full Device Class Name field. If a device class is specified, the server is configured for automatic database backups.
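If you save the command output to text, the field check in step 3 can be sketched as follows. The sample text is hypothetical, and the sketch assumes that each field appears on its own line in "label: value" form. Verify the layout against the output on your own server before you rely on it.

```python
from typing import Optional


def full_device_class(query_db_output: str) -> Optional[str]:
    """Return the Full Device Class Name value, or None if it is blank or absent."""
    for line in query_db_output.splitlines():
        # Split at the first colon only, so values that contain colons stay intact.
        label, sep, value = line.partition(":")
        if sep and label.strip().lower() == "full device class name":
            return value.strip() or None
    return None


# Hypothetical excerpt of detailed QUERY DB output; the names are placeholders.
SAMPLE = """\
         Database Name: TSMDB1
Full Device Class Name: DBBACK_FILEDEV
"""

print(full_device_class(SAMPLE))  # DBBACK_FILEDEV
```

A return value of None in this sketch means that no device class was found in the text, which corresponds to a server that is not configured for automatic database backups.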
In the illustration of the Overview page, the number 6 corresponds to the Servers menu. Monitor other server maintenance tasks. Server maintenance tasks can include running administrative command schedules, maintenance scripts, and related commands. To search for information about processes that failed because of server issues, complete the following steps:
  1. Click Servers > Maintenance.
  2. To obtain the two-week history of a process, view the History column.
  3. To obtain more information about a scheduled process, hover over the check box that is associated with the process.
For more information about monitoring processes and resolving issues, see the Operations Center online help.
In the illustration of the Overview page, the number 7 corresponds to the Activity area. Verify that the amount of data that was recently sent to and from servers is within the expected range.
  • To obtain an overview of activity in the last 24 hours, view the Activity area.
  • To compare activity in the last 24 hours with activity in the previous 24 hours, review the figures in the Current and Previous areas.
  • If more data was sent to the server than you expected, determine which clients are backing up more data and investigate the cause. It is possible that client-side data deduplication is not working correctly.
    Attention: If the amount of backed-up data is significantly larger than usual, it might indicate a ransomware attack. When ransomware encrypts data, the system perceives the data as being changed, and the changed data is backed up. Thus, backup volumes become larger. To determine which clients are affected, click the Applications, Virtual, or Systems tabs.
  • If less data was sent to the server than you expected, investigate whether client backup operations are proceeding on schedule.
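The comparison of the Current and Previous figures can be sketched as a relative-change check. The function name, the gigabyte units, and the 50% tolerance are assumptions for illustration. Set the tolerance to match the variation that is normal for your environment.

```python
def classify_activity(current_gb: float, previous_gb: float,
                      tolerance: float = 0.5) -> str:
    """Compare the last 24 hours with the previous 24 hours.

    tolerance is the relative change that is still treated as expected.
    """
    if previous_gb <= 0:
        return "no baseline"
    change = (current_gb - previous_gb) / previous_gb
    if change > tolerance:
        return "higher than expected"
    if change < -tolerance:
        return "lower than expected"
    return "within expected range"


print(classify_activity(300.0, 100.0))  # higher than expected
print(classify_activity(40.0, 100.0))   # lower than expected
print(classify_activity(110.0, 100.0))  # within expected range
```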
In the illustration of the Overview page, the number 8 corresponds to the Pools area. Verify that storage pools are available to back up client data.
  1. If problems are indicated in the Storage & Data Availability area, click Pools to view the details:
    • If the Critical status (a circle with an X mark) is displayed, insufficient space is available in the storage pool, or its access status is unavailable.
      Attention: If the status is critical, investigate the cause:
      • If the data deduplication rate for a storage pool drops significantly, it might indicate a ransomware attack. During a ransomware attack, data is encrypted and cannot be deduplicated. To verify the data deduplication rate, in the Storage Pools table, review the value in the % Savings column.
      • If a storage pool unexpectedly becomes 100% utilized, it might indicate a ransomware attack. To verify the utilization, review the value in the Capacity Used column. Hover over the values to see the percentages of used and free space.
    • If the Warning status (a triangle with an exclamation mark) is displayed, the storage pool is running out of space, or its access status is read-only.
  2. To view the used, free, and total space for your selected storage pool, hover over the entries in the Capacity Used column.

To view the storage-pool capacity that was used over the past two weeks, select a row in the Storage Pools table and click Details.
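The two storage pool indicators that are described in the Attention notice, a significant drop in the % Savings value and a pool that reaches 100% utilization, can be combined in one check. This is a sketch under stated assumptions: the function name and the 20-point drop threshold are hypothetical, and the input values are read manually from the Storage Pools table.

```python
from typing import List


def pool_indicators(savings_pct: float, usual_savings_pct: float,
                    used_pct: float, drop_threshold: float = 20.0) -> List[str]:
    """List the indicators that call for investigating a storage pool."""
    findings = []
    # A large drop in deduplication savings can mean that the data was encrypted.
    if usual_savings_pct - savings_pct >= drop_threshold:
        findings.append("deduplication savings dropped")
    if used_pct >= 100.0:
        findings.append("pool is fully utilized")
    return findings


print(pool_indicators(savings_pct=15.0, usual_savings_pct=60.0, used_pct=100.0))
# ['deduplication savings dropped', 'pool is fully utilized']
print(pool_indicators(savings_pct=58.0, usual_savings_pct=60.0, used_pct=72.0))
# []
```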

In the illustration of the Overview page, the number 9 corresponds to the Devices area. Verify that storage devices are available for backup operations. In the Storage & Data Availability area, in the Volumes section, under the capacity bars, review the status that is reported next to Devices. If a Critical status (a circle with an X mark) or Warning status (a triangle with an exclamation mark) is displayed for any device, investigate the issue. To view details, click Devices. Disk devices might have a critical or warning status for the following reasons:
  • For DISK device classes, volumes might be offline or have a read-only access status. The Disk Storage column of the Disk Devices table shows the state of volumes.
  • For FILE device classes that are not shared, directories might be offline. Also, insufficient free space might be available for allocating scratch volumes. The Disk Storage column of the Disk Devices table shows the state of directories.
  • For FILE device classes that are shared, drives might be unavailable. A drive is unavailable if it is offline, if it stopped responding to the server, or if its path is offline. Other columns of the Disk Devices table show the state of the drives and paths.

Tape devices might have a warning or critical status if drives are unavailable. A drive is unavailable if it is offline, if it stopped responding to the server, or if its path is offline. A tape device might also have a critical status if the library is offline. Other columns of the Tape Devices table show the state of the library robotics, drives, and paths.

For tape backup operations, verify that sufficient scratch tapes are available. If you are not certain whether the number of available scratch tapes is sufficient, open the details notebook to view tape usage and an estimate of scratch tape availability. To open the details notebook, select a library in the table and click Details.

In the illustration of the Overview page, the number 10 corresponds to the Replication area. Monitor node replication processes.
  1. To obtain the overall status of node replication processes, view the Replication area on the Operations Center Overview page.
  2. To view information about each replicated server pair, click the Replication area.
    Attention: If you notice an unexpected increase in the number of replication failures, it might indicate a ransomware attack. Investigate the cause of the failures.
  3. To view the amount of data that was replicated over the last two weeks and the speed of replication, select a server pair and click Details.
  4. To view replication information for a client, on the Operations Center Overview page, click Clients. View the information in the Replication Workload column.
    Attention: If you see a drastic, unexpected increase in the replication workload, it might indicate a ransomware attack. Investigate the cause of the increased workload.
For advanced monitoring, view information about running and ended node replication processes by using commands:
  1. On the Operations Center Overview page, hover over the settings icon and click Command Builder.
  2. Issue the QUERY REPLICATION command. For instructions, see QUERY REPLICATION (Query node replication processes). If the replication operation was completed successfully, the Total Files To Replicate value matches the Total Files Replicated value.
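The success check in step 2 can be sketched as a comparison of the two fields in saved command output. The sample text is hypothetical, and the sketch assumes that each field appears on its own line in "label: value" form, with optional thousands separators in the numbers. Confirm the layout against the output on your own server.

```python
def replication_complete(output: str) -> bool:
    """Return True when Total Files To Replicate equals Total Files Replicated."""
    fields = {}
    for line in output.splitlines():
        label, sep, value = line.partition(":")
        if sep:
            fields[label.strip().lower()] = value.strip()
    try:
        expected = int(fields["total files to replicate"].replace(",", ""))
        replicated = int(fields["total files replicated"].replace(",", ""))
    except (KeyError, ValueError):
        # A missing or non-numeric field is treated as not verified.
        return False
    return expected == replicated


# Hypothetical excerpt of QUERY REPLICATION output.
SAMPLE = """\
Total Files To Replicate: 1,250
  Total Files Replicated: 1,250
"""

print(replication_complete(SAMPLE))  # True
```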
To display messages that are related to a node replication process on a source or target replication server, complete the following steps:
  1. On the Operations Center Overview page, click Servers.
  2. Select the source or target replication server and click Details:
    • To view active tasks, click Active Tasks, select the task, and verify that the Running status is displayed. For details, view the related activity logs.
    • To view completed tasks, click Completed Tasks, select the task, and ensure that the Completed status is displayed. For details, view the related activity logs.