Getting started: Accelerate your transition to the cloud with DevOps

What do you do when you find out about a problem not from an incident but from a help ticket? Follow this scenario and learn some proactive measures you can take to avoid future problems.

Monitoring. Developers and IT Operations work together to deliver innovation at speed and scale, leveraging cloud-native technologies such as containers, Kubernetes and ITOps. Successful enterprises are adopting these technologies for cloud-native applications and to modernize their existing ones to deliver business agility.

In the previous scenarios, you learned about the incident queue and the features for managing and handling incidents. In this scenario, the ITOps team notices some peculiar behavior that they'd like to monitor. You'll create a threshold to test for this behavior and use the Resources pages to help you fine-tune the threshold definition.

Open the Thresholds page

Go to Administer > Monitoring > Thresholds on the IBM Cloud Pak console.

Thresholds hexagon

On the Thresholds front page, you see a list of thresholds ordered by Permissions (Editable and Read-only), Assigned to (resource type, such as DB2 Instance and Linux Systems), and Name. For each threshold, you can see information such as the resource it monitors, the severity, and whether it is enabled or disabled.

Threshold Management front page with the mouse hovering over the Create button.

Create a threshold

ITOps told you they were having disk file space issues on their Linux systems, so you create a threshold to monitor the percentage of time spent in read operations:

  1. Click Create to define a new threshold.
  2. In the Details section, name the threshold Linux_disk_reads_80_percent and add a helpful description such as "Warning event for disk read time of 80% or more".
  3. In the Threshold section, select the Linux systems resource type and warning severity. For the condition, specify Disk IO Ext Disk Read Percent greater than or equal to (>=) 80 percent.

    Create Threshold page

  4. Scroll to the Assignments section, select the Individual instance option, click in Select instances to open a list of Linux system instances, and select an instance.

  5. Click Finish to save the threshold and return to the Thresholds page, which shows your new threshold.

    Threshold Management list showing just the one we created: Check disk reads, Warning, the two resources assigned to, Editable, and Enabled.
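The threshold defined in the steps above can be pictured as a small record plus a comparison. The following is only a sketch to illustrate the condition. The `Threshold` class, its field names, and the `breached` method are hypothetical and are not the product's schema or API.

```python
from dataclasses import dataclass
import operator

# Comparison operators a threshold condition might use (illustrative set).
OPERATORS = {">=": operator.ge, ">": operator.gt, "<=": operator.le, "<": operator.lt}


@dataclass(frozen=True)
class Threshold:
    """Hypothetical in-memory model of a threshold; not the product's schema."""
    name: str
    description: str
    resource_type: str
    severity: str
    metric: str
    op: str
    value: float

    def breached(self, sample: float) -> bool:
        """Return True when a metric sample satisfies the condition."""
        return OPERATORS[self.op](sample, self.value)


# Values taken from the steps above.
disk_reads = Threshold(
    name="Linux_disk_reads_80_percent",
    description="Warning event for disk read time of 80% or more",
    resource_type="Linux Systems",
    severity="Warning",
    metric="Disk IO Ext Disk Read Percent",
    op=">=",
    value=80.0,
)

print(disk_reads.breached(79.9))  # False
print(disk_reads.breached(80.0))  # True
```

Because the condition is "greater than or equal to", a sample of exactly 80 percent counts as a breach.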

Review the Resources dashboard

Find out if the threshold you created is opening events where they're needed. Some thresholds might need fine-tuning. Go to Monitor health > Infrastructure monitoring on the console. The resource types are displayed in alphabetical order.

All resources

Type linux in the search box to quickly find the Linux Systems resource type, and click the Linux Systems link.

Resource type

After you select Linux Systems, the resource instances are listed alphabetically.

Resource instance

Select the link for the same resource instance that you selected in step 4 when you defined the threshold. The instance dashboard is displayed with metrics from the past 12 hours. You can adjust the time span to show from the past 30 minutes up to one month at a time. If you have data retention configured, you can see up to one year of saved samples from the data provider.

Drag the dropped pin along the Events timeline to see the values displayed on every chart at that point in time. Any numbers along the timeline represent the number of events of the same type that are in close succession. Hover the mouse over an event marker to see when the event was opened and what triggered it. Check each marker, including the values before and after each event, to see if you can find a pattern.
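To make the numbered markers on the timeline more concrete, here is one way events of the same type could be grouped when they occur in close succession. This is a sketch under assumptions: the function `cluster_events`, the 5-minute window, and the sample timestamps are all made up for illustration, and the console's actual grouping rule is not documented here.

```python
from datetime import datetime, timedelta


def cluster_events(timestamps, window=timedelta(minutes=5)):
    """Group event times into clusters where each event follows the
    previous one within `window`. Returns (first_time, count) per cluster."""
    clusters = []
    previous = None
    for ts in sorted(timestamps):
        if previous is not None and ts - previous <= window:
            first, count = clusters[-1]
            clusters[-1] = (first, count + 1)
        else:
            clusters.append((ts, 1))
        previous = ts
    return clusters


# Made-up event times for one event type on one resource instance.
events = [
    datetime(2024, 1, 1, 9, 0),
    datetime(2024, 1, 1, 9, 3),
    datetime(2024, 1, 1, 9, 7),
    datetime(2024, 1, 1, 11, 30),
]
markers = cluster_events(events)
for first, count in markers:
    print(first.strftime("%H:%M"), count)
# 09:00 3
# 11:30 1
```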

Linux resource instance dashboard

Operating system dashboards plot metrics for the system characteristics. Scroll through the dashboard sections:

  1. Click a point or drag along the x-axis of a chart to read the values at that time.
  2. Select a device name in the Disk Device table to see the transfers per second for that device in the corresponding line chart. You can sort the table by clicking a column, and filter it by entering a value in the filter box.

    Linux OS instance dashboard section showing Disk Device table and Transfers per second line chart

  3. Expand the Process section and select multiple process IDs to see how the CPU (%) and Resident memory (MB) charts aggregate the values.

    Linux OS instance dashboard section showing the Process ID table and charts.

You can use the Collapse and Expand twisties to show only what's of interest to you.

What if you want to see other metrics?

There's another metric you'd like to check: page outs, which might indicate memory issues:

  1. Scroll down to the Custom Metrics section and expand the view.
  2. Open the Filter metric drop-down list and select System Pages Paged Out Per Second.

The list shows the metrics that are available for selection from the Linux data provider. You can add other metrics, but this metric and your analysis of the metrics around the event times is enough to tell you that the threshold you created needs a minor adjustment.

Linux Systems instance dashboard with System Pages Paged Out Per Second selected from the Filter metric list

Edit the threshold definition

Return to Thresholds and edit the threshold that you created earlier:

  1. Go to Administer > Monitoring > Thresholds.

    Linux OS instance breadcrumbs: Resources > Linux Systems

  2. Find the Linux_disk_reads_80_percent threshold in the list and click the link to edit the threshold.

  3. Change the Disk IO Ext Disk Read Percent value from 80 to 75.
  4. You tested the threshold on one resource and now want to apply it to all your Linux systems, so change your selection in the Assign to resources section to Resource group.
  5. Select Finish to see the edited threshold assigned to Linux Systems.
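Before lowering a threshold, it can help to estimate how many more samples the new value would catch. The sketch below compares 80 and 75 against a handful of readings. The `count_breaches` helper and the sample values are hypothetical, not data from the product.

```python
def count_breaches(samples, limit):
    """Count samples that are greater than or equal to `limit`."""
    return sum(1 for s in samples if s >= limit)


# Made-up disk read percent samples, for illustration only.
samples = [62.0, 74.5, 76.0, 78.2, 81.0, 79.9, 75.0, 68.3]

at_80 = count_breaches(samples, 80)
at_75 = count_breaches(samples, 75)
print(at_80, at_75)  # 1 5
```

With these made-up readings, the lower value flags five samples instead of one, which is the kind of difference to weigh against the extra events it would open.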
