How do I monitor my IBM Cloud applications?
If you are a service owner or first responder, the following questions surely cross your mind each day:
What’s going on with my IBM Cloud application?
Are my customers satisfied with the service they’re getting?
Has performance changed recently?
Is anything unusual happening with my application?
What’s the difference between my on-premise and hosted services?
These questions often boil down to issues of performance and availability of your applications and services. You could use the IBM Cloud console and refresh your browser window every few minutes, but realistically, you need a solution that is automated and efficient.
To address the concerns above, we at IBM have defined the discipline of Cloud Service Management & Operations (CSMO) as “all the activities that an organization does to plan, design, deliver, operate, and control the IT and cloud services that it offers to customers.” Within that discipline are practices and toolchains that enable us to perform these tasks.
For this first post in the series, we’ll focus our attention on monitoring applications. Of course, there are tasks beyond monitoring that prompt questions like “If we have detected an issue, how will we solve it and return the application to a fully functional state?” These tasks will be discussed in a future post.
Types of monitoring
Monitoring can be roughly divided into three types:
Metrics – collecting numerical information from the application and platform. This may be a number that is calculated by the application (e.g., how many items are in a queue) or exposed by the platform (e.g., how much memory the process is consuming).
Logging – collecting textual information (e.g., an error message generated by the application).
Synthetic monitoring – sending an external message to the application and examining the response to determine the component’s status (e.g., sending a ping to a server or simulating an entire customer transaction).
Once the monitoring system has discovered that a specific metric has passed a threshold or a log entry matches a test, it will forward an event up the incident management toolchain so that the issue can be solved either automatically or manually.
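To make the three types concrete, here is a minimal Python sketch of how each kind of check might turn an observation into an event. Everything in it is an assumption made for illustration: the `Event` shape, the function names, and the sample values are hypothetical and do not come from any IBM product.

```python
import re
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Event:
    """An alert handed to the incident management toolchain (hypothetical shape)."""
    source: str   # "metric", "log", or "synthetic"
    message: str


def check_metric(name: str, value: float, threshold: float) -> Optional[Event]:
    """Metrics: raise an event when a numeric reading passes its threshold."""
    if value > threshold:
        return Event("metric", f"{name}={value} exceeded threshold {threshold}")
    return None


def check_log(line: str, pattern: str) -> Optional[Event]:
    """Logging: raise an event when a log entry matches a test (here, a regex)."""
    if re.search(pattern, line):
        return Event("log", f"log entry matched {pattern!r}: {line}")
    return None


def check_synthetic(name: str, probe: Callable[[], bool]) -> Optional[Event]:
    """Synthetic monitoring: call the application from outside and judge the response."""
    try:
        healthy = probe()
    except Exception as exc:  # a probe that cannot connect counts as a failure
        return Event("synthetic", f"{name} probe failed: {exc}")
    if not healthy:
        return Event("synthetic", f"{name} probe returned an unhealthy response")
    return None


if __name__ == "__main__":
    candidates = [
        check_metric("queue_depth", 120, threshold=100),
        check_log("ERROR: payment service timed out", r"\bERROR\b"),
        check_synthetic("checkout", lambda: True),  # healthy, so no event
    ]
    for event in candidates:
        if event is not None:
            print(f"forwarding {event.source} event: {event.message}")
```

In a real deployment the `probe` callable would issue an actual request (a ping or a scripted transaction), and the print statement would be replaced by a call into whichever event management tool you use.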
Choosing the right monitoring for your environment
Today’s cloud environments are not homogeneous; they’re a combination of traditional non-cloud platforms, public and private cloud platforms, and even third-party platforms. This mixing of platforms means your monitoring solution must account for these non-cloud and multi-cloud environments. To help guide your decisions, the following table summarizes the practices and tools that apply:
Don’t be intimidated by the size of this table and the plethora of acronyms! The following sections will step through each row and explain them in more detail, going from the bottom row to the top. Let’s begin with the key differentiator, the environment type.
Different types of platforms in hybrid cloud
Below is the bottommost row of the table; it indicates the platform for which the rows above it apply.
Traditional Environment means the on-premise environment of physical or virtualized servers that has been common for years with both automated and non-automated provisioning.
IBM Cloud Private is IBM’s application platform for developing and managing on-premises, containerized applications (PaaS/CaaS).
IBM Cloud is IBM’s one-stop cloud computing solution which provides multiple types of solutions (IaaS/SaaS/PaaS/CaaS/FaaS).
3rd Party Cloud may be either on-premises or cloud-based, depending on the provider.
The rest of this post will discuss the differences and commonalities of monitoring metrics and logs in the various types of Cloud workloads. We will discuss synthetic monitoring in a future post since, being external to the workload, it is similar in all environment types.
Monitoring the various types of Cloud workloads
Since your organization’s environment is likely to be a complex hybrid of multiple technologies and cloud types, you will probably need a variety of solutions to monitor each service in the best possible way.
Watch the workload’s infrastructure/platform: Infrastructure is divided into on-premise where you are responsible for the platform down to the hardware, and off-premise/cloud where your service provider supports the infrastructure and your sole concern is that the platform is available.
Monitor Cloud-Ready workloads: These are workloads that are suitable for running in the cloud, but their heritage is from the traditional environment. Applications running on Virtual Machines that can be lifted and shifted to the cloud are the classic example of Cloud Ready applications.
Monitor Cloud-Native workloads: These are workloads that were designed to run in the cloud. Container, runtime, and serverless applications are typical Cloud-Native workloads.
Collect logs: Since the multiple workloads create a wide variety and large amount of logs, it is critical to have a mechanism to collect and make sense of all the log entries. The collection and aggregation of logs is where problem analysis begins.
Remember that some solutions may be dedicated to specific workloads, but others may monitor multiple workloads. For example, you may use a single instance of Application Performance Manager to monitor both your traditional environment and IBM Cloud Private or install two instances, one for each workload. This decision will be made based on operational considerations and may differ from environment to environment.
Monitoring the cloud platforms
The first level of monitoring is that of the platform (when in the cloud) and the datacenter infrastructure (when on-premises). While each platform and infrastructure usually has a dedicated (siloed) monitoring solution, you can use Netcool Operations Insight (NOI) or Cloud Event Management (CEM) to collect events from these solutions and use Application Performance Management (APM) to monitor them independently.
The following is the list of IBM’s monitoring solutions for cloud platforms:
Application Performance Management (APM) is designed to intelligently monitor, analyze and manage cloud, on-premises and hybrid applications and IT infrastructure. APM can monitor all types of workloads for both Cloud and on-premise applications and infrastructure/platforms.
Netcool Operations Insight (NOI) and Cloud Event Management (CEM) are designed to collect, correlate and consolidate millions of events and alarms from your on- and off-premises environments. You use them to leverage siloed monitoring systems and gather information and events. NOI and CEM have a role in event management and incident management which goes beyond monitoring.
IBM Cloud has a status console that displays the state of the IBM Cloud platform, services and runtimes.
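To illustrate what consolidating events from siloed monitoring systems means in practice, the following Python sketch merges duplicate reports of the same problem into a single entry. It is a toy illustration only, not a description of how NOI or CEM work internally; the `ConsolidatedEvent` class, the `consolidate` function, and the tuple layout are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class ConsolidatedEvent:
    """One problem on one resource, however many tools reported it (hypothetical)."""
    resource: str
    summary: str
    count: int = 0
    sources: list = field(default_factory=list)


def consolidate(raw_events):
    """Merge events that describe the same problem on the same resource.

    raw_events is an iterable of (source, resource, summary) tuples, where
    source names the siloed monitoring tool that reported the event.
    """
    merged = {}
    for source, resource, summary in raw_events:
        key = (resource, summary)
        if key not in merged:
            merged[key] = ConsolidatedEvent(resource, summary)
        entry = merged[key]
        entry.count += 1
        if source not in entry.sources:
            entry.sources.append(source)
    # dicts keep insertion order, so events come back in first-seen order
    return list(merged.values())


if __name__ == "__main__":
    raw = [
        ("platform-monitor", "db-server-1", "disk almost full"),
        ("app-monitor", "db-server-1", "disk almost full"),
        ("app-monitor", "web-server-2", "high response time"),
    ]
    for event in consolidate(raw):
        print(event.resource, event.summary, event.count, event.sources)
```

Real event managers correlate on much richer criteria (topology, time windows, severity), but the goal is the same: operators see one incident rather than a flood of near-identical alarms.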
Monitoring cloud ready workloads
Cloud-Ready workloads (virtualized servers, middleware and so on) are also monitored using APM for the application performance and NOI/CEM to collect information from other monitoring solutions.
Those using Cloud Automation Manager (CAM) in IBM Cloud Private can orchestrate and control multiple clouds, but the monitoring of these resources is not performed under IBM Cloud Private itself. In other words, if you use CAM to provision a traditional virtual server within your datacenter, then you will use your traditional solution to monitor the servers and not the IBM Cloud Private monitoring solution.
Monitoring cloud native workloads
Cloud-native workloads are specifically designed to benefit from the automation and orchestration features that cloud platforms provide. These include containers running under Kubernetes and Cloud Foundry runtimes (in both IBM Cloud and IBM Cloud Private), as well as IBM Cloud Functions (in IBM Cloud).
While the same monitoring solutions for Cloud-Ready workloads exist, Cloud-Native workloads have further available solutions:
Prometheus is an open-source systems monitoring and alerting toolkit which is part of the Cloud Native Computing Foundation, together with Kubernetes. It can monitor multiple workloads, but is mostly used with the Container workloads. Prometheus comes built-in with IBM Cloud Private and can be deployed manually to monitor IBM Cloud workloads too.
IBM Cloud Monitoring automatically collects metric data from IBM Cloud applications and services, eliminating the need for agents. APIs make it easy to add custom metrics and to query your monitoring data. Cloud Monitoring can monitor all types of workloads in the IBM Cloud.
APM for DevOps is a new member of the APM solution suite, dedicated to ensuring the optimal performance of your applications and making the most efficient use of containerized resources.
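As a small illustration of the metrics side, Prometheus typically scrapes metrics over HTTP in a simple plain-text format. The Python sketch below renders a single gauge in that format using only the standard library. The `render_gauge` helper and the `queue_depth` metric are hypothetical names chosen for this example; in practice you would more likely use one of the official Prometheus client libraries, which also handle escaping and serving the endpoint.

```python
def render_gauge(name: str, help_text: str, samples: dict) -> str:
    """Render one gauge in the Prometheus text exposition format.

    samples maps a tuple of (label, value) pairs to the numeric reading.
    Label values are assumed to contain no quotes, backslashes, or newlines,
    so this sketch performs no escaping.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} gauge"]
    for labels, value in samples.items():
        label_text = ",".join(f'{key}="{val}"' for key, val in labels)
        suffix = f"{{{label_text}}}" if label_text else ""
        lines.append(f"{name}{suffix} {value}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    text = render_gauge(
        "queue_depth",
        "Items waiting in the queue.",
        {(("app", "orders"),): 7, (("app", "billing"),): 2},
    )
    print(text, end="")
```

An application would serve this text from an HTTP endpoint (conventionally `/metrics`) for the Prometheus server to scrape on a schedule.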
Due to the dynamism of Cloud environments, the collection, aggregation and analysis of logs becomes a cornerstone of the monitoring solution. While Cloud-Ready applications may still write logfiles to disks and depend on an external collector to read them, Cloud-Native applications will usually simply stream messages out. These log entries will be lost unless an existing log collector saves them.
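The following Python sketch shows what "simply streaming messages out" can look like: the application writes one JSON object per line to standard output and leaves collection to the platform. The `JsonFormatter` class, the `build_logger` helper, and the field names are assumptions made for this example rather than a format required by any of the tools discussed here.

```python
import json
import logging
import sys


class JsonFormatter(logging.Formatter):
    """Format each record as one JSON object per line for easy parsing downstream."""

    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "time": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        })


def build_logger(name: str, stream=None) -> logging.Logger:
    """Create a logger that streams JSON lines to stdout (or to a given stream)."""
    handler = logging.StreamHandler(stream if stream is not None else sys.stdout)
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.addHandler(handler)
    logger.propagate = False  # avoid duplicate lines via the root logger
    return logger


if __name__ == "__main__":
    log = build_logger("orders")
    log.info("order accepted")
    log.error("payment service timed out")
```

Because nothing is written to disk, these lines exist only for as long as something downstream is reading the stream, which is why a log collector must be in place before the application starts.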
The following is the list of IBM’s solutions for collecting and analyzing logs:
IBM Cloud Log Analysis collects and aggregates application and platform logs for consolidated application insights. It enables “zero configuration” out-of-the-box automated log collection of Cloud Foundry and Containers workloads. Log Analysis can collect logs from all types of workloads.
The Elastic Stack, previously known as the ELK stack (Elasticsearch, Logstash, and Kibana), enables you to securely and reliably search, analyze, and visualize your data. It is installed as part of IBM Cloud Private.
IBM Operations Analytics – Log Analysis helps turn terabytes of big operational log data into understandable and actionable insights for quicker problem solving and better overall service. It accelerates problem isolation, identification, and repair by providing dashboard views into analyzed sources of log data from solutions and devices across the service management infrastructure.
Log File Agents are components of APM which read and correlate logs.
Service Management toolchain
While each of the cloud platforms and workloads may benefit from using a dedicated monitoring solution, the rest of the service management toolchain benefits from being consolidated. For example, it is simpler and easier for the organization if there is a single dashboard solution so everyone is looking at the same dashboard and a central ticketing solution to facilitate the tracking and transferring of tickets within the organization. These considerations are shown in the final and topmost row of the table, Service Management:
Further details about Service Management principles in general and the Incident Management process in particular may be found on the IBM Garage Method website.
This post reviewed the table summarizing the tools and practices best suited for monitoring the four types of hybrid cloud environments (traditional, private, public, and third party):
It is your compass to better performance and availability.
In future posts, we will present specific monitoring solutions to explain how to get the greatest benefit from them. We will also continue exploring the Cloud Service Management & Operations toolchain and expand on the other capabilities necessary to achieve great performance and availability for your services and applications.