This article explains why it is best to use cloud performance metrics proactively, fixing problems before service outages can happen. Three proactive steps are described: monitoring performance, testing performance, and crafting a cloud performance metrics policy.
To measure how well cloud services are performing, you should look at three types of expectations: user, developer, and technology maintainer. Let's look underneath the user interface to discover what developers and technology maintainers are doing to meet those user expectations.
It's a calm, clear day here in cloud user land. Since everything is running smoothly, users can expect to find elements (such as the login box) in the application quickly and to scale the application up and down without issues. Oh, and that the application is always available. Users can also expect that download time is fast, that the application responds quickly to their requests, and that the provider backs up the data in the background.
Every user is happy and rates the provider as having a good business reputation: reliable, fast, secure, and efficient. The application is performing well. Users expect that developers and technology maintainers are properly using performance metrics to ensure the application is running without issues. They need information from the application to make important business decisions or continue business operations.
Developers can expect that all applications that ran well at the in-house data center will run well in the cloud. They can expect that the application in the cloud is stateful (I will explain this later), has been properly fixed with a new version, runs fast, responds quickly to users' data requests, and scales resources up and down smoothly and without issues.
All developer expectations rely on user expectations. Developers want users to be happy with the application they develop for them. Developers monitor and test performance using metrics spelled out in a cloud service measurement policy.
If any of these performance metrics is missing, the developer has no way of checking how well the cloud application is performing. Poor performance can result in unexpected service outages leaving the users stranded without the information they need to make important business decisions or continue their business.
Technology maintainer expectations
Technology maintainers are usually the providers or third parties working for the providers. They ensure that technologies are properly used to migrate the in-house application to the cloud. Once in the cloud, they monitor and test performance using metrics spelled out in a cloud service measurement policy.
Ensuring good performance is important in maintaining the provider's good business reputation as reliable, fast, secure, and efficient. If any of these performance metrics is missing, the provider has no way of checking how well the cloud application is performing. Poor performance can result in unexpected service outages that leave:
- Users stranded without the information they need to make important business decisions.
- Developers stranded in the middle of application monitoring and testing in the cloud.
Scene 1: Nightmare of service outages
One day in CloudUserLand, users cannot access the application or get a response from it. All of a sudden they encounter service outages. Then they get a message on their screen in which the provider "apologizes that we are unable to provide service since we are on scheduled maintenance."
Frantic for a fast solution, the developer becomes reactive, trying to save the provider's reputation, but in vain. First he stops production while pretending to be on scheduled maintenance. He puts a notice on users' browsers that says "Service is temporarily down. Please wait." Then he begins to work on the application in house.
Meanwhile the users wait impatiently. In no time, user tempers flare and they demand their money back for poor service. They cannot make important business decisions or continue business to bring in more revenue. Some are on the verge of losing customers or their reputations.
The provider, once thought to be reliable and efficient, is now considered unreliable and inefficient. With service still not restored after a few hours, users cancel their subscriptions and go to another provider that has a long-standing reputation for reliability, availability, security, and efficiency.
Here's what the tech experts find out is going wrong:
- First, the developer finds it difficult to locate the code causing the problem of statefulness. The application did not respond correctly to subsequent states.
- Second, the developer finds out too late that the application that runs perfectly well in house was written by a previous developer as a lengthy single unit rather than, say, 500 modular parts. When the current developer tries to patch the application, the new version breaks the functions of the website on which the SaaS application runs. The developer frantically divides the program into modular parts (about 10 to 20 to save time). When he tries to patch one of the modules with a new version, he discovers that the versioning still breaks the website's functions.
- Third, after the developer fixes the problem with the new version, he tests the application in the cloud to determine how well the resources are equally allocated to run the application. He discovers that load balancing (resource threshold) fails because a single resource that had reached the maximum failed. Other resources that had only reached 75 percent of their maximum capacity could not take over the business transactions from the failed resource instance. This created a domino effect on other performance parameters: the response threshold and data request threshold.
Before migration, the developer did not set a threshold so that each resource instance would be used at 50 percent of its capacity; with that headroom, if one resource instance fails, the healthy resource instances can take over. He did not check whether the application running in healthy resource instances would respond quickly or whether users, up to the limit specified in the user license, could concurrently access the stateful application.
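The failure in the third finding comes down to simple arithmetic, which the following minimal sketch illustrates. It assumes that every resource instance has the same capacity and that load can be split freely among the survivors; the function name `can_absorb_failure` is hypothetical and not part of any provider's tooling.

```python
def can_absorb_failure(utilizations, failed_index):
    """Return True if the surviving instances have enough spare capacity
    to take over the load of the failed instance.

    utilizations: fraction of capacity in use per instance (0.0 to 1.0).
    Assumption: all instances have equal capacity.
    """
    failed_load = utilizations[failed_index]
    survivors = [u for i, u in enumerate(utilizations) if i != failed_index]
    spare = sum(1.0 - u for u in survivors)
    return bool(survivors) and spare >= failed_load


# One instance at its maximum fails; the others are at 75 percent.
nightmare = can_absorb_failure([1.0, 0.75, 0.75], failed_index=0)

# Every instance is held at 50 percent, as the threshold would require.
with_headroom = can_absorb_failure([0.5, 0.5, 0.5], failed_index=0)
```

In the first case the two survivors have only half an instance of spare capacity between them, so they cannot take over a full instance's load. In the second case they can.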
Scene 2: Proactive with performance metrics
There are three points at which you can fix these problems: before they happen, while they are happening, and after they happen. Here's why "before they happen" is the best position for avoiding potential problems.
Act 1: Monitor performance
You should monitor performance of applications in the cloud to detect problems before they happen. If you do not, you may have to give users credits, refunds, or free months of service under the terms of the Service Level Agreement (SLA) for failing to meet service availability guarantees. Each SLA is measured for services, transactions, and servers based on the application's SLA on each server.
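To see how an availability guarantee turns into a number you can monitor, here is a minimal sketch that computes availability from downtime and compares it with a guaranteed level. The 99.9 percent guarantee and the 30-day measurement period in the example are assumptions for illustration only; real SLAs define their own levels, periods, and exceptions.

```python
def availability_percent(total_minutes, downtime_minutes):
    """Availability over a measurement period, as a percentage."""
    if total_minutes <= 0:
        raise ValueError("total_minutes must be positive")
    return 100.0 * (total_minutes - downtime_minutes) / total_minutes


def sla_breached(total_minutes, downtime_minutes, guaranteed_percent):
    """True if measured availability falls below the guaranteed level."""
    return availability_percent(total_minutes, downtime_minutes) < guaranteed_percent


# Hypothetical example: a 30-day month with a three-hour outage,
# measured against an assumed 99.9 percent guarantee.
MINUTES_IN_30_DAYS = 30 * 24 * 60
breached = sla_breached(MINUTES_IN_30_DAYS, 180, guaranteed_percent=99.9)
```

A three-hour outage in a 30-day month leaves availability at roughly 99.58 percent, which would fall short of the assumed guarantee.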
Log analysis is the most popular tool for monitoring performance, especially for checking response times and concurrency. However, it can be cumbersome to use when monitoring the performance of multiple applications in multiple locations.
A better way of monitoring performance is to set up a dashboard of performance metrics to get to the heart of a problem as it is happening. When one of the metrics shows signs of leaning toward negative results, you should have metrics tools at your fingertips so that you can proactively root out potential application problems before users find them.
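One simple way a dashboard could flag a metric that is "leaning toward negative results" is a warning band just below the threshold. The sketch below is an assumption about how such a check might look, not a description of any particular dashboard product; the 80 percent warning ratio is an arbitrary illustrative choice.

```python
def metric_status(value, threshold, warn_ratio=0.8):
    """Classify a metric reading for display on a dashboard.

    "breach"  : the reading exceeds the threshold.
    "warning" : the reading is within the warning band below the threshold.
    "ok"      : the reading is comfortably below the threshold.
    """
    if value > threshold:
        return "breach"
    if value >= warn_ratio * threshold:
        return "warning"
    return "ok"


# Hypothetical readings of concurrent users against a threshold of 2,000.
statuses = [metric_status(v, 2000) for v in (1500, 1700, 2100)]
```

The middle reading is the interesting one: it has not broken the threshold yet, but it is close enough that someone should look before users notice.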
Some suggested performance metrics include statefulness, versioning, and four thresholds:
- resource threshold
- user threshold
- data request threshold
- response threshold
Statefulness refers to how well the application responds correctly in subsequent states. While most applications are inherently stateful, you never know when they might stop behaving statefully.
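A statefulness check can be pictured as verifying that a recorded sequence of states follows only the transitions the application allows. The sketch below is a minimal illustration; the state names and the `ALLOWED` table are hypothetical and would come from your own application's workflow.

```python
# Hypothetical workflow: each state maps to the states that may follow it.
ALLOWED = {
    "start": {"identified"},
    "identified": {"checked_out"},
    "checked_out": set(),
}


def first_invalid_transition(states, allowed=ALLOWED):
    """Return the index of the first state that does not correctly follow
    its predecessor, or None if the whole sequence is valid."""
    for i in range(1, len(states)):
        if states[i] not in allowed.get(states[i - 1], set()):
            return i
    return None


good_run = first_invalid_transition(["start", "identified", "checked_out"])
bad_run = first_invalid_transition(["start", "checked_out"])
```

The second run skips the identification step, so the check points at index 1 as the place where the state flow broke.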
Versioning refers to how well a new build avoids breaking an existing application's functions, even if the previous version responded correctly from one state to the next until the application's tasks ended. Versioning breaks can occur when duplicate version names or numbers are assigned to the application.
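Duplicate version names or numbers are easy to catch before a release. The following minimal sketch assumes the versions are available as a simple list of strings; how you collect that list depends on your own build or deployment records.

```python
from collections import Counter


def duplicate_versions(versions):
    """Return the version labels that appear more than once, sorted."""
    counts = Counter(versions)
    return sorted(label for label, n in counts.items() if n > 1)


# Hypothetical release history with one label assigned twice.
duplicates = duplicate_versions(["1.0", "1.1", "1.1", "2.0"])
```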
Resource threshold refers to how well resource consumption is balanced dynamically for applications in the cloud. The threshold level should be at or below the maximum number of additional resource instances that could be consumed. When resource consumption exceeds the threshold level during a spike in workload demands, additional resource instances are allocated. When the demand returns to or below the threshold level, the resource instances that were created are freed up and put to other use.
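The allocate-and-release behavior described above can be sketched as a small decision function. This is an illustration under simple assumptions (a single consumption figure compared with a single threshold), not a real autoscaling API; the function name and the returned action labels are hypothetical.

```python
def scaling_action(consumption, threshold, extra_instances):
    """Decide what to do with additional resource instances.

    consumption and threshold use the same unit (for example, a fraction
    of capacity). extra_instances is how many additional instances are
    currently allocated beyond the baseline.
    """
    if consumption > threshold:
        return "allocate"   # demand spike: add resource instances
    if extra_instances > 0:
        return "release"    # demand is back at or below the threshold
    return "hold"           # nothing to add, nothing to free


spike = scaling_action(0.6, threshold=0.5, extra_instances=0)
recovery = scaling_action(0.4, threshold=0.5, extra_instances=2)
steady = scaling_action(0.4, threshold=0.5, extra_instances=0)
```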
User threshold refers to how well users can concurrently access the application up to the limit specified in the user license from the provider. For example, if the license is limited to 3,000 users but allows a maximum of only 2,500 users to access the application concurrently, then the threshold level is set at 2,000 concurrent users. If the number of concurrent users is at or below the threshold level, the application is continuously available, assuming that resource consumption and data requests are below their respective threshold levels.
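The three numbers in this example nest inside one another, which the following minimal sketch makes explicit. The class and function names are hypothetical; the default values are the ones used in the example above.

```python
from dataclasses import dataclass


@dataclass
class UserLimits:
    licensed_users: int = 3000   # limit stated in the user license
    max_concurrent: int = 2500   # most users allowed at the same time
    threshold: int = 2000        # level at which you want to be alerted

    def validate(self):
        """The threshold must sit at or below the concurrent maximum,
        which must sit at or below the licensed limit."""
        if not 0 < self.threshold <= self.max_concurrent <= self.licensed_users:
            raise ValueError("expected threshold <= max_concurrent <= licensed_users")


def within_user_threshold(concurrent_users, limits):
    """True while the number of concurrent users is at or below the threshold."""
    return concurrent_users <= limits.threshold


limits = UserLimits()
limits.validate()
at_threshold = within_user_threshold(2000, limits)
over_threshold = within_user_threshold(2001, limits)
```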
Data request threshold refers to how many data requests can be processed quickly. The threshold level is set below the maximum number of data requests and the maximum size of data requests that users can send concurrently. If the number of data requests exceeds the threshold level, a message should pop up to show how many data requests are in the queue waiting to be processed.
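The queue message described above might look like the following minimal sketch. It tracks only the number of requests, not their size, and the `RequestQueue` class and the wording of the message are assumptions made for illustration.

```python
from collections import deque


class RequestQueue:
    """A first-in, first-out queue that reports when it passes its threshold."""

    def __init__(self, threshold):
        self.threshold = threshold
        self._pending = deque()

    def submit(self, request):
        """Queue a request; return a notice if the threshold is exceeded."""
        self._pending.append(request)
        waiting = len(self._pending)
        if waiting > self.threshold:
            return f"{waiting} data requests are waiting in the queue"
        return None

    def process_next(self):
        """Remove and return the oldest request, or None if the queue is empty."""
        return self._pending.popleft() if self._pending else None


queue = RequestQueue(threshold=2)
notices = [queue.submit(name) for name in ("a", "b", "c")]
```

Only the third request pushes the queue past its threshold, so only that call returns a notice.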
Response threshold refers to how quickly the application responds to a user's data request, or how quickly one part of the application responds to another part. The threshold level is set below the maximum tolerable response time.
Response threshold also refers to what happens when the service provided by the application times out.
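Measuring a response against its threshold can be as simple as timing the call. The sketch below covers only the measurement side and does not attempt to handle timeouts; the `timed_call` helper is hypothetical, and the five-second threshold is an illustrative value.

```python
import time


def timed_call(func, threshold_seconds):
    """Run func, then report its result, how long it took, and whether
    the elapsed time stayed within the response threshold."""
    start = time.perf_counter()
    result = func()
    elapsed = time.perf_counter() - start
    return result, elapsed, elapsed <= threshold_seconds


# Hypothetical stand-in for a user's data request.
result, elapsed, within_threshold = timed_call(lambda: "report ready", 5.0)
```

In practice `func` would wrap the real request, and the elapsed time would be recorded so that the dashboard can show how close responses are getting to the threshold.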
Act 2: Test performance
You should test performance before the dashboard begins to show potential problems and after you proactively find them (before the user does). You need more than load balancing tools to do this. A better way to test performance is to use statefulness, versioning, and the following thresholds:
- Resource threshold
- User threshold
- Data request threshold
- Response threshold
Statefulness: How well does the application flow from one task to another? Did the state of a function go from one task (such as a library user providing identification information correctly) to the next task (in which the librarian checks out the book)?
Versioning: How well is a new version of the application performing? Is it breaking the logic of the application?
Resource threshold: How well are multiple applications using multiple resources? What is the capacity of the servers that resource instances are using? Each server should not be filled beyond 50 percent of its capacity. Are there sufficient additional resource instances during spikes in workload demands (say, for instance, during the Christmas buying crunch)?
User threshold: How many users can concurrently access the application? Can the system withstand the stress of a sudden increase in the number of concurrent users?
Data request threshold: How many data requests can a queue hold and for how long? Is the queue moving the requests quickly?
Response threshold: How well does the application respond to a user or a data request? What happens when a task in the application times out? Does it go to another task to continue service to the users?
Act 3: Craft a performance metrics policy
If you cannot fathom how to start to build your own performance metrics policy, here's a checklist of what elements should be included in the policy:
- Purpose: What's it about?
- Scope: Draw a border around the policy.
- Background: Who's doing what?
- Actions: Roll up your sleeves.
- Constraints: Work with them.
Now for more detail.
- Purpose: What's it about?
- To help the users find out what the policy is about, state its purpose. Here is a template example you can use.
The purpose of this policy is to ensure all cloud service types are performing well based on all performance metrics. They are statefulness, versioning, and four types of thresholds: resource, user, data request, and response.
- Scope: Draw a border around the policy
- Define the scope by drawing a border around the policy. Within this boundary, specify which cloud service types the provider hosts and the consumer rents or subscribes to. If the provider strays outside the fence while accepting the consumer's agreement to comply, specify what the provider needs to do and how consumers can report performance problems with SaaS, PaaS, and IaaS components singly or as a group.
Here are suggested scores for each metric:
- Statefulness: 100 percent success of the current state of a task responding in subsequent states.
- Versioning: 100 percent success of not breaking the application's logic.
- Resource threshold: Up to 50 percent of the server's capacity to be occupied by resource instances.
- Data request threshold: A specified number of data requests a queue can hold.
- User threshold: Up to 70 percent of the users specified in the license.
- Response threshold: Number of seconds of response delay not noticeable by users (say, five seconds).
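The suggested scores above can be written down as data and checked against measured values. The following minimal sketch does that for the scores that have a concrete number; the data request threshold is left out because the policy must supply its own figure. The metric key names and the `failing_metrics` helper are hypothetical.

```python
import operator

# Each entry: metric name -> (comparison the measurement must satisfy, target).
# Key names are illustrative; the targets are the suggested scores above.
POLICY = {
    "statefulness_success_pct": (operator.ge, 100),
    "versioning_success_pct": (operator.ge, 100),
    "server_capacity_used_pct": (operator.le, 50),
    "concurrent_users_pct_of_license": (operator.le, 70),
    "response_delay_seconds": (operator.le, 5),
}


def failing_metrics(measured, policy=POLICY):
    """Return the names of measured metrics that miss their suggested score."""
    return sorted(
        name
        for name, (compare, target) in policy.items()
        if name in measured and not compare(measured[name], target)
    )


# Hypothetical measurements taken from a test run.
measured = {
    "statefulness_success_pct": 100,
    "versioning_success_pct": 98,
    "server_capacity_used_pct": 62,
    "concurrent_users_pct_of_license": 40,
    "response_delay_seconds": 3,
}
problems = failing_metrics(measured)
```

Here the versioning and server capacity measurements miss their scores, so those are the two metrics to investigate first.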
- Background: Who's doing what?
- The very first thing the consumer wants to know is whether the provider is internal or external. The next thing the consumer wants to know is what performance metrics tools the provider will use to measure how well an in-house application performs in the cloud.
The consumer may also want to know how performance metrics are related to guaranteed levels of service availability as set forth in an SLA.
- Actions: Roll up your sleeves
- Here are some suggestions for actions to take:
- Action #1. Refer to metrics tools specified in the cloud computing metrics policy.
- Action #2. Indicate whether to monitor and test performance 24 hours a day.
- Action #3. Require training programs on monitoring and testing performance with metrics tools.
- Action #4. Give advance notice of scheduled maintenance or upgrades to user access management, data protection technologies, and virtual machines.
- Action #5. Send notice to consumers on planned proactive actions to be taken on the use of performance metrics.
- Action #6. Send consumers copies of the cloud computing performance metrics policy for review and questions to be resolved before the consumer signs up for a cloud service.
- Action #7. Set up a dashboard of performance metrics to get to the heart of a problem when it happens.
- Constraints: Work with them
- There will be some constraints standing in your way, such as:
- Service priority issues for different groups of consumers, depending on the roles the organization assigns to them. An end user with administrative privileges, including using logs to monitor performance, has higher priority in accessing a SaaS application than an end user without those privileges.
- Service exceptions to the performance metrics and the SLA. I'll give you a hint: accidental cutting of optics that is not within the direct control of the provider, scheduled maintenance (planned and unplanned), and proactive behavioral changes to applications migrating from in house to the cloud.
- Service penalties when system performance slides down significantly from the guaranteed level of service availability. Give the consumer the right to get credits, refunds, or free months as long as the failure to meet the guarantee is not a service exception. Specify in an exit clause the process for enforcing the consumer's right to terminate a cloud service.
When you find constraints blocking you, the best thing is to work with them. First, you can use constraints to enhance the security posture of the performance metrics policy. Second, if any of the performance metrics are not satisfactory, get ready with remedies to protect the consumer if system performance slides down from the guaranteed levels of service availability.
Scene 3: What if you missed your chance to go proactive?
What happens if you missed your chance to go proactive? Should you react differently to the scenario previously discussed, the nightmare of service outages? On-the-fly repairs do not always work; patching a module may fail, as can load balancing. In that scenario, performance parameters including resource, user, data request, and response thresholds were not in place.
Here are some actions you can fall back on:
- Back up the application, system, and data at a remote location (very important in a disaster recovery plan in case of an earthquake).
- Use a failover mechanism to ensure healthy servers take over the application and transactions from the failing servers.
- Interpret the SLA to give consumers credits, refunds, or free months for failing to guarantee service availability.
After the backups have been restored, the failover mechanism has taken over, and the provider has carried out the terms of the SLA for failing to meet service availability, you should start preparing to go proactive: monitoring performance, testing performance, and building a policy to put in place.
You need a bit of pre-planning to go proactive with cloud computing performance metrics: You have to decide how performance should be monitored and tested and how to craft your performance metrics to take cloud computing services into account.
Developers and technology maintainers must communicate with the cloud service users about how the cloud service is expected to perform so that performance problems can be found before they happen.
As a cloud consumer, the most important thing to do is to get a copy of the cloud computing performance metrics policy from the provider for review, and to resolve any questions you have before negotiating with the provider.
- The author discusses threshold policy in the articles "Balance workload in a cloud environment: Use threshold policies to dynamically balance workload demands," "Cloud computing versus grid computing: Service types, similarities and differences, and things to consider," and "Build proactive threshold policies on the cloud."
- The author discusses proactive vs. reactive ways of making application changes when you migrate them to the cloud in the article "Change app behavior: From in house to the cloud."
- The author discusses cloud service security and how to mitigate risks to cloud services to ensure high uptime availability in the article "Cloud services: Mitigate risks, maintain availability."