In a cloud environment, a threshold policy is a desirable and important attribute: it is used to check and manage resources when workload demands must be balanced dynamically after reaching a predetermined threshold level. The policy tells the system to create instances of the necessary resources depending on how much workload demands exceed the threshold level.
Before I go into more detail on considerations for establishing and using a threshold policy to dynamically balance workload demands by automatically creating and releasing resource instances, let's define threshold policy in this context.
Let's look at a few key attributes of threshold policy.
The response period between the time the system detects workload demands reaching the threshold level and the time it creates the additional resource instances must be as near to instantaneous as possible. When workload demands return to a point below the threshold level, the system will de-allocate these resources and put them to other use.
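The allocate-and-release behavior described above can be sketched in a few lines. This is only an illustration under stated assumptions, not a provider's API: the function names, the `capacity_per_instance` parameter, and the idea that every instance carries a fixed amount of load are all hypothetical simplifications.

```python
import math


def instances_needed(demand: float, capacity_per_instance: float,
                     threshold: float = 0.75, minimum: int = 1) -> int:
    """Return the instance count that keeps utilization at or below the threshold.

    demand                -- current workload in hypothetical units (e.g. requests/s)
    capacity_per_instance -- load one instance can carry at 100 percent (assumed fixed)
    threshold             -- fraction of capacity treated as the threshold level
    """
    if capacity_per_instance <= 0 or not 0 < threshold <= 1:
        raise ValueError("capacity must be positive and threshold in (0, 1]")
    usable = capacity_per_instance * threshold
    return max(minimum, math.ceil(demand / usable))


def scaling_action(current: int, demand: float, capacity_per_instance: float,
                   threshold: float = 0.75) -> int:
    """Positive result: instances to create. Negative result: instances to release."""
    return instances_needed(demand, capacity_per_instance, threshold) - current
```

A positive return value from `scaling_action` corresponds to creating additional instances when demand exceeds the threshold level; a negative value corresponds to de-allocating instances once demand falls back below it.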
The information that should be in a threshold policy is influenced by:
- The type of cloud service the consumer rents.
- How much control the consumer has over the operating systems, hardware, and software.
- The type of industry the consumer is in (for example, retail, energy and utilities, financial markets, healthcare, and chemical and petroleum).
The cloud service provider may be either internal within an organization-controlled data center or hosted externally by a member of the telecommunication industry (such as IBM®). The provider must ensure integration with back-office systems so that ordering, provisioning, metering, rating and charging, billing and other functions support consumer activities and transactions.
As an example, a retail industry consumer of a cloud service had a large-scale application at a data center that did credit card validation in the cloud while workload demands were below the threshold level. When the Christmas buying season crunch hit, the system detected higher workload demands exceeding the threshold level. In response, the system quickly created additional instances of resources to balance workload demands dynamically.
As the retailer moved out of the buying crunch, the workload demands fell below the threshold level, so the resource instances that had been created in the cloud were freed up. Since the organization has some control over the hardware, it is able to negotiate with the cloud service provider on the terms set in the threshold policy. (It's always good to negotiate the parameters of the policy before the buying crunch.)
The remainder of this article provides some background on cloud services types and shows you how a threshold policy for a cloud type can be different from the policy for another cloud type. It discusses threshold policies on resource management for application testing, production, and capacity planning and looks at some of the more important issues to consider, such as impacts of a threshold policy on a service level agreement (SLA).
First consider which of these three cloud service types fits your needs:
- Software as a Service (SaaS)
- Platform as a Service (PaaS)
- Infrastructure as a Service (IaaS)
We'll also discuss how the size of your operation can influence whether your best choice of a cloud service type is public or private.
Let's assume as a retail industry consumer, you get a license from a SaaS provider for your company to run an application for web use as a service on demand. You choose a subscription or pay-as-you-go method because you do not have hardware or software to buy, install, and maintain, nor do you have to update the application.
The only control you have is in using the provider's application from a desktop or mobile device to process such business tasks as computerized billing and invoicing and human resource management.
Although you do not control deployed applications, operating systems, storage, or networking, you need to see a threshold policy from the provider on resource management in case there is a surge, planned or unexpected, in workload demand:
- You want to know how the provider sets threshold levels to ensure the continuous operational availability of the SaaS.
- You want to know what the provider's SLA terms and backup policy are.
- If the service fails because the provider was unable to handle a surge in demands dynamically, you want to know whether you can get credits, refunds, or free months, or terminate the SaaS, as set forth in an SLA.
With PaaS, you want to develop retail applications from creation to deployment for application testing (or production as a service).
Unlike with SaaS, you can control all the applications found in a full business life cycle for the platform (for example, spreadsheets, word processors, backups, billing, payroll processing, and invoicing).
The provider controls the operating system, hardware, and network infrastructure on which the applications are running. The provider can build, deploy, run, and manage upgrades and patches to all functionalities of, say, a retail management application.
Of course, you want a threshold policy from the PaaS provider:
- You want to know how the provider sets threshold levels to ensure the PaaS will continue to be available.
- If the service fails because the provider was unable to handle a surge in demands dynamically, you want to get credits, refunds, free months, or terminate the service.
With IaaS you can control the operating systems, network equipment, and deployed applications at the virtual machine level:
- You can scale the number of virtual servers or blocks of storage area up or down.
- You can pay per-use for the infrastructure of these traditional computing resources in the cloud environment.
You will need to see a threshold policy for the infrastructure from the IaaS provider:
- You want to know how the provider sets threshold levels to ensure the IaaS will continue when there is a surge in workload demands.
- You want to be able to negotiate with the IaaS provider on the terms in the threshold policy and the SLA for your company.
- If the service fails because the infrastructure of computing resources was unable to handle a surge in demands dynamically, resulting in slow response times, you want to get credits, refunds, or free months, or terminate the service, as stated in an SLA.
As an example, my company generates revenues greater than US$1 billion. We find private clouds may be more cost-effective than public clouds. A private internal cloud has many of the same business characteristics as a public cloud, but with much higher levels of governance, security, availability, and recoverability than small businesses with revenues, say, less than US$1 million would have.
With a public cloud, data may be stored in unknown locations and may not be easily retrievable. This is in contrast to a private cloud that allows a consumer to retrieve data from known locations in a specific jurisdiction (like the US). Unknown locations are not suitable for storing compliance, privacy, and sensitive test data. They might be in geographical areas where privacy and compliance regulations in one country differ from those in another country. Laws vary from one country to another regarding data export controls.
When creating a threshold policy, my company requires the highest levels of dynamic balancing of workload demands in a cloud environment. The system must be able to quickly create additional instances of the resources when workload demands exceed the threshold level.
Due to the large size of my company's operation, transaction-oriented workloads are higher than they would be for small businesses. The range and number of transaction types are also greater for my company than for small businesses. Since transaction types are identified by a two- or three-bit numerical or character code, a large company or small business needs to associate a business transaction category with each type. A business transaction category appropriate for a large company (such as financial leasing) may not be appropriate for a small business.
Threshold policy varies from one industry to another for each cloud service type. The policy can be influenced by organization type, organization size, market conditions, seasonal workload demands, the economy, changing mandates, emerging technologies, and the frequency of adverse weather conditions.
The number of data centers also depends on the industry; for instance, the government sector is a heavy user of data centers and has been looking for ways to save costs by renting services on demand to ensure operation availability and security in the cloud environment.
I've already mentioned six industries as examples (retail, energy and utilities, financial markets, healthcare, telecommunications, and chemical and petroleum); there are others:
- Aerospace and defense
- Consumer products
- Forest and paper
- Life sciences
- Media and entertainment
- Metals and mining
- Travel and transportation
- Fabrication and assembly
- Industrial products
- Wholesale distribution and services
Let's compare and contrast the retail and chemical/petroleum industries on considerations for threshold policy. When each industry's system detects workload demands exceeding the threshold level, the system quickly creates additional instances of resources to balance workload demands dynamically. As the workload demands fall below the threshold level, the instances of resources that were allocated are freed up.
The retail industry comprises small businesses and large companies engaged in selling finished products to end-user consumers. The chemical/petroleum industry comprises industrial plants and large and small companies investing in and selling oil, gas, and chemical products to industrial consumers.
Spikes in workload demands for the retail industry are usually predictable (like the Christmas buying season crunch). Those for the chemical/petroleum industry usually depend on factors that aren't as easy to predict, so they are mostly tracked: the economy, the drive for supply chain optimization, investments in deep oil drilling, and unpredictable adverse weather conditions (such as a warm winter one year and blizzards the next).
The differences in transaction types (industrial vs. retail) and the choice of public, private, or hybrid clouds affect the creation of a threshold policy. Transaction types are used to group revenue and expense items according to business or product groups.
Good resource management is important in balancing resource consumption in the cloud environment. A threshold policy ensures resource consumption is balanced dynamically for application testing and production. Application testing may have different threshold requirements than those for production. Use capacity planning ahead of time to prepare your system to allocate additional resource instances when workload demands reach the threshold level.
Although IT professionals are used to thinking in abstract terms, a key aspect to approaching threshold policy creation is to remember that a critical component of workload demands is physical. You are depending on reliability rates of physical components, even with the wireless bits.
The threshold policy should state what the threshold level is, such as 75 or 85 percent of the capacity of one or more disks. It should also include mechanisms for logging and monitoring resource consumption.
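As a rough illustration of a disk-capacity threshold, the sketch below checks whether used space has reached a configured fraction of capacity. The function names and the default of 75 percent are assumptions chosen for the example; `shutil.disk_usage` is the Python standard library call that reports total, used, and free bytes for a path.

```python
import shutil


def usage_fraction(used_bytes: int, total_bytes: int) -> float:
    """Fraction of capacity in use, between 0.0 and 1.0."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return used_bytes / total_bytes


def threshold_reached(used_bytes: int, total_bytes: int,
                      threshold: float = 0.75) -> bool:
    """True when usage is at or above the threshold level."""
    return usage_fraction(used_bytes, total_bytes) >= threshold


def disk_threshold_reached(path: str = ".", threshold: float = 0.75) -> bool:
    """Apply the same check to the disk that holds `path`."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free
    return threshold_reached(usage.used, usage.total, threshold)
```

A monitoring job could call `disk_threshold_reached` on a schedule and write the result to the log each time the threshold level is crossed.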
In addition to capacity, the logs should record when the threshold level was reached, the number of resource instances allocated, and the response time in allocating the instances. The logs should also include:
- Statefulness of the application
- Resumption points
- Failover mechanisms
- Cloud service security
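One way to picture such a log entry is as a structured record with one field per item. The sketch below is hypothetical: the class name, field names, and the choice of JSON as the log format are assumptions made for illustration, not part of any provider's logging scheme.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class ThresholdEvent:
    """One log record written when the threshold level is reached (illustrative)."""
    timestamp: str                      # when the threshold level was reached
    capacity_used_percent: float        # measured capacity at that moment
    threshold_percent: float            # threshold level set in the policy
    instances_allocated: int            # number of resource instances created
    allocation_response_seconds: float  # time taken to allocate them
    application_state: str              # statefulness: current application state
    resumption_point: str               # most recent resumption point
    failover_device: str                # device used in failover, if any
    security_note: str                  # security problem observed, if any


def to_log_line(event: ThresholdEvent) -> str:
    """Serialize the event as one JSON line with a stable key order."""
    return json.dumps(asdict(event), sort_keys=True)
```

Keeping every entry in the same shape makes it easier to compare threshold events across application testing and production.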
Statefulness refers to whether one state of the application leads correctly to the next state as the application functions in the cloud environment. For instance, each state should lead to the next one in this overly simplified scenario:
1. The consumer selects a retail item online
2. The retailer places the selected item in the cart basket
3. The consumer provides credit card information
4. The consumer submits the order
5. The retailer validates the credit card information
6. The retailer gives an order number and estimated delivery time
7. The retailer thanks the consumer for the order
8. The consumer gets an order confirmation email
9. The consumer gets an email that the order is being shipped
If the application did not move from the state for step 2 to the state for step 3, what could be the cause of the problem?
- Did the new builds in the application break the logic?
- When the system detected workload demands exceeding the threshold level, was the threshold level set too high so that the remaining resources were insufficient to continue operation?
- If the threshold level was appropriate, were there sufficient additional instances of resources in the cloud needed to ensure that the state for the one step would go to the next step?
The log should show what state the application was in and whether completion of the state was a success.
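The simplified order scenario can be modeled as a sequence of states, with one log entry per state recording whether it completed. This is a minimal sketch: the state names, the `run_order` function, and the handler dictionary are all hypothetical stand-ins for real application logic.

```python
# State names below are illustrative labels for the nine steps of the scenario.
ORDER_STEPS = [
    "item_selected", "item_in_cart", "card_info_provided", "order_submitted",
    "card_validated", "order_number_issued", "consumer_thanked",
    "confirmation_emailed", "shipping_emailed",
]


def run_order(step_handlers, log):
    """Run each step in order and stop at the first one that fails.

    step_handlers -- dict mapping a state name to a callable returning True
                     on success; states without a handler are assumed to succeed
    log           -- list that receives one entry per attempted state
    Returns True only if every state completed.
    """
    for number, state in enumerate(ORDER_STEPS, start=1):
        handler = step_handlers.get(state, lambda: True)
        succeeded = bool(handler())
        log.append({"step": number, "state": state, "success": succeeded})
        if not succeeded:
            return False
    return True
```

If the handler for `card_info_provided` fails, the last log entry shows step 3 with `success` set to `False`, which points the investigation at the transition from step 2 to step 3.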
The system should create a resumption point (of the scheduled, manual, and installation varieties) at different points of time before a problem with the system occurs.
Snapshots of the disks containing resumption points should be backed up on both the disk in the local system and to another disk at a different, remote location. The log should indicate the time when the resumption points were created and what resumption point was used to restore the system.
The system should also be able to initiate failover mechanisms to continue operation availability.
The failover mechanisms should include alternate wired or wireless connections in case, for example, the telecommunication provider accidentally cuts the fiber line or shuts down the wireless network connected to the consumer's physical facility. The log should indicate the type and location of the device used in the failover.
Failover mechanism examples include:
- Load sharing redundant. Two or more systems loaded with no more than 50 percent of the total load. When a device fails, other devices pick up the load with little or no interruption.
- Instance resource redundant. Two or more resource instances loaded with no more than 50 percent of the total load. When a resource instance fails, other resource instances pick up the load.
- Alternate connection retry. If a network interruption lasts more than two minutes, attempt to reconnect to another server via alternate connections.
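The 50 percent figure in the redundancy examples can be checked with simple arithmetic: if every system runs at no more than half of its capacity, the survivors can absorb the load of one failed peer. The sketch below assumes loads are expressed as a fraction of each system's own capacity and that a failed system's load is shared evenly; the function names and the 120-second default are illustrative choices based on the two-minute rule above.

```python
from typing import Dict


def redistribute(loads: Dict[str, float], failed: str) -> Dict[str, float]:
    """Share the failed system's load evenly among the surviving systems."""
    survivors = {name: load for name, load in loads.items() if name != failed}
    if not survivors:
        raise RuntimeError("no surviving system to pick up the load")
    share = loads[failed] / len(survivors)
    return {name: load + share for name, load in survivors.items()}


def can_absorb_any_single_failure(loads: Dict[str, float],
                                  capacity: float = 1.0) -> bool:
    """True if, whichever one system fails, no survivor exceeds its capacity."""
    return all(max(redistribute(loads, name).values()) <= capacity
               for name in loads)


def should_try_alternate(interruption_seconds: float,
                         limit_seconds: float = 120.0) -> bool:
    """Alternate connection retry: switch once the interruption exceeds the limit."""
    return interruption_seconds > limit_seconds
```

With two systems at 50 percent each, a single failure leaves the survivor at exactly full capacity; at 60 percent each, the survivor would be overloaded, which is why the policy caps the load at 50 percent.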
Cloud service security can be threatened by poor credentials, protocol exposure, and implementation flaws in remote management. Reusing IP addresses can lead to an unintentional Denial of Service attack (DoS).
SaaS can be affected intentionally with a virus that results in a DoS. Hackers have used PaaS as well as IaaS platforms as Command and Control centers (CnC) to direct operations of a botnet (robotic network of computers) for use in distributed denial of service (DDoS) attacks and for installing malware in the cloud.
The log should show what type of security problem a cloud service type had and when and how the problem was fixed.
Although your service provider is generally responsible for underlying cloud computing systems, you still have a legal responsibility to ensure that their systems meet your regulatory requirements, that their practices are reasonably secure, that their administrators cannot access your data without authorization and that an SLA is in place.
Make sure that you understand how the SLA works, how the threshold policy would impact the SLA, and what the procedures and expectations are just in case your service provider lets you down.
Important components of the SLA are uptime availability, performance standards, emergency response times, violation remedies, and security.
Find out how the threshold levels may vary from those specified as the performance standards for uptime availability in the SLA. They should not be set at or above the availability standards. Choose uptime availability (97 or 99.9 percent) and then threshold levels that best meet your business needs and budget.
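To see what the two availability figures mean in practice, it helps to convert them into allowed downtime. The short sketch below does that arithmetic; the function name and the 30-day billing period are assumptions made for the example.

```python
def allowed_downtime_hours(availability_percent: float,
                           period_hours: float = 30 * 24) -> float:
    """Hours of downtime permitted in a period at a given uptime availability."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability_percent must be between 0 and 100")
    return period_hours * (100 - availability_percent) / 100


for level in (97, 99.9):
    hours = allowed_downtime_hours(level)
    print(f"{level}% uptime allows about {hours:.2f} hours of downtime in 30 days")
```

Over a 30-day period, 97 percent availability permits roughly 21.6 hours of downtime, while 99.9 percent permits roughly 43 minutes. That gap is worth weighing against your budget when you choose an availability level and the threshold levels that go with it.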
In the event of an SLA violation, remedies should be provided. For instance, your service provider should issue a free credit or a refund if it misses an SLA (for example, through slow responses in creating additional instances of resources in the cloud). If the provider misses the SLA several times over three months, it should allow you to terminate your service. Make sure the termination clause is included in the SLA and read it very carefully.
Does the SLA say who and where the authoritative source would be if you and the provider disagree on the length of an outage? You need to know how long you should wait after an event to file a claim. Review if your insurance policy addresses things that are not covered in an SLA including lost revenue, damaged reputation, or a data breach.
Setting up a threshold policy to dynamically balance workload demands requires planning ahead to resolve the issues of creating additional instances of resources in the cloud environment. Developers should communicate with both the cloud service consumer and provider on the issues of economies of scale (public versus private clouds) and on developing a threshold policy for application testing and production. Use capacity planning ahead of time to prepare your system to allocate additional resource instances when workload demands reach the threshold level.
The author discusses threshold policy in the article "Cloud computing versus grid computing: Service types, similarities and differences, and things to consider."
Judith M. Myerson is a systems architect and engineer. Her areas of interest include enterprise-wide systems, middleware technologies, database technologies, cloud computing, threshold policies, industries, network management, security, RFID technologies, presentation management, and project management.