Find provisioning performance bottlenecks in the cloud

Test and analyze provisioning performance of your cloud

Cloud provisioning is the process of deploying and managing IT resources on cloud infrastructures. Rapid provisioning is a key performance requirement for cloud services, especially when a large number of customers request resources at the same time. However, it is difficult to determine which factor, or combination of factors, causes poor provisioning performance, because no existing tools or methods trace each status change and execution step in provisioning. The author details a provisioning performance test method you can use to determine where your provisioning performance may be lagging.


Xiang Wen Chen, Performance Engineer, IBM

Xiang Wen Chen is a performance engineer in IBM China Developer Lab with more than two years of experience with cloud-related projects and several papers about cloud computing for IBM internal publications and community sites.



07 September 2012


This article describes a provisioning performance test method you can use as a tool to determine where your cloud computing provisioning performance might be lagging. The purpose of this provisioning performance test is to:

  1. Provide a measurement of the total end-to-end provision time from a user's perspective.
  2. Determine the provision time trends when there are multiple provisions at the same time.
  3. Break down the entire provision time into segments in order to figure out which components and which steps cost the most in performance overhead.
  4. Get the queuing information at the component level when there are many provisioning requests in the system, to help find the bottleneck.

Let's examine some cloud provisioning basics.

Cloud provisioning basics

Cloud provisioning is the process of deploying and managing IT resources on cloud infrastructures. It consists of three types of provisioning:

  • Virtual machine provisioning: Includes instantiation of one or more virtual machines that match the hardware and software requirements of an application. Most cloud providers offer a set of VM templates with generic software and hardware configurations.
  • Resource provisioning: The mapping and scheduling of VMs onto physical servers within the cloud.
  • Application provisioning: The deployment of specialized applications within VMs and mapping of end user's requests to application instances.

Customers can also provision their own assets, such as a virtual server instance, an image, or a persistent storage unit through a web portal and API.

This article focuses on virtual machine provisioning and resource provisioning, two of the major functions within the cloud. Let's look at provisioning performance challenges.


Provisioning performance problems can be elusive

The provisioning process is complicated by the unpredictable nature of virtualized IT resources and network elements. Customers often encounter provisioning performance issues but find it hard to determine which factor, or combination of factors, is the cause.

Some of the challenges cloud customers experience:

  • Different cloud providers use different provisioning engines. Users must have some knowledge of the provisioning engine they use before they can communicate effectively with the cloud provider about performance issues, not to mention determine the root cause.
  • At runtime, there may be unanticipated interoperability performance issues that prevent smooth provisioning. While some middleware components can be performance tested before they are integrated into the system, some performance problems may surface only after they interact with other middleware or when they require a specific configuration in order to satisfy a business need.
  • In a large computing environment such as a data center, the availability, load, and throughput of IT resources and the network can also impact performance.
  • Provision workflows can become complex and introduce performance problems. For example, a general-purpose provisioning engine provides services that provision particular middleware components such as databases or application servers. The underlying provisioning operation is implemented by a specific engine script or component. A large number of operations consist of provisioning services that can be integrated into different provisioning workflows.

Let's look at the provision performance test methods and analysis.


The provisioning test method and analysis process

The team I worked with developed a test method that first describes a set of states and operations so that the whole provisioning process can be defined independently from the specific provisioning engines used. Figure 1 describes these states.

Figure 1. Defining the provisioning process via states and operations instead of engines used

During the Submission Period, the user submits a request and gets the response. If the invocation of the provisioning workflow is successful, the component's status changes from "accept" to "submitted". The status of the provisioning request changes to "New".

During the Resource Reservation Period, if the reservation of all needed components is not possible, provisioning of the complete solution is not possible and the provisioning flow will abort. To reserve a component, the component's status transitions from "Not Available" to "Reserving". If the reservation is successful, a reserved message is returned and the component's status changes to "Reserved".

During the Provision Period, as soon as the provisioning engine has performed all necessary steps to set up the component, a provisioned message is sent and the component transitions to the "Provisioned" state. Once a component is provisioned, all its properties are available via its running customization log.

In this article, I conduct two kinds of tests — a baseline test and a load test.

Baseline measurements are necessary for a valid test. I recommend testing a small number of images of different types and sizes. In the baseline test, the time stamp of each status change is recorded and the total provision time is broken down into the three periods. By calculating the period durations at the request level, you get an initial idea of which part takes the longest. The virtual machine's compute capability and operating system version are also collected to detect provisioning issues that are specific to certain image configurations.
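The request-level breakdown can be sketched in a few lines of Python. This is a minimal illustration, not the team's actual script: the status names come from the period descriptions above, but the time stamps are invented, and treating "submitted", "Reserved", and "Provisioned" as the period boundaries is a simplifying assumption.

```python
from datetime import datetime

# Hypothetical status-change events for one request: (ISO time stamp, status).
# Status names follow the article; the time stamps are invented.
EVENTS = [
    ("2012-09-07T10:00:00", "accept"),
    ("2012-09-07T10:00:04", "submitted"),
    ("2012-09-07T10:00:05", "Reserving"),
    ("2012-09-07T10:01:35", "Reserved"),
    ("2012-09-07T10:09:35", "Provisioned"),
]

def period_durations(events):
    """Return the seconds spent in each of the three periods."""
    at = {status: datetime.fromisoformat(ts) for ts, status in events}
    return {
        # Assumed boundaries: each period ends when its final status is reached.
        "submission": (at["submitted"] - at["accept"]).total_seconds(),
        "reservation": (at["Reserved"] - at["submitted"]).total_seconds(),
        "provision": (at["Provisioned"] - at["Reserved"]).total_seconds(),
    }

durations = period_durations(EVENTS)
longest = max(durations, key=durations.get)  # period that took the most time
print(durations, longest)
```

Running the same calculation for each baseline image gives a per-image table of period durations that can be compared across image types and sizes.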

The load test simulates multiple simultaneous provisions, which may cause longer provision times. When running the load test, you monitor component-level activities in order to find system bottlenecks. The time stamp for each component status change is recorded. Provisioning the same kind of image in every request allows a clear comparison and shows a time trend. Watching how the provision duration changes as concurrent provisions increase helps you monitor the transaction behavior at a component level.
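One simple way to see the time trend is to group the measured provision times by the number of concurrent provisions and compare the averages. The sketch below assumes the load test has already produced (concurrency, seconds) pairs; the sample numbers are invented.

```python
from statistics import mean

# Hypothetical load-test results:
# (number of concurrent provisions, total provision time in seconds).
RUNS = [(1, 600), (1, 620), (5, 700), (5, 760), (10, 1100), (10, 1300)]

def trend_by_concurrency(runs):
    """Return {concurrency level: mean provision time}, ordered by level."""
    grouped = {}
    for level, seconds in runs:
        grouped.setdefault(level, []).append(seconds)
    return {level: mean(times) for level, times in sorted(grouped.items())}

trend = trend_by_concurrency(RUNS)
for level, avg in trend.items():
    print(f"{level:>3} concurrent: {avg:.0f} s on average")
```

Because every request provisions the same image, any growth in the average can be attributed to load rather than to image differences.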

The team developed the test script based on users' end-to-end workflow. The script sends a provision request and traces the provision status all the way through to determine whether the provision is successful, failed, or timed out. Performance testing tools, such as Rational® Performance Tester, can be used to trigger the provisioning workload and capture user-side data such as the image configuration and the provision periods at the request level. Most provisioning and management engines and tools offer a native approach to the manipulation of resources and components.
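The tracing part of such a script is essentially a polling loop. The following is a minimal sketch, not the team's script: get_status is a hypothetical callable that wraps whatever portal or API call returns the current status of the request, and the "Failed" status name, the timeout, and the polling interval are all assumptions.

```python
import time

def trace_provision(get_status, timeout_s=3600.0, poll_s=10.0,
                    clock=time.monotonic, sleep=time.sleep):
    """Poll a request until it is provisioned, fails, or times out.

    get_status is a hypothetical callable returning the current status string.
    Returns (outcome, [(elapsed_seconds, status), ...]) with one history
    entry per observed status change.
    """
    start = clock()
    history = []
    while True:
        elapsed = clock() - start
        status = get_status()
        if not history or history[-1][1] != status:
            history.append((elapsed, status))  # record only status changes
        if status == "Provisioned":
            return "successful", history
        if status == "Failed":  # assumed name for a failure status
            return "failed", history
        if elapsed >= timeout_s:
            return "timed out", history
        sleep(poll_s)

# Usage with a canned status sequence and a fake clock, so nothing really sleeps.
_statuses = iter(["New", "Reserving", "Reserving", "Reserved", "Provisioned"])
_now = [0.0]
outcome, history = trace_provision(
    lambda: next(_statuses),
    clock=lambda: _now[0],
    sleep=lambda s: _now.__setitem__(0, _now[0] + s),
)
print(outcome, history)
```

The clock and sleep parameters exist only so the loop can be exercised without waiting; in a real run the defaults are used. Note that the time stamps are only as precise as the polling interval.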

As a result, when using such tools, the client can record which operation has been invoked on which engine, service, and endpoint. The provision period at the component level is calculated from this data. Python and VBScript scripts are used for log parsing and report generation.
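As an illustration of the log-parsing step, here is a short Python sketch. The log line layout is entirely hypothetical (real engine logs differ and need their own pattern); the point is only the shape of the job: extract time-stamped status changes per request and component, then compute durations.

```python
import re
from datetime import datetime

# Hypothetical log line layout; adapt the pattern to the real engine log.
LINE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) "
    r"request=(?P<request>\S+) component=(?P<component>\S+) status=(?P<status>.+)$"
)

def parse_log(lines):
    """Return {(request, component): [(datetime, status), ...]} in log order."""
    changes = {}
    for line in lines:
        match = LINE.match(line.strip())
        if match is None:
            continue  # skip lines that are not status changes
        ts = datetime.strptime(match["ts"], "%Y-%m-%d %H:%M:%S")
        key = (match["request"], match["component"])
        changes.setdefault(key, []).append((ts, match["status"]))
    return changes

def component_durations(changes):
    """Seconds between the first and last status change for each key."""
    return {key: (events[-1][0] - events[0][0]).total_seconds()
            for key, events in changes.items()}

LOG = [
    "2012-09-07 10:00:05 request=r1 component=reservation status=Reserving",
    "some unrelated debug output",
    "2012-09-07 10:01:35 request=r1 component=reservation status=Reserved",
]
component_seconds = component_durations(parse_log(LOG))
print(component_seconds)
```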


Monitoring the behavior of workflow components

The next section introduces use cases in which SmartCloud is used to test cloud provisioning performance. But first, look at Figure 2, which shows the key components in the provision workflow and the interfaces used to monitor these components' behavior.

Figure 2. Key components of the provisioning workflow

The provisioning Submission Period mainly occurs on WebSphere® Application Server. The team used logs and request metrics to track individual transactions, recording the processing time in each of the major WebSphere Application Server components.

You can use the web console, the API, and database queries to gather response times for the provisioning project and the reservation workflow. The IBM Tivoli® Service Automation Manager/IBM Tivoli Provisioning Manager (TSAM/TPM) server allows users to request, create, delete, modify, and manage virtual resources. The resource reservation occurs on the TSAM/TPM server.

The provisioning period occurs on the Tivoli Provisioning Manager Deployment Engine. Tivoli Provisioning Manager collects "debug" information, which includes the time at which each workflow calls another workflow. The workflow status screen in Tivoli Provisioning Manager lets you access this runtime profile/debug data, which provides details about your workflows such as how they flow and where they spend their time. If you have a bottleneck in your application, you can analyze this data to understand which components contain the specific bottlenecks. Additionally, you can set up an IBM Tivoli Monitoring server to control and configure the resources, availability, and performance.


Results from the use cases

Let's look at the results of using this test method on three use cases:

  • Single VM provisioning
  • Multiple provisioning requests submitted on the WebSphere Application Server
  • Multiple provisions causing an increase in the total provision time

Single VM provisioning

Problem: Certain kinds of image provisioning can take a long time during the provision period in our baseline test.

Details: The provision period mainly includes processing time in the Tivoli Provisioning Manager Deployment Engine and hypervisor. You can utilize the Tivoli Provisioning Manager workflow log to see each processing step.

Steps:

  1. Get the request ID for your provisioning request using the API.
  2. Get the service request that was created by your request on the Tivoli Service Automation Manager viewer JINSIGHT.
  3. Find the step in the workflow that took the highest percentage of time (Figure 3).

Figure 3. Visualize the deployment workflow using JINSIGHT

In Figure 3, you can see that in this workflow the copy clone image step cost 24.7 percent of the time and the configure disks step cost another 22.5 percent. The most expensive sub-step in configure disks is check boot status.
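If you export the step timings rather than reading them off the viewer, the same ranking can be computed directly. In this sketch the step names echo Figure 3, but the durations and the nested dictionary layout are invented for illustration, so the percentages it prints are not the ones in the figure.

```python
# Hypothetical workflow timings in seconds: {step: {sub-step: seconds}}.
# Step names echo Figure 3; the numbers are invented.
WORKFLOW = {
    "copy clone image": {"copy": 120.0},
    "configure disks": {"attach disk": 30.0, "check boot status": 80.0},
    "configure network": {"assign address": 40.0},
}

def step_shares(workflow):
    """Return {step: percent of total workflow time}, most expensive first."""
    totals = {step: sum(subs.values()) for step, subs in workflow.items()}
    grand_total = sum(totals.values())
    shares = {step: 100.0 * t / grand_total for step, t in totals.items()}
    return dict(sorted(shares.items(), key=lambda kv: kv[1], reverse=True))

def costliest_substep(workflow, step):
    """Return the name of the slowest sub-step within one step."""
    subs = workflow[step]
    return max(subs, key=subs.get)

shares = step_shares(WORKFLOW)
for step, pct in shares.items():
    print(f"{step}: {pct:.1f}% (slowest sub-step: {costliest_substep(WORKFLOW, step)})")
```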

Multiple provisioning requests submitted on the WebSphere Application Server

Problem: When you submit multiple provisions at the same time, the submission response time increases dramatically.

Details: The submission period mainly includes processing time on the WebSphere Application Server. You can collect and parse the trace log to get each processing step.

Steps:

  1. Set the related classes to the detail log level.
  2. Get the submission period duration for each provision.
  3. Record the time stamp for the first submission and the last submission. Collect the trace log between these two time stamps.
  4. Parse the records for the start and end time of each processing step in each provision-handling process.
  5. Find the more expensive steps and draw charts that reflect the response time trend. Compare these charts to find which steps cause the total submission time to increase (Figure 4).

Figure 4. Total submission response time

After parsing the trace log for these 13 requests, you'll find the most expensive step is making the OSS call, which needs to interact with the Tivoli Service Automation Manager server. So I've plotted this step against the total:

  • The red line shows the total submission response time for each provision. It increases dramatically after 7 submissions.
  • The blue line shows the response time of the OSS call. It is quite stable, which means it did not contribute to the increase in total response time.

Next, find the second most-expensive step and draw the time trend chart to compare the two.
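The comparison of trend lines can also be done numerically. The sketch below uses a deliberately crude measure, the growth from the first to the last request, and asks how much of the total's growth each step accounts for. All numbers are invented, and "validate request" is a hypothetical step name standing in for whichever step turns out to be second most expensive.

```python
# Hypothetical per-request timings in seconds, in submission order.
TOTAL = [5.0, 5.2, 5.1, 9.0, 14.0]
STEPS = {
    "OSS call": [3.0, 3.1, 3.0, 3.1, 3.0],
    "validate request": [1.0, 1.1, 1.1, 4.9, 10.0],  # hypothetical step name
}

def growth_share(total, steps):
    """Fraction of the total's first-to-last growth contributed by each step."""
    total_growth = total[-1] - total[0]
    if total_growth <= 0:
        return {name: 0.0 for name in steps}  # nothing to explain
    return {name: (series[-1] - series[0]) / total_growth
            for name, series in steps.items()}

growth = growth_share(TOTAL, STEPS)
for name, share in growth.items():
    print(f"{name}: {share:.0%} of the increase")
```

A stable step such as the OSS call shows a share near zero, matching what the flat blue line in Figure 4 tells you visually.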

Multiple provisions cause an increase in the total provision time

Problem: When there are multiple provisions in the system, the total provision time increases. You want to know which period contributes the most and which component is the bottleneck.

Details: You can use the provision load test report, which displays the variation in duration of the three periods, to find out which period costs the most. Use the chart entitled "Provision a virtual machine component level workflow" to locate the suspicious component. Use existing monitoring facilities such as the Tivoli Service Automation Manager console, the Tivoli Provisioning Manager console, and the Tivoli Provisioning Manager database to get the waiting time and service time for each provision. Build a queue model to find the bottleneck.

Steps:

  1. Start the multiple provisions using benchmark test tools.
  2. Get the Tivoli Service Automation Manager back end IDs for these provisions from the portal database.
  3. Pass the back end IDs to the SQL scripts that query the Tivoli Provisioning Manager database, and run them periodically to gather information.
  4. Calculate the real-time request service number, workflow number, average service time, and waiting time for each component based on data from Tivoli Provisioning Manager database.
  5. Build up a queue model along the process to uncover the bottleneck.
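Steps 4 and 5 can be sketched as follows. This is a minimal illustration under stated assumptions: the rows stand in for whatever the SQL scripts return (the tuple layout and all numbers are invented), and the average queue length is estimated with Little's law, that is, the arrival rate multiplied by the mean waiting time.

```python
from statistics import mean

# Hypothetical query results:
# (component, arrival, service start, service end), all in seconds.
ROWS = [
    ("reservation", 0, 0, 30),
    ("reservation", 0, 30, 60),
    ("reservation", 0, 60, 90),
    ("deployment", 30, 30, 130),
    ("deployment", 60, 60, 160),
    ("deployment", 90, 90, 190),
]

def queue_stats(rows):
    """Return per-component average wait, service time, and queue length."""
    by_component = {}
    for component, arrival, start, end in rows:
        by_component.setdefault(component, []).append((arrival, start, end))
    stats = {}
    for component, reqs in by_component.items():
        # Observation window: first arrival to last completion.
        window = max(e for _, _, e in reqs) - min(a for a, _, _ in reqs)
        waits = [s - a for a, s, _ in reqs]
        services = [e - s for _, s, e in reqs]
        stats[component] = {
            "avg_wait": mean(waits),
            "avg_service": mean(services),
            # Little's law: mean number waiting = arrival rate * mean wait.
            "avg_queue": (len(reqs) / window) * mean(waits),
        }
    return stats

stats = queue_stats(ROWS)
bottleneck = max(stats, key=lambda c: stats[c]["avg_queue"])
print(stats, bottleneck)
```

The component with the longest average queue is the first candidate for the bottleneck; a long service time with no queue points instead to a slow but uncontended step.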

In Figure 5, you can see that the 20 provisions sent at the same time show a large deviation in response time even though they request the same image type and size.

Figure 5. Total provision time

In Figure 6, you see that the response time for the reservation period increased dramatically, which causes the total provision time to increase.

Figure 6. Three-provision response time

The submission period and provision period have a stable response time. In that case, you continue by mapping the second period to the component-level workflow to find the bottleneck.

The second period mainly includes processing time on the TSAM/TPM server. You can use the Tivoli Service Automation Manager console to look up the service requests and the Tivoli Provisioning Manager console to look up the deployment workflows. For multiple provisions, using SQL scripts to directly access the Tivoli Provisioning Manager database is the most convenient solution.

You can draw a flow chart (Figure 7) that shows the queue size at each component, based on the numbers you calculate from the raw data in the database.

Figure 7. Flow chart shows each component's relative queue size

In the flow chart, you see that the reservation component may have more waiting requests. The Tivoli Provisioning Manager Deployment Engine can only provision five virtual machines concurrently in order to ensure the performance on the hypervisor. Even if you manage to improve the capacity on TSAM/TPM, only five provisions can flow to the hypervisor; more provisions will be blocked in their second period.
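The effect of the five-provision limit is easy to reproduce with a small simulation. The limit of five and the batch of 20 simultaneous requests come from the test described above; the fixed 600-second service time is an invented figure, and real provisions would of course vary.

```python
import heapq

def simulate_slots(n_requests, slots, service_s):
    """Start times for requests that all arrive at t=0 and share fixed slots."""
    free_at = [0.0] * slots  # time at which each slot next becomes free
    heapq.heapify(free_at)
    starts = []
    for _ in range(n_requests):
        start = heapq.heappop(free_at)  # earliest available slot
        starts.append(start)
        heapq.heappush(free_at, start + service_s)
    return starts

# 20 simultaneous provisions, 5 concurrent slots, assumed 600 s per provision.
starts = simulate_slots(n_requests=20, slots=5, service_s=600.0)
print(sorted(set(starts)))
```

Only the first five requests start immediately; the rest wait in waves, which is consistent with identical requests showing very different total provision times.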


In conclusion

The testing method that my team came up with is based on the general provision framework and can be combined with different cloud providers' existing monitoring solutions to help users find provisioning performance issues without relying on details provided by provisioning engines.

Given the mapping relationship between the provision workflows and components, the user can tell the provider which component may have performance problems.

Using native monitoring tools and interfaces, a cloud provider can quickly locate which provision steps, on which endpoint and in which provision service, impact the whole provision process.

There are still additional improvements needed in this provision test method:

  • More performance metrics and data should be collected and organized by time, location, and relationship automatically; this would give the user a clearer picture of the issues.
  • More native monitoring tools and methods for different cloud provision engines should be integrated. The mapping between the front-end user data and the back-end component data should be created automatically once the specific provision engines are set.
  • More provision activities should be covered by this test, such as dynamic provisioning, de-provisioning, and application provisioning.
