This article describes a provisioning performance test method you can use as a tool to determine where your cloud computing provisioning performance might be lagging. The purpose of this provisioning performance test is to:
- Measure the total end-to-end provision time from a user's perspective.
- Determine the provision time trends when there are multiple provisions at the same time.
- Break down the entire provision time into segments in order to figure out which components and which steps cost the most in performance overhead.
- Get the queuing information at the component level when there are many provisioning requests in the system, to help find the bottleneck.
Let's examine some cloud provisioning basics.
Cloud provisioning basics
Cloud provisioning is the process of deploying and managing IT resources on cloud infrastructures. It consists of three types of provisioning:
- Virtual machine provisioning: Includes instantiation of one or more virtual machines that match the hardware and software requirements of an application. Most cloud providers offer a set of VM templates with generic software and hardware configurations.
- Resource provisioning: The mapping and scheduling of VMs onto physical servers within the cloud.
- Application provisioning: The deployment of specialized applications within VMs and mapping of end user's requests to application instances.
Customers can also provision their own assets, such as a virtual server instance, an image, or a persistent storage unit through a web portal and API.
This article focuses on virtual machine provisioning and resource provisioning, two of the major functions within the cloud. Let's look at provisioning performance challenges.
Provisioning performance problems can be elusive
The provisioning process is complicated, made so by the unpredictable nature of virtualized IT resources and network elements. Customers often encounter an issue with provisioning performance, but find it hard to determine which factor or combination of factors is the cause.
Some of the challenges cloud customers experience:
- Different cloud providers use different provisioning engines. Users must have some knowledge of the provisioning engine they use before they can communicate effectively with the cloud provider about performance issues, not to mention determine the root cause.
- At runtime, there may be unanticipated interoperability performance issues that prevent smooth provisioning. While some middleware components can be performance tested before they are integrated into the system, some performance problems may surface only after they interact with other middleware or when they require a specific configuration in order to satisfy a business need.
- In a large computing environment such as a data center, the availability, load, and throughput of IT resources and the network can also impact performance.
- Provision workflows can become complex and introduce performance problems. For example, a general-purpose provisioning engine provides services that allow it to provision particular middleware components such as databases or application servers. The actual underlying provisioning operation is implemented by a specific engine script or component. Many operations consist of provisioning services that can be integrated into different provisioning workflows.
Let's look at the provision performance test methods and analysis.
The provisioning test method and analysis process
The team I worked with developed a test method that first describes a set of states and operations so that the whole provisioning process can be defined independently from the specific provisioning engines used. Figure 1 describes these states.
Figure 1. Defining the provisioning process via states and operations instead of engines used
During the Submission Period, the user submits a request and gets the response. If the invocation of the provisioning workflow is successful, the component's status changes from "accept" to "submitted". The status of the provisioning request changes to "New".
During the Resource Reservation Period, if the reservation of all needed components is not possible, provisioning of the complete solution is not possible and the provisioning flow will abort. To reserve a component, the component's status transitions from "Not Available" to "Reserving". If the reservation is successful, a reserved message is returned and the component's status changes to "Reserved".
During the Provision Period, as soon as the provisioning engine has performed all necessary steps to set up the component, a provisioned message is sent and the component transitions to state of "Provisioned". Once a component is provisioned, all its properties are available via its running customization log.
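The three periods can be expressed as a small, engine-independent transition table. The following is a minimal sketch, not the team's actual implementation: the status names come from the description above, but the assumption that a component moves directly from "Reserved" to "Provisioned" is mine, as are the names `TRANSITIONS` and `period_for`.

```python
# Component status transitions grouped by period, as described in the text.
# Assumption: "Reserved" -> "Provisioned" is a direct transition.
TRANSITIONS = {
    ("accept", "submitted"): "submission",
    ("Not Available", "Reserving"): "reservation",
    ("Reserving", "Reserved"): "reservation",
    ("Reserved", "Provisioned"): "provision",
}


def period_for(old_status, new_status):
    """Return the period a status change belongs to, or raise ValueError."""
    try:
        return TRANSITIONS[(old_status, new_status)]
    except KeyError:
        raise ValueError(
            f"unexpected transition {old_status!r} -> {new_status!r}"
        ) from None
```

Because the table only names states and periods, the same check can be applied to status changes reported by any provisioning engine.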
In this article, I conduct two kinds of tests — a baseline test and a load test.
Baseline measurements are necessary for a valid test. I recommend testing a small number of images of different types and sizes. In the baseline test, the total provision time is broken down into the three periods, and a time stamp is recorded for each status change. By calculating the duration of each period at the request level, you get an initial idea of which part takes the longest. The virtual machine's compute capability and operating system version are also collected to detect provision issues specific to certain image configurations.
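As an illustration of the request-level calculation, here is a minimal sketch that turns four time stamps into the three period durations. The milestone names (`requested`, `submitted`, `reserved`, `provisioned`) and the sample values are invented for the example; they are not measurements from the tests described in this article.

```python
from datetime import datetime


def period_durations(events):
    """events maps a milestone name to an ISO 8601 time stamp string."""
    t = {name: datetime.fromisoformat(ts) for name, ts in events.items()}
    return {
        "submission": (t["submitted"] - t["requested"]).total_seconds(),
        "reservation": (t["reserved"] - t["submitted"]).total_seconds(),
        "provision": (t["provisioned"] - t["reserved"]).total_seconds(),
    }


# Made-up sample data for one provisioning request.
sample = {
    "requested": "2013-01-01T10:00:00",
    "submitted": "2013-01-01T10:00:05",
    "reserved": "2013-01-01T10:02:05",
    "provisioned": "2013-01-01T10:12:05",
}
durations = period_durations(sample)
longest = max(durations, key=durations.get)  # period that took the longest
```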
The load test simulates multiple concurrent provisions, which may cause longer provision times. When running the load test, you monitor component-level activities in order to find system bottlenecks. The time stamp for each component status change is recorded. Using the same kind of image for every provision makes the comparison clear and shows you a time trend. Watching how the provision duration changes as the number of concurrent provisions increases helps you monitor the transaction behavior at the component level.
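One simple way to see the time trend is to average the total provision time at each concurrency level. This is a sketch under the assumption that each load test run yields a pair of (number of concurrent provisions, total provision time in seconds); the function name is hypothetical.

```python
from statistics import mean


def trend_by_concurrency(runs):
    """runs: iterable of (concurrent_provisions, total_seconds) pairs."""
    grouped = {}
    for level, seconds in runs:
        grouped.setdefault(level, []).append(seconds)
    # Average provision time per concurrency level, in ascending order.
    return {level: mean(values) for level, values in sorted(grouped.items())}
```

Plotting the returned averages against the concurrency level gives the kind of trend chart discussed in the use cases later in this article.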
The team developed the test script based on users' end-to-end workflow. The script sends a provision request and traces the provision status all the way through to determine whether the provision is successful, failed, or timed out. Performance testing tools, such as Rational® Performance Tester, can be used to trigger the provisioning workload and capture such user-side data as image configuration and provision periods at the request level. Most provisioning and management engines and tools offer a native approach to the manipulation of resources and components.
As a result, when using such tools, the client can record which operation on which engine and which service or endpoint has been invoked. The provision period for each component is calculated from this data. Python and VBScript scripts are used for log parsing and report generation.
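The log parsing step might look like the following sketch. The log line format here is purely hypothetical (a time stamp followed by `request=`, `component=`, and `status=` fields); real engine logs differ, so the regular expression would need to be adapted. It is not the team's actual script.

```python
import re
from datetime import datetime

# Hypothetical line format:
# 2013-01-01 10:00:00 request=r1 component=reservation status=Reserving
LINE = re.compile(
    r"(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) "
    r"request=(?P<req>\S+) component=(?P<comp>\S+) status=(?P<status>.+)"
)


def component_durations(lines):
    """Seconds between the first and last status change per (request, component)."""
    first, last = {}, {}
    for line in lines:
        m = LINE.match(line.strip())
        if not m:
            continue  # skip lines that are not status changes
        ts = datetime.strptime(m["ts"], "%Y-%m-%d %H:%M:%S")
        key = (m["req"], m["comp"])
        first.setdefault(key, ts)  # keep the earliest time stamp seen
        last[key] = ts             # overwrite with the latest one
    return {key: (last[key] - first[key]).total_seconds() for key in first}
```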
Monitoring the behavior of workflow components
The next section introduces some use cases where SmartCloud is used to test the cloud provisioning performance. But first, look at Figure 2, which shows the key components in the provision workflow and the interfaces used to monitor their behavior.
Figure 2. Key components of the provisioning workflow
The provisioning Submission Period mainly occurs on WebSphere® Application Server. The team used logs and request metrics to track individual transactions, recording the processing time in each of the major WebSphere Application Server components.
You can use the web console, API, and database queries to gather response times for the provisioning project and reservation workflow. The IBM Tivoli® Service Automation Manager/IBM Tivoli Provisioning Manager (TSAM/TPM) server allows users to request, create, delete, modify, and manage virtual resources. Resource reservation occurs on the TSAM/TPM server.
The provisioning period occurs on the Tivoli Provisioning Manager Deployment Engine. Tivoli Provisioning Manager collects "debug" information, which includes the time at which each workflow calls another workflow. The workflow status screen in Tivoli Provisioning Manager lets you access this runtime profile/debug data, which provides details about your workflows such as how they flow and where they spend their time. If you have a bottleneck in your application, you can analyze this data to understand which components contain the specific bottlenecks. Additionally, you can set up an IBM Tivoli Monitoring server to control and configure the resources, availability, and performance.
Results from the use cases
Let's look at the results of using this test method on three use cases:
- Single VM provisioning
- Multiple provisioning requests submitted on the WebSphere Application Server
- Multiple provisions causing an increase in the total provision time
Single VM provisioning
Problem: Certain kinds of image provisioning can take a long time during the provision period in our baseline test.
Details: The provision period mainly includes processing time in the Tivoli Provisioning Manager Deployment Engine and hypervisor. You can utilize the Tivoli Provisioning Manager workflow log to see each processing step.
- Get the request ID for your provisioning request using the API.
- Get the service request that was created by your request on the Tivoli Service Automation Manager viewer JINSIGHT.
- Find the step in the workflow that took the highest percentage of time (Figure 3).
Figure 3. Visualize the deployment workflow using JINSIGHT
In Figure 3, you can see that in this workflow the copy clone image step accounts for 24.7 percent of the time, and configure disks accounts for another 22.5 percent. The most expensive step within configure disks is check boot status.
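If you export the step durations instead of reading them from the viewer, the percentage breakdown is straightforward to compute. This is a minimal sketch assuming you already have a mapping of step name to elapsed seconds; the function name is hypothetical.

```python
def step_shares(step_seconds):
    """Return (step, percent of total time) pairs, most expensive first."""
    total = sum(step_seconds.values())
    shares = {name: 100.0 * secs / total for name, secs in step_seconds.items()}
    return sorted(shares.items(), key=lambda item: item[1], reverse=True)
```

The first entry of the returned list is the step to investigate first.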
Multiple provisioning requests submitted on the WebSphere Application Server
Problem: When you submit multiple provisions at the same time, the submission response time increases dramatically.
Details: The submission period mainly includes processing time on the WebSphere Application Server. You can collect and parse the trace log to get each processing step.
- Set the related classes to the detail log level.
- Get the submission period duration for each provision.
- Record the time stamp for the first submission and the last submission. Collect the trace log between these two time stamps.
- Parse the records for the start and end time of each processing step in each provision-handling process.
- Find the more expensive steps and draw charts that reflect the response time trend. Compare these charts to find which step causes the total submission time to increase (Figure 4).
Figure 4. Total submission response time
After parsing the trace log for these 13 requests, you'll find the most expensive step is making the OSS call that needs to interact with the Tivoli Service Automation Manager server. So I've drawn a line for this step:
- The red line shows the total submission response time for each provision. It increases dramatically after 7 submissions.
- The blue line shows the response time of the OSS call. It is quite stable, which means it did not contribute to the increase in total response time.
Next, find the second most-expensive step and draw the time trend chart to compare the two.
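Comparing trend charts by eye can be complemented with a simple numeric check: fit a straight line to each step's response times across the successive submissions and see which step grows fastest. The sketch below uses a plain least-squares slope and assumes at least two measurements per step; the function names are hypothetical.

```python
def slope(values):
    """Least-squares slope of values plotted against their index 0..n-1."""
    n = len(values)
    x_mean = (n - 1) / 2
    y_mean = sum(values) / n
    num = sum((i - x_mean) * (v - y_mean) for i, v in enumerate(values))
    den = sum((i - x_mean) ** 2 for i in range(n))
    return num / den


def fastest_growing_step(steps):
    """steps maps a step name to its response times, one per submission."""
    return max(steps, key=lambda name: slope(steps[name]))
```

A step with a slope near zero, like the stable OSS call above, can be ruled out as the cause of the increase even if it is expensive in absolute terms.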
Multiple provisions causing an increase in the total provision time
Problem: When there are multiple provisions in the system, the total provision time increases. You want to know which period contributes the most and which component is the bottleneck.
Details: You can use the provision load test report, which displays the duration variation of the three periods, to find out which period costs the most. Use the chart entitled "Provision a virtual machine component level workflow" to locate the suspicious component. Use existing monitoring facilities such as the Tivoli Service Automation Manager console, Tivoli Provisioning Manager console, and Tivoli Provisioning Manager DB to get the waiting time and service time for each provision. Build a queue model to find the bottleneck.
- Start the multiple provisions using benchmark test tools.
- Get the Tivoli Service Automation Manager back end IDs for these provisions from the portal database.
- Pass the back end IDs to the SQL scripts that query the Tivoli Provisioning Manager database, and run them periodically to gather information.
- Calculate the real-time request service number, workflow number, average service time, and waiting time for each component based on data from Tivoli Provisioning Manager database.
- Build up a queue model along the process to uncover the bottleneck.
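The calculation in the last two steps can be sketched as follows. This is an illustration only: it assumes the database queries give you, for every request and component, the time the request arrived at the component, the time service started, and the time service finished. The `Visit` record and function names are hypothetical and do not reflect the Tivoli Provisioning Manager schema.

```python
from collections import namedtuple

# One request passing through one component; times in seconds since test start.
Visit = namedtuple("Visit", "component enqueued started finished")


def component_stats(visits):
    """Average waiting time and service time per component."""
    totals = {}
    for v in visits:
        t = totals.setdefault(v.component, {"n": 0, "wait": 0.0, "service": 0.0})
        t["n"] += 1
        t["wait"] += v.started - v.enqueued
        t["service"] += v.finished - v.started
    return {
        comp: {"avg_wait": t["wait"] / t["n"], "avg_service": t["service"] / t["n"]}
        for comp, t in totals.items()
    }


def bottleneck(visits):
    """Component where requests wait longest on average."""
    stats = component_stats(visits)
    return max(stats, key=lambda comp: stats[comp]["avg_wait"])
```

A component with a long average wait but a short service time is queuing requests rather than working on them, which is the signature of a bottleneck downstream or of a concurrency limit.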
In Figure 5, you can see that the 20 provisions sent at the same time have a large response time deviation even though they request the same image type and size.
Figure 5. Total provision time
In Figure 6, you see that the response time for the reservation period increased dramatically, which causes the total provision time to increase.
Figure 6. Three-provision response time
The submission period and provision period have stable response times. In that case, you continue by mapping the second period to the component-level workflow to find the bottleneck.
The second period mainly includes processing time on TSAM/TPM server. You can use the Tivoli Service Automation Manager console to look up the service requests and the Tivoli Provisioning Manager console to look up the deployment workflows. For multiple provisions, using SQL scripts to directly access Tivoli Provisioning Manager DB is the most convenient solution.
You can draw a flow chart (Figure 7) showing the queue size at each component, based on the numbers you calculate after analyzing the raw data from the database.
Figure 7. Flow chart shows each component's relative queue size
In the flow chart, you see that the reservation component may have more waiting requests. The Tivoli Provisioning Manager Deployment Engine can only provision five virtual machines concurrently in order to ensure the performance on the hypervisor. Even if you manage to improve the capacity on TSAM/TPM, only five provisions can flow to the hypervisor; more provisions will be blocked in their second period.
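The effect of the five-provision limit can be illustrated with a small queue simulation. This is a sketch under simplifying assumptions: every provision takes the same fixed service time, requests are served first come, first served, and the service time used in the usage note is invented. Only the limit of five concurrent provisions comes from the observation above.

```python
import heapq


def simulate_waits(arrivals, service_seconds, slots=5):
    """Waiting time per request for a FIFO queue with `slots` parallel servers."""
    free_at = [0.0] * slots  # time at which each slot next becomes free
    heapq.heapify(free_at)
    waits = []
    for arrival in sorted(arrivals):
        slot_free = heapq.heappop(free_at)  # earliest available slot
        start = max(arrival, slot_free)
        waits.append(start - arrival)
        heapq.heappush(free_at, start + service_seconds)
    return waits
```

For example, `simulate_waits([0] * 20, 600)` models 20 simultaneous requests that each take a made-up 600 seconds: the first five start immediately, and each later group of five waits one more full service time than the group before it. That staggering is consistent with the large response time deviation seen for identical requests.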
The testing method that my team came up with is based on the general provision framework and can be combined with different cloud providers' existing monitoring solutions to help users find provisioning performance issues without relying on details provided by provisioning engines.
Given the mapping relationship between the provision workflows and components, the user can tell the provider which component may have the potential performance problems.
Using native monitoring tools and interfaces, a cloud provider can quickly locate which provision steps on which endpoint and which provision service impacts the whole provision process.
There are still additional improvements needed in this provision test method:
- More performance metrics and data should be collected and organized by time, location, and relationship automatically; this would give the user a clearer picture of the issues.
- More native monitoring tools and methods for different cloud provision engines should be integrated. The mapping between the front-end user data and the back-end component data should be created automatically once a specific provision engine is selected.
- More provision activities should be covered by this test, such as dynamic provisioning, de-provisioning, and application provisioning.