Back up 1000 VMware guests with Tivoli Storage FlashCopy Manager for VMware

A VMware environment with 1000 virtual machines was backed up in 36 minutes using IBM® Tivoli® Storage FlashCopy® Manager for VMware V3.1. This article discusses the program functions and parameters that achieved this result and suggests best practice guidelines.

Poulin M. Kao (pkao@us.ibm.com), Senior Software Engineer, IBM

Poulin Kao is a Software Engineer in IBM Tivoli who works on performance evaluation for Tivoli Storage Manager and related products.



James E. Damgar (jdamgar@us.ibm.com), Staff Software Engineer, IBM

James Damgar is a Software Engineer in IBM Tivoli who works on performance evaluation for Tivoli Storage Manager and related products.



27 November 2012

Overview

With the explosive growth in server virtualization, quick and efficient backup of virtual machines residing in VMware datastores is becoming increasingly critical. IBM Tivoli Storage FlashCopy Manager for VMware Version 3.1, referred to as FlashCopy Manager for VMware in the rest of this article, is designed to meet this demand. As a virtual environment user, you want to back up and restore your VMware datastores quickly, even as your virtual environment grows larger and the applications running on it become more critical to your business success.

To assess how well FlashCopy Manager for VMware meets the needs of VMware virtual environments, the Tivoli Storage Manager performance team conducted FlashCopy backup tests for VMware environments containing up to 1000 online virtual machines, with a total of 18TB of disk space.

The test results for up to 1000 virtual machines (the maximum tested) show that the FlashCopy backup elapsed time increases linearly with the number of virtual machines when the SNAPSHOT_EXCL_MEM snapshot mode and NUMBER_VM_CONCURRENT_TASKS=1 defaults are used. Tests that increased the value for the NUMBER_VM_CONCURRENT_TASKS parameter resulted in substantial improvement to the FlashCopy backup elapsed time. Highlights of the results are:

  • 500 virtual machines can be backed up by FlashCopy Manager for VMware in 15 minutes when SNAPSHOT_EXCL_MEM mode is used and NUMBER_VM_CONCURRENT_TASKS is set to 256.
  • 1000 virtual machines can be backed up by FlashCopy Manager for VMware in 36 minutes when SNAPSHOT_EXCL_MEM mode is used and NUMBER_VM_CONCURRENT_TASKS is set to 256.

However, increasing the value for the NUMBER_VM_CONCURRENT_TASKS parameter does not necessarily suit everyone's environment. We observed its positive and negative impact and formulated some practical recommendations, which are based on FlashCopy Manager for VMware and VMware operation behaviors in different VMware configurations.
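To put the two highlighted results in perspective, the following minimal sketch converts them into average rates. The function names are illustrative only and are not part of FlashCopy Manager for VMware; the input figures are the ones quoted in the highlights above.

```python
# Highlighted results from this article: (number of VMs, elapsed minutes),
# both with SNAPSHOT_EXCL_MEM and NUMBER_VM_CONCURRENT_TASKS=256.
HIGHLIGHTS = [(500, 15), (1000, 36)]


def vms_per_minute(vm_count, elapsed_minutes):
    """Average number of VMs covered per minute of backup elapsed time."""
    return vm_count / elapsed_minutes


def seconds_per_vm(vm_count, elapsed_minutes):
    """Average backup elapsed time attributable to each VM, in seconds."""
    return elapsed_minutes * 60 / vm_count


if __name__ == "__main__":
    for vm_count, minutes in HIGHLIGHTS:
        print(f"{vm_count} VMs: "
              f"{vms_per_minute(vm_count, minutes):.1f} VMs/min, "
              f"{seconds_per_vm(vm_count, minutes):.2f} s per VM")
```

These are averages over the whole run, not per-VM snapshot times. The two runs are also not directly comparable, because the 1000 VMs test hosted 200 VM guests on each ESXi server while the 500 VMs test hosted 100.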


FlashCopy backup test environment

FlashCopy backup tests were conducted in the following test environment. All the resources in this setup were dedicated to this particular test. We used scripts that utilized the FlashCopy Manager for VMware command-line interface (vmcli) so that we could have full control over test execution. Typically, you use the FlashCopy Manager for VMware Data Protection vCenter GUI plug-in, which is integrated with VMware vCenter, to perform FlashCopy Manager for VMware tasks, but the processing is similar.

Hardware and software components

The VMware environment that we tested consisted of five VMware ESXi 5.0 servers, one VMware vCenter server, one Linux backup server (the FlashCopy Manager for VMware backup server), and one IBM System Storage DS8000® storage system. Figure 1 illustrates the environment. The five ESXi servers were IBM x3850 M2 machines with very similar configurations. They differed only slightly in the number of CPU cores and processor speed, and each one was powerful enough to handle the requirements of several hundred virtual machines. Each ESXi server had 128GB RAM, and we needed at least 200 virtual machines (VMs) on each server to reach 1000 VMs in the environment, so we allocated 512MB RAM to each VM. For the 1000 VMs test, 200 VM guests were hosted on each of the five ESXi servers, whereas for the 100-500 VM guests tests, 100 VM guests were hosted on each ESXi server. Detailed lists of the hardware and software components are provided in Supplemental test environment information.
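The memory sizing described above can be checked with a few lines of arithmetic. This is only a rough sketch: it uses the 128GB and 512MB figures from this article and deliberately ignores hypervisor and per-VM memory overhead, so the upper bound it reports is optimistic.

```python
HOST_RAM_GB = 128   # RAM in each ESXi server (from this article)
VM_RAM_MB = 512     # RAM allocated to each VM (from this article)


def committed_ram_gb(vms_per_host, vm_ram_mb=VM_RAM_MB):
    """Guest RAM committed on one host, in GB."""
    return vms_per_host * vm_ram_mb / 1024


def max_vms_by_ram(host_ram_gb=HOST_RAM_GB, vm_ram_mb=VM_RAM_MB):
    """Upper bound on VMs per host by RAM alone, ignoring all overhead."""
    return host_ram_gb * 1024 // vm_ram_mb


if __name__ == "__main__":
    for vms in (100, 200):
        print(f"{vms} VMs per host commit {committed_ram_gb(vms):.0f}GB "
              f"of {HOST_RAM_GB}GB")
    print(f"RAM-only upper bound: {max_vms_by_ram()} VMs per host")
```

With 200 VMs per host, the guests commit 100GB of the 128GB available, which leaves headroom for the hypervisor itself.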

Figure 1. Test environment layout
Diagram of the VMware test environment

Testing considerations

FlashCopy backup tests were performed in increments of 100 VMs up to 500 VMs and then at 1000 VMs. System metrics on the backup server and ESXi servers were collected using OS tools and VMware tools. Virtual machines were powered on and had sporadic file update activity.

By default, FlashCopy Manager for VMware performs a VMware resignature of each FlashCopied LUN to preserve the uniqueness of the LUN. This action applies only to DS8000 and does not affect other IBM storage such as XIV® Storage System, SAN Volume Controller, or Storwize® V7000. We disabled this action for this test because we were interested in doing FlashCopy backups of the VMware datastore LUNs and not in keeping more than one version of a FlashCopied LUN online. For the 1000 VMs test, we updated FlashCopy Manager for VMware to level 3.1.0.1, which includes an enhancement that allowed us to extend the TIMEOUT_FLASH default value from 120 to 300 seconds. The NOCOPY mode for FlashCopy on DS8000 was selected for the FLASHCOPY_TYPE parameter in the FlashCopy Manager for VMware profile.


FlashCopy backup test results

Elapsed time was measured between the start and end of the test script that ran vmcli commands to FlashCopy back up the VMware datastore LUNs. FlashCopy Manager for VMware provides two parameters, VM_BACKUP_MODE and NUMBER_VM_CONCURRENT_TASKS (VMware virtual machine snapshot concurrency), to control how the VMware snapshot is handled before the storage FlashCopy of the datastore LUNs. The default for NUMBER_VM_CONCURRENT_TASKS is 1 and the default for VM_BACKUP_MODE is SNAPSHOT_EXCL_MEM.

Using FlashCopy Manager for VMware defaults, FlashCopy backup elapsed time increased linearly with the number of VMs

Figure 2 shows the results from tests that used the default settings. These results indicate an almost linear increase in the FlashCopy backup elapsed time as the number of VM guests increased to 1000 VMs hosted in 100 VMware datastores. The time spent performing storage FlashCopy, represented by the top line, was sub-second and was not affected by the increase. The storage FlashCopy time was rounded up to 1 second in the graph. The results from the default settings tests indicate that FlashCopy of the VMware datastore LUNs is a time-saving approach for backing up a large VMware setup because hundreds of VMs can be safely backed up in a matter of hours rather than days.

Figure 2. FlashCopy backup elapsed time with default settings
Graphs elapsed FlashCopy backup time against numbers of backed up VMs

Increasing VMware task concurrency can improve FlashCopy backup performance

In addition to the FlashCopy backup tests that used the defaults for VM_BACKUP_MODE and NUMBER_VM_CONCURRENT_TASKS, we also ran our FlashCopy backup tests on 500 VMs and 1000 VMs using various permutations of these two parameters.

  • VM_BACKUP_MODE: controls which type of VMware snapshot is performed prior to the FlashCopy (or allows the VMware snapshot to be skipped). The settings are described in Table 1.
  • NUMBER_VM_CONCURRENT_TASKS: specifies the number of concurrent snapshots of virtual machines that VMware performs at a time. It applies to all VM_BACKUP_MODE settings except ASIS mode. Tests were done with the value set to 1, 2, 4, 8, 16, 32, 64, 128 and 256.

Table 1. VM_BACKUP_MODE and associated VMware actions

Mode | VMware actions
SNAPSHOT_EXCL_MEM | Performs virtual machine snapshot without capturing virtual machine memory content. Default (recommended for most FlashCopy backup scenarios).
SNAPSHOT_INCL_MEM | Performs virtual machine snapshot and captures virtual machine memory content.
SUSPEND | Suspends the virtual machine and then resumes the virtual machine after storage FlashCopy.
ASIS | No snapshot of virtual machine. The storage FlashCopy is performed without any VMware action. NUMBER_VM_CONCURRENT_TASKS parameter does not apply to this mode.
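The parameter combinations described above can be enumerated as a test matrix. The sketch below assumes that every snapshot mode is paired with every tested concurrency value, which the article does not state explicitly, and it uses None as a hypothetical placeholder for the concurrency setting in ASIS mode, where the parameter does not apply.

```python
from itertools import product

# Modes to which NUMBER_VM_CONCURRENT_TASKS applies (from Table 1).
SNAPSHOT_MODES = ["SNAPSHOT_EXCL_MEM", "SNAPSHOT_INCL_MEM", "SUSPEND"]

# Concurrency values tested in this article.
CONCURRENCY_VALUES = [1, 2, 4, 8, 16, 32, 64, 128, 256]


def build_test_matrix():
    """Return (VM_BACKUP_MODE, NUMBER_VM_CONCURRENT_TASKS) pairs to run.

    Assumes a full cross product of modes and concurrency values.
    ASIS gets a single entry because concurrency does not apply to it.
    """
    runs = list(product(SNAPSHOT_MODES, CONCURRENCY_VALUES))
    runs.append(("ASIS", None))
    return runs


if __name__ == "__main__":
    matrix = build_test_matrix()
    print(f"{len(matrix)} runs per VM count")
    for mode, tasks in matrix:
        print(mode, tasks)
```

Under that assumption, each VM count needs 28 runs: 27 for the three snapshot modes and one for ASIS.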

In general, FlashCopy backup performance improved with increased NUMBER_VM_CONCURRENT_TASKS settings for all of the different VM_BACKUP_MODE settings (except ASIS). In our test environment, we observed a limit to the FlashCopy backup performance benefits when the NUMBER_VM_CONCURRENT_TASKS value reached the range of 32-64.

Figure 3 shows the effect of NUMBER_VM_CONCURRENT_TASKS on the FlashCopy backup of 500 VMs.

Figure 3. Elapsed time with various backup modes for 500 VMs backup
Graphs elapsed backup time for 500 VMs based on backup mode and number of concurrent tasks

The performance gains flatten out in the range of about 32-64 concurrent VMware tasks. There is only one data point in the graph for the ASIS mode test because FlashCopy Manager for VMware does not invoke any VMware actions before the storage FlashCopy in that mode.

Figure 4 shows the same trend from the effect of VMware task concurrency on the FlashCopy backup of 1000 VMs.

Figure 4. Elapsed time with various backup modes for 1000 VMs backup
Graphs elapsed backup time for 1000 VMs based on backup mode and number of concurrent tasks

With the exception of ASIS mode, FlashCopy backup consumed substantial resources on the ESXi servers because most of the work was performed by VMware. As VMware task concurrency increased, ESXi server resources such as CPU and disk I/O became saturated, and the performance improvement flattened out in the range of about 32-64 concurrent VMware tasks in the environment we tested.


Scalability considerations

Our test results indicate that the default settings, SNAPSHOT_EXCL_MEM for VM_BACKUP_MODE and NUMBER_VM_CONCURRENT_TASKS=1, are capable of handling many VMware datastores that host hundreds or thousands of virtual machines. Assuming that the storage system in your setup is not over-saturated, we recommend you start with the default settings. When you use the SNAPSHOT_EXCL_MEM mode, you get VMFS-consistent backups and do not spend additional time capturing the working memory of virtual machines during the snapshot time. Some special virtual machines might require working memory in the snapshot, but that is a less common scenario. Setting NUMBER_VM_CONCURRENT_TASKS to 1 does not put a heavy burden on the ESX server. When you combine it with the SNAPSHOT_EXCL_MEM option, you can reap the benefit of fast FlashCopy backups without adding strain to your host ESX servers. However, if you need even faster FlashCopy backups and you have resources to spare on your ESX servers, you can increase the VMware task concurrency by adjusting the NUMBER_VM_CONCURRENT_TASKS setting.

Tip: Tuning the VMware task concurrency is an exercise in tradeoff between ESX server resources, such as CPU and disk I/O, and FlashCopy backup performance. Here are suggestions on how each VM_BACKUP_MODE setting may take up ESX server resources during FlashCopy backups as you adjust NUMBER_VM_CONCURRENT_TASKS to tune your VMware task concurrency.

  • SUSPEND mode
    • ESX server disk I/O can become saturated as more virtual machine data is written to the datastores concurrently.
    • ESX host CPU utilization increases dramatically, particularly on virtual machine resumes after the storage FlashCopy.
    • As observed in our test environment, improvement to the FlashCopy backup elapsed time slowed substantially with 8 or more concurrent VMware tasks. You must perform preliminary testing with your actual environment to determine your optimal NUMBER_VM_CONCURRENT_TASKS setting.
    • Applications are impacted by the need for the virtual machines to temporarily suspend.
  • SNAPSHOT_EXCL_MEM mode
    • As VMware task concurrency increases, CPU utilization of the ESX server is affected by snapshot creation and removal activity, although it is not saturated.
    • With higher VMware task concurrency, there is a tradeoff between faster FlashCopy backup performance and higher ESX server CPU utilization, which may impact the virtual machines running on the ESX servers.
    • In our test environment, FlashCopy backup performance improvements flattened out when NUMBER_VM_CONCURRENT_TASKS reached 32.
    • Due to resource constraints, we did not emulate virtual machines with close to real-life activity. Therefore, you should perform exploratory tests with your key virtual machines running to determine your optimal NUMBER_VM_CONCURRENT_TASKS setting for this mode.
  • SNAPSHOT_INCL_MEM mode
    • Both ESX server CPU and disk utilization were increased with increased VMware task concurrency.
    • There was potential saturation of the disk system hosting the VMware datastores. Our testing used a relatively small 512MB of RAM for each virtual machine. With more RAM per virtual machine, more memory content must be written to disk for each snapshot.
    • In our test environment, ESX server CPU utilization would reach 100% in a short time when we tested with NUMBER_VM_CONCURRENT_TASKS=32.
    • Our virtual machines did not run production-like activity. Therefore, you should perform exploratory tests with your VMware setup to determine your optimal NUMBER_VM_CONCURRENT_TASKS setting for this mode.
    • We observed the most improvement in FlashCopy backup performance with increasing VMware task concurrency.
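One way to act on these suggestions is to measure the backup elapsed time at several concurrency settings in your own environment and look for the point where further increases stop paying off. The following is a minimal sketch of that idea. The function name, the 10% threshold, and the sample values are all hypothetical. In particular, the sample values are placeholders and are not measurements from this article.

```python
def find_plateau(measurements, min_gain=0.10):
    """Return the smallest setting beyond which gains become marginal.

    measurements maps NUMBER_VM_CONCURRENT_TASKS -> elapsed time (any unit).
    A step up is considered worthwhile only if it reduces elapsed time by
    at least min_gain (a fraction of the previous elapsed time).
    """
    settings = sorted(measurements)
    for current, following in zip(settings, settings[1:]):
        gain = (measurements[current] - measurements[following]) / measurements[current]
        if gain < min_gain:
            return current
    # Every step was worthwhile, so the largest tested setting wins.
    return settings[-1]


# PLACEHOLDER values for illustration only, not results from this article.
PLACEHOLDER_MINUTES = {1: 100.0, 2: 55.0, 4: 32.0, 8: 20.0,
                       16: 14.0, 32: 11.0, 64: 10.5}

if __name__ == "__main__":
    print("Suggested setting:", find_plateau(PLACEHOLDER_MINUTES))
```

A helper like this considers elapsed time only. ESX server CPU and disk I/O utilization should be checked alongside it, because a setting that looks good for backup time can still starve the virtual machines running on the host.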

Special notes for VMware datastores on DS8000 storage

If you have DS8000 storage in your VMware setup and intend to use FlashCopy Manager for VMware FlashCopy backup, keep the following points in mind:

  • Resignature, which performs a forced mount of a FlashCopied LUN, is a default action that adds time to the FlashCopy backup. The added time can become substantial when the LUNs are large and numerous. To reduce the overall FlashCopy backup time, you can disable resignature if you do not need to keep more than one FlashCopy backup copy.
  • You might encounter a TIMED-OUT error during FlashCopy backup operations when working with a large number of DS8000 datastores. You can apply patch level 3.1.0.1 for FlashCopy Manager for VMware, which extends the TIMEOUT_FLASH default from 120 to 300 seconds. After the patch is applied, you can also set your own TIMEOUT_FLASH value in the profile file.

Supplemental test environment information

The following details supplement the test environment information provided in FlashCopy backup test environment.

Hardware components

ESXi 5.0 Hosts:

  • ESX hosts: tsmcveh01 (Host 1), tsmcveh03 (Host 3)
    • IBM x3850 M2
    • 4 x 4-core (16 logical CPU) Intel Xeon X7350 @ 2.93GHz
    • 128GB RAM
    • 2 x 146GB 15K Internal SAS drives (mirrored)
    • 2 x 4Gbit dual port QLogic HBA cards
  • ESX hosts: tsmcveh02 (Host 2), tsmcveh04 (Host 4), tsmcveh05 (Host 5)
    • IBM x3850 M2
    • 4 x 6-core (24 logical CPU) Intel Xeon X7460 @ 2.66GHz
    • 128GB RAM
    • 2 x 146GB 15K Internal SAS drives (mirrored)
    • 2 x 4Gbit dual port QLogic HBA cards

vCenter 5.0 Server and FlashCopy Manager for VMware Linux proxy server, each on a separate x3650 M1:

  • IBM x3650 M1
  • 2 x 4-core (8 logical CPU) Intel Xeon E5345 @ 2.33GHz
  • 24GB RAM
  • 2 x 300GB 15K Internal SAS drives (mirrored)
  • 1 x 4Gbit dual port QLogic HBA card

IBM 2005-B16 16-port fibre channel switches (2)

  • Each ESXi Host connected to each switch via a 4Gbit link

DS8000 Model 2107-932 (2-frame)

  • 384 x ~146GB fibre channel drives (15k rpm)
  • 100 volumes, 202GB each, for VMware datastores
  • 100 volumes, 202GB each, for FlashCopy targets
  • A total of 24 6+1P RAID5 arrays and 24 7+1P RAID5 arrays (the smaller arrays are due to hot-spare coverage)
  • Two controller units each currently providing one 4Gbit connection to each of the above FC switches

Software components

  • All x3850 ESXi servers were installed with VMware ESXi v5.0.0-469512 code.
  • The vCenter x3650 server was installed with vCenter Server v5.0.0-455964 on Windows Server 2008 Enterprise (64-bit).
  • One x3650 server was installed as the FCM/TSM4VE proxy (tsmcvefcm01) with Red Hat EL 6.1 (64-bit). It ran the FlashCopy Manager for VMware GA-level code, plus the latest patches.

Configuration details for hosts and virtual machines

  • There was a 1-to-1 mapping between each 202GB DS8000 volume and each VMFS3 datastore.
  • 100 DS8000 volumes were used for datastores and the same number of volumes were reserved for DS8000 FlashCopy targets.
  • Datastores were distributed in a round-robin fashion to each of the 5 ESXi hosts, for a total of 20 datastores per host with all 100 datastores in use.
  • A total of 10 VM guests ran (powered-on, though idle) on each of the 100 total datastores, with a total of 100 VM guests running on each ESXi host for the 100-500 VM guests tests and 200 VM guests running on each host for the 1000 VM guests tests.
  • The following four guest image types were used in this environment (250 total of each) with an approximately even distribution on each datastore:
    • SUSE Linux ES 11 SP1 (64-bit)
    • Red Hat Linux EL 5.6 (64-bit)
    • Windows Server 2003 (64-bit)
    • Windows Server 2008 (64-bit)
  • Each VM guest was provisioned with the following:
    • 1 vCPU
    • 512MB RAM
    • 17GB of virtual hard disk space (thin)
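The layout figures in this list can be cross-checked with simple arithmetic. The sketch below uses only numbers stated in this article, and the variable names are illustrative.

```python
DATASTORES = 100          # VMFS3 datastores, one per DS8000 volume
DATASTORE_GB = 202        # size of each DS8000 volume
VMS_PER_DATASTORE = 10    # VM guests on each datastore
VM_DISK_GB = 17           # thin-provisioned virtual disk per VM
ESXI_HOSTS = 5
GUEST_IMAGE_TYPES = 4     # SUSE, Red Hat, Windows 2003, Windows 2008

total_vms = DATASTORES * VMS_PER_DATASTORE                     # all guests
vms_per_host = total_vms // ESXI_HOSTS                         # 1000 VMs test
vms_per_image_type = total_vms // GUEST_IMAGE_TYPES
provisioned_per_datastore_gb = VMS_PER_DATASTORE * VM_DISK_GB  # if fully used
headroom_per_datastore_gb = DATASTORE_GB - provisioned_per_datastore_gb

if __name__ == "__main__":
    print(f"Total VMs: {total_vms}")
    print(f"VMs per host (1000 VMs test): {vms_per_host}")
    print(f"VMs per guest image type: {vms_per_image_type}")
    print(f"Provisioned per datastore: {provisioned_per_datastore_gb}GB "
          f"of {DATASTORE_GB}GB ({headroom_per_datastore_gb}GB headroom)")
```

Because the virtual disks are thin provisioned, the 170GB per datastore is the maximum the guests could consume rather than the space actually in use.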
