The main reason for implementing virtualization is to maximize the utilization of your infrastructure and to improve return on investment (ROI). When you use virtualization technology, you can share system resources, thereby improving utilization, density, and data-center economics. Today, IBM System x servers can be configured with 16 cores in 2U of rack space and 40 cores in 4U of rack space. That means a typical cluster consisting of three to four servers can provide 48-160 cores in as little as 6U of rack space.
Virtualizing IBM DB2 pureScale instances can provide you with an isolated but consolidated environment for test, QA and production database systems by allowing multiple pureScale clusters to operate within the same set of hardware resources simultaneously. KVM is a Linux hypervisor, implemented as a kernel module, that provides multiple guest virtual machines (VMs) concurrent access to the hardware virtualization features of Intel processors. It uses QEMU (user-space emulator) for I/O hardware emulation. It can also be managed via the libvirt API and UI-based tools. KVM understands the non-uniform memory architecture (NUMA) characteristics of the Intel CPU and provides guest VM support for the remote direct memory access (RDMA) host channel adapter (HCA).
Currently, you can create virtual Ethernet devices, switches, and networks with KVM. However, the capability to create virtual RDMA devices on a hypervisor does not exist. To work around this limitation, you can use PCI passthrough. In System x servers, the KVM hypervisor supports attaching PCI devices on the host system to virtualized guests using the Intel Virtualization Technology for Directed I/O (VT-d). The PCI passthrough allows these guest VMs to have exclusive access to PCI devices for a range of tasks. It also allows those PCI devices to appear and behave as if they were physically attached to and owned by the guest operating system (OS). With PCI passthrough, the guest VMs can become owners of the RDMA-capable adapters that are used in a DB2 pureScale instance.
Below you will find a template solution that will take you through the steps required to configure and deploy the IBM DB2 pureScale feature in a virtualized environment using KVM. The template provides you with a set of possible configurations using different IBM System x servers and explains the cloning of VMs to rapidly deploy additional DB2 pureScale instances. We describe the configuration and performance of DB2 10.1 with the pureScale feature running in Red Hat Enterprise Linux 6.2 guest VMs on a Red Hat Enterprise Linux 6.2 host with the KVM hypervisor. Follow the steps below to deploy the virtualized DB2 pureScale instances on a KVM hypervisor.
- Select the guest options: hardware and software
- Plan and configure the storage area network (SAN)
- Configure the components of KVM
- Create and clone the KVM guests
- Deploy the DB2 pureScale instances
Information provided in Table 1 shows the possible guest options for virtualizing the DB2 pureScale feature using KVM on three different System x servers. For the sample configuration the System x3850 X5 server is used, but to provide you with options, the System x3650 M4 and System x3690 X5 configurations are included. Note that the number of guest VMs for a given server is a function of the number of PCI-E slots and the number of Fibre Channel paths required. For production guests, you would typically configure multiple redundant Fibre Channel paths, but for test and development systems, a single Fibre Channel path may be sufficient.
Table 1. DB2 pureScale virtualization on System x servers — guest options
|System x server||KVM guests (Fibre Channel multipath*)||KVM guests (Fibre Channel single path)||KVM guests (at least 1 Fibre Channel multipath)||Maximum number of PCI-E slots|
|System x3650 M4||3**||4**||3||6|
|System x3690 X5||2||2||2||5|
|System x3850 X5||3||4||4||7|
* Using dual-port Fibre Channel
** Using 10Gb Ethernet Mezzanine option
The number of guest VM options is a factor in determining how many DB2 pureScale instances you can deploy. Using the device passthrough technique, and considering that there are a total of seven PCI-E Gen2 slots on an x3850 server with four sockets, you can create three KVM guests where each physical server will have this configuration:
- Three PCI-E Gen2 slots for InfiniBand */RDMA over Converged Ethernet (RoCE)
- Three PCI-E Gen2 slots for dual-port Fibre Channel (Two ports per VM)
- One PCI-E Gen2 slot - could be dual-port 10Gb Ethernet
The location and layout of a DB2 pureScale instance determines the number of virtualized instances. For example, with four x3850 X5 four-socket servers, you can have up to 12 KVM guest VMs. As Table 2 shows, the number of pureScale instances deployed is determined by how many devices are assigned to each guest VM and whether they are collocated within a guest VM or have their own dedicated guest VM.
Table 2. Maximum number of DB2 pureScale instances on x3850
| ||Collocated member/CF (2 member/2 CF)||Member/CF on dedicated VM (2 member/2 CF)||Mix dedicated/collocated (2 member/2 CF)||Collocated member/CF (4 member/2 CF)||Member/CF on dedicated VM (4 member/2 CF)|
|Maximum number of DB2 pureScale instances||6||3||5||3||2|
Refer to the DB2 Virtualization Support page to see which virtualized environments are supported. At a minimum, the following DB2 pureScale KVM guests options are required:
- Red Hat Enterprise Linux 6.2
- PCI passthrough for GPFS filesystems
- RDMA interconnects
The storage layout for the DB2 pureScale instance requires certain prerequisites and planning. The connections between the storage systems, SAN switches, and hosts need to be configured before configuring the disks. For detailed information, refer to the DB2 pureScale Feature Information Center. For simplicity, we recommend separating the storage by creating dedicated pools with mdisks, so that each instance has its own logical unit numbers (LUNs). Once the disks are created, the hosts can see all the available disks through the Linux multipath driver. The KVM guests will be able to see the disks once the Fibre Channel devices have been assigned to them. To lay out the storage, you must first determine the number of DB2 pureScale instances you will need. Based on the number of instances, you will configure the shared storage disks and tiebreaker disks. For details on DB2 pureScale disk configuration, refer to the DB2 pureScale Information Center.
The hardware should be Intel virtualization (VT-x) and VT-d capable. Intel Xeon X7560-class processors or better are appropriate. Virtualization instruction support must be enabled in the BIOS settings. The PCI device assignment is available on hardware platforms supporting Intel VT-d. View the Enabling Intel VT-x Extensions in BIOS web page for more information.
Install the Linux operating system with the KVM feature. The following are system prerequisites:
- The intel_iommu=on parameter within the kernel boot parameters found in the /etc/grub.conf file
- At least one InfiniBand* or 10 Gb Ethernet (RDMA over Converged Ethernet) adapter and one Fibre Channel port available per guest VM. The InfiniBand and Fibre Channel adapters must be on their own IRQ and must not be sharing it with any other device on the system. Note that KVM cannot start up a guest with a PCI passthrough device that is sharing an IRQ with another device. Check lspci -v output to see if the devices you plan to pass to a KVM guest are sharing an IRQ.
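Both prerequisites can be verified from a shell before creating any guests. The sketch below wraps the intel_iommu check in a small helper; the function name iommu_enabled is ours, not part of any tool, and the sample kernel command line is illustrative.

```shell
#!/bin/sh
# Hypothetical helper: report whether a kernel command line enables the Intel IOMMU.
# The intel_iommu=on parameter must appear in the boot parameters in /etc/grub.conf.
iommu_enabled() {
    case " $1 " in
        *" intel_iommu=on "*) echo yes ;;
        *) echo no ;;
    esac
}

# On a live host, check the running kernel with:
#   iommu_enabled "$(cat /proc/cmdline)"
iommu_enabled "ro root=/dev/sda1 intel_iommu=on rhgb quiet"

# To spot IRQ sharing before assigning a device to a guest, inspect:
#   lspci -v | grep -i irq
```

If the helper reports no, add intel_iommu=on to the kernel line in /etc/grub.conf and reboot before continuing.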
A public network bridge is required if the KVM guests need external network access, including communication to and from an external DB2 client. Refer to the Setting up a Network Bridge in the Host web page for more information. Before proceeding, ensure you have the required RPM packages such as bridge-utils, iproute, and tunctl. A dual-port 10Gb Ethernet adapter is recommended as the shared interface for this type of external traffic, as all of the VMs will share this bridge and ample bandwidth is needed.
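On Red Hat Enterprise Linux 6, such a bridge can be defined with interface configuration files. A minimal sketch, assuming eth4 is one port of the 10Gb adapter and pubBr0 is the bridge name (both names and the IP addressing are examples, not values from the original configuration):

```shell
# /etc/sysconfig/network-scripts/ifcfg-pubBr0 -- the public bridge
DEVICE=pubBr0
TYPE=Bridge
BOOTPROTO=static
IPADDR=192.168.1.10
NETMASK=255.255.255.0
ONBOOT=yes

# /etc/sysconfig/network-scripts/ifcfg-eth4 -- enslave the 10Gb port to the bridge
DEVICE=eth4
BRIDGE=pubBr0
ONBOOT=yes

# Apply the new configuration with:  service network restart
```

After the restart, brctl show should list pubBr0 with eth4 as its attached interface.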
Before starting to create the KVM guests and cloning them, it is important to confirm that you have the latest update for the host OS and device hardware. Refer to the DB2 Information Center for the list of supported levels. Create and clone the guest VMs by following the steps below.
Create the guest VM
Once the host operating system is installed with the KVM features, use the /etc/init.d/libvirtd status command to ensure KVM is running, or the /etc/init.d/libvirtd start command if KVM needs to be started.
The most intuitive method to create a guest VM is by using the graphical virt-manager program. To prepare for this type of install, you should have the Red Hat install media in its .iso file format placed somewhere on the host. For more details, refer to the Creating Guests with virt-manager web page. You will be going through the following high-level steps:
- Click on Create a new virtual machine, edit the VM name (for example, db2mVM1), and select Local install media. Then click Forward to continue.
- Browse to the Red Hat install media .iso file and select the OS type as Linux and choose the correct version. Click Forward to continue.
- Select the number of CPUs and memory for this VM. Click Forward to continue.
- Create a file-based disk to store the OS root image. A minimum of 10 GB is advised. This makes mounting the file-based device from outside the VM much simpler if you need to make any changes to a file on its file system without the VM running. It also allows for flexibility if you want to automate some cloning procedures in the future. You should keep the partitioning as simple as possible by creating only one partition. Click Forward to continue.
- Select the Advanced options setting and change Virtual network from Default to Specify shared device name. An input field for the bridge name appears. Specify the public bridge here (for example, pubBr0). Click Finish. A console will appear and you will go through the Red Hat installation procedure just as you would on a normal bare-metal system.
Make a note of the location of the disk-image file created in Step 4 and the XML file generated by the virt-manager. It is usually located in /etc/libvirt/qemu. You will need these files to clone the other VM images. You should now have a guest VM with an installed OS. The host can also see the guest VM. You can use the virsh command-line-interface tool for managing guests and the hypervisor. Refer to the Managing Guests with virsh web page for more detail.
Confirm the prerequisites for the DB2 pureScale feature
You need to set up and configure at least two networks: an Ethernet network and a high-speed communication network. The high-speed communication network must be an InfiniBand* network or a 10Gb Ethernet network as specified on the DB2 pureScale Linux prerequisites Information Center page. It is important to note that these configurations must be completed after the VM has been created. Refer to the Configuring Ethernet and High-speed Communication Network web page for more detail. To use the guest VM in a DB2 pureScale environment, the same requirements and prerequisites that apply to a physical host must be in place. To confirm the prerequisites, refer to the DB2 pureScale Feature requirements web page. Examples include setting max locked memory to unlimited, setting up rsh/ssh, disabling SELinux, and so on. Satisfying the complete checklist makes the guest image ready for cloning, so the same image can easily be replicated.
Clone the guest VMs based on instance topology
Cloning a guest VM can be accomplished using different methods. For details on cloning the VM images, refer to your vendor's best practices. The steps outlined below are used to clone a guest VM.
- Ensure the source VM is configured properly for a pureScale deployment. It should have the correct OS level, the correct rpm packages installed, and the correct settings to meet all the pureScale prerequisites.
- If the source VM had PCI-passthrough devices, these should be removed before proceeding with the cloning.
- Make a copy of the OS root disk of the VM. This should be a file on disk; if you are unsure of its location, use the command virsh edit on the guest and locate the source file= tag, for example <source file='/home/dir/vm.img'/>. Copy this file to the new image name for the clone.
- Make a copy of the XML file of the first virtual machine and rename it for the new guest. Make the necessary changes within this XML file. The main changes are listed below.
- Name of the image — <name>newname</name>
- Location of the new image file — <source file='/home/dir/vm.img'/>
- MAC address of the bridged interface — <mac address='61:11:11:11:fe:9a'/>
- With the new disk image and XML file in place, use the command virsh define to construct the clone for the new VM.
- Apply any post-installation settings (such as any network setup) using the guest's console.
- Add any PCI passthrough device needed for the cloned VM guest. The method to do this is described next.
Repeat the steps above to clone the required VM guests. Once all the VM guests are created, define the required PCI devices to each guest. Note that each guest will need at least one InfiniBand* or RoCE adapter and one Fibre Channel adapter or port.
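The cloning steps above can be sketched as a short script. The paths, VM names, and MAC address below are assumptions for illustration, and the copy and virsh define calls are left commented so the generated XML can be reviewed before registering the clone.

```shell
#!/bin/sh
# Sketch of the cloning procedure. All paths and names are hypothetical examples.
SRC_XML=/etc/libvirt/qemu/db2mVM1.xml
NEW_XML=/tmp/db2mVM2.xml
SRC_IMG=/home/dir/vm.img
NEW_IMG=/home/dir/vm2.img

# Rewrite the three fields that must differ between clones:
# the domain name, the root-disk image path, and the bridged MAC address.
clone_xml() {  # clone_xml <src.xml> <dst.xml> <newname> <newimg> <newmac>
    sed -e "s|<name>.*</name>|<name>$3</name>|" \
        -e "s|<source file='[^']*'/>|<source file='$4'/>|" \
        -e "s|<mac address='[^']*'/>|<mac address='$5'/>|" \
        "$1" > "$2"
}

# cp "$SRC_IMG" "$NEW_IMG"                                        # copy the OS root disk
# clone_xml "$SRC_XML" "$NEW_XML" db2mVM2 "$NEW_IMG" 61:11:11:11:fe:9b
# virsh define "$NEW_XML"                                         # register the clone
```

Remember that PCI passthrough entries must be removed from the source XML before cloning, as noted in step 2 above.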
Assign the hardware devices to all the guest VMs
Identify the PCI address of the device you want to pass through to the guest, add the appropriate entry for it in the guest's XML configuration file and detach that device from the host system. This can be done through virt-manager or the command-line interface. We used the command-line interface for this example.
- Identify the devices on the host. Note the pci-device number to the left.
# lspci | grep Mellanox
04:00.0 InfiniBand: Mellanox Technologies MT26428...
- Confirm that the same PCI number appears in the list of device nodes produced by virsh nodedev-list --tree. Expect to find pci_0000_04_00_0 in the output, which corresponds to address 04:00.0.
- Show the PCI attributes of the device by examining the XML output of # virsh nodedev-dumpxml pci_0000_04_00_0. In particular, we are interested in the bus, slot, and function values. These values need to be converted into hexadecimal to generate the proper XML in the next step.
# printf %x 4
4
# printf %x 0
0
# printf %x 0
0
The values to use are:
bus='0x4' slot='0x0' function='0x0'
- Edit your guest VM's XML configuration file and add the following entry to establish VT-d PCI passthrough.
# virsh edit db2mVM1
<hostdev mode='subsystem' type='pci' managed='yes'>
  <source>
    <address domain='0x0000' bus='0x4' slot='0x0' function='0x0'/>
  </source>
</hostdev>
- Detach the PCI device from the host system.
- Resolve the symlink:
# readlink /sys/bus/pci/devices/0000\:04\:00.0/driver
../../../../bus/pci/drivers/mlx4_core
- Detach the device from host:
# virsh nodedev-dettach pci_0000_04_00_0
- Check to see if it became a pci stub:
# readlink /sys/bus/pci/devices/0000\:04\:00.0/driver
../../../../bus/pci/drivers/pci-stub
Repeat the above steps for the Fibre Channel devices. Note that when assigning the Fibre Channel device, you are only assigning one port of the adapter, unlike InfiniBand* or RoCE, where you are assigning both ports of the device. Refer to Adding a PCI Device to a Virtualized Guest with virsh web page for more details.
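The hexadecimal conversion and the hostdev stanza can be generated in one step. A sketch, where the helper name pci_hostdev_xml is ours; it takes the decimal bus, slot, and function values reported by virsh nodedev-dumpxml:

```shell
#!/bin/sh
# Hypothetical helper: print the <hostdev> stanza for "virsh edit" from the
# decimal bus/slot/function values shown by "virsh nodedev-dumpxml".
pci_hostdev_xml() {   # pci_hostdev_xml <bus> <slot> <function>
    printf "<hostdev mode='subsystem' type='pci' managed='yes'>\n"
    printf "  <source>\n"
    # printf's %x performs the decimal-to-hexadecimal conversion.
    printf "    <address domain='0x0000' bus='0x%x' slot='0x%x' function='0x%x'/>\n" "$1" "$2" "$3"
    printf "  </source>\n"
    printf "</hostdev>\n"
}

# The Mellanox adapter from the example above (bus 4, slot 0, function 0):
pci_hostdev_xml 4 0 0
```

The conversion matters for higher bus numbers: for example, a device on decimal bus 26 must be written as bus='0x1a' in the guest XML.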
Reversing the PCI passthrough process is simple. The recommended method is listed below.
- Issue the command virsh edit <domain name>, then copy the XML that was added to attach the host PCI device and save it in a new text file somewhere. This is useful in case you want to attach the device to the guest again in the future.
- In the virsh editor, delete the XML corresponding to the host device attachment and save the XML file.
- Issue the command virsh nodedev-reattach <pci address> to assign the device back to the host.
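Before deleting the attachment in the virsh editor, the stanza can be extracted and saved from a dump of the guest XML. A sketch with hypothetical file names; the sample XML written below stands in for real virsh dumpxml output:

```shell
#!/bin/sh
# Sketch: save a guest's <hostdev> passthrough stanza before removing it,
# so it can be re-added later. File names are hypothetical examples.
xml=/tmp/db2mVM1.xml
cat > "$xml" <<'EOF'
<domain>
  <hostdev mode='subsystem' type='pci' managed='yes'>
    <source>
      <address domain='0x0000' bus='0x4' slot='0x0' function='0x0'/>
    </source>
  </hostdev>
</domain>
EOF

# Keep a copy of the stanza for future re-attachment.
sed -n '/<hostdev /,/<\/hostdev>/p' "$xml" > /tmp/hostdev-db2mVM1.xml
cat /tmp/hostdev-db2mVM1.xml

# After editing the stanza out with "virsh edit", give the device back to the host:
#   virsh nodedev-reattach pci_0000_04_00_0
```

On a live host you would produce the input with virsh dumpxml db2mVM1 rather than the canned XML shown here.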
Start the guest VMs
Once the device assignment is completed, start the VM using the virsh start db2mVM1 command. If successful, a message will appear indicating that the VM has been started. Use the virsh list command to show which VMs are currently running.
If you can see your VM in the list but can't ping it or otherwise access it, there's a good possibility its network needs to be set up. To do this, access the VM's console through virt-manager and make any required network changes.
The layout of the DB2 pureScale instances is determined by how many VM guests with the required prerequisites are available. You can create multiple instances based on InfiniBand* and RDMA over Converged Ethernet networks. The instances are made up of members and a coupling facility (CF). With the PCI-E devices assigned to guest VMs, each guest at a minimum should have either an InfiniBand* or RoCE adapter and one Fibre Channel adapter or port. Note that all guests must have access to the shared storage. Refer to the DB2 pureScale Feature web page for details on creating instances.
Based on the above solution template information, our example uses three System x3850 servers, each with four sockets and eight cores per socket. Each physical server has seven PCI slots. With three physical servers, we can create twelve KVM guests with their own dedicated Fibre Channel and RDMA-capable adapters. As shown in Figure 1, with twelve KVM guests, we create four DB2 pureScale instances. Two of the instances are InfiniBand* based and the other two are RoCE based. Each instance is made up of two members and one CF.
Figure 1. High-level infrastructure
Hardware and software
Based on the layout shown in Figure 1, the hardware that is used to configure four DB2 pureScale instances is shown in Table 3. For up-to-date supported versions of these components, consult the DB2 Information Center.
Table 3. Hardware configuration
|IBM System x3850 X5 server with 4 sockets / 32 cores and Intel Xeon X7560 processors||7 PCI-E Gen2 slots|
vendor_id : GenuineIntel
model name : Intel(R) Xeon(R) CPU X7560 @ 2.27GHz
cpu MHz : 1064.000
cache size : 24576 KB
cpu cores : 8
flags : fpu vme de pse tsc msr pae mce cx8 apic mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts rep_good xtopology nonstop_tsc aperfmperf pni dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm dca sse4_1 sse4_2 x2apic popcnt lahf_lm ida epb dts tpr_shadow vnmi flexpriority ept vpid
- 2 ConnectX VPI PCIe 2.0 5GT/s - InfiniBand* QDR / 10Gb Ethernet
- 2 ConnectX EN 10GbE, PCIe 2.0 5GT/s
- 2 ISP2532-based 8Gb Fibre Channel to PCI Express HBA
- 2 integrated 1Gb Ethernet ports
|Mellanox IS5030 QDR InfiniBand* Switch||36 ports|
|IBM System Storage SAN40B-4||40 ports|
|IBM (BNT) 10Gb Ethernet - G8124 – RoCE||24 ports|
|IBM Storwize® V7000||- 18 300GB 10K RPM 2.5" SAS HDD|
- 6 300GB 2.5" SAS SSD (eML)
Based on the DB2-supported virtualization environments for x86 and x64 architectures, the software levels shown in Table 4 are used.
Table 4. Minimum software configuration
|Software||Minimum level|
|Red Hat Enterprise Linux (host and guest)||6.2|
|IBM DB2 with pureScale feature||10.1, FP1|
To isolate the disk usage by the various VMs at the storage level, the Mdisks and pools were created as shown in Figure 2 and Figure 3, respectively. Typically, the disk layout and their sizes will depend on factors such as size of the database, log activities, and so on.
Figure 2. Mdisks by pools
For simplicity, we only used one shared disk to contain the instance, database and log data, which resided entirely on SSD drives. For production-type databases, the layout may be different. To see the recommended storage configuration, refer to the IBM DB2 pureScale Feature Information Center.
Figure 3. Volumes assigned to Mdisks
For optimal performance and workload isolation, the best practice is to pin the entire set of a VM's virtual CPUs (VCPUs) to a set of host CPUs that reside in the same NUMA domain or even the same CPU socket. The lscpu command reports which host CPUs belong to which NUMA domain or CPU socket. We assigned VCPUs to guests as shown in Figure 4. A process running on the host is allowed to run on any CPU; a pinned VCPU, however, will always be scheduled on its particular host CPU. We sized our VMs, and chose the number of VMs, such that the total number of VCPUs would equal the number of physical CPUs provided by the host. As a best practice, we did not allocate more VCPUs than host CPUs, because over-committing CPU resources can be detrimental to performance. We also made sure that each VM's VCPU set was allocated out of the same host NUMA domain.
Figure 4. Allocation of virtual CPUs to virtual machines
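With libvirt, the pinning itself can be applied with virsh vcpupin. The sketch below emits the commands for review rather than executing them; the VM names and the eight-VCPUs-per-guest sizing are assumptions taken from our example layout.

```shell
#!/bin/sh
# Sketch: pin each VM's VCPUs to a contiguous block of host CPUs that lies in
# one NUMA domain (use lscpu to map host CPUs to NUMA nodes/sockets first).
pin_vm() {   # pin_vm <domain> <first_host_cpu> <vcpu_count>
    dom=$1; base=$2; n=$3
    vcpu=0
    while [ "$vcpu" -lt "$n" ]; do
        # Echoed for review; drop the "echo" to apply the pinning with virsh.
        echo virsh vcpupin "$dom" "$vcpu" $((base + vcpu))
        vcpu=$((vcpu + 1))
    done
}

pin_vm db2mVM1 0 8   # first guest on host CPUs 0-7
pin_vm db2mVM2 8 8   # second guest on host CPUs 8-15
```

Pinning 1:1 this way keeps the total VCPU count equal to the host CPU count, matching the no-overcommit practice described above.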
Virtualization has an impact on performance. To understand the impact of virtualization on a DB2 pureScale instance, we measured the throughput of an OLTP-type workload using physical and virtualized environment instances.
For the bare-metal comparison, we created a single DB2 pureScale instance using three physical servers, where two of the servers were designated as members and one as the CF. Given that each server has four sockets with eight cores per socket, we disabled three sockets to keep the host CPU count at 16 on a single NUMA node, and we reduced the memory to 64GB on each physical server. This keeps the characteristics of the physical and virtualized environments similar.
The TPC-E benchmark simulates the OLTP workload of a brokerage firm. We used a similar benchmark internally for the DB2 pureScale environment. Remote clients drive the workload to simulate an OLTP-type workload, with a read/write mixture of 70 percent and 30 percent, respectively. Each run spawns multiple threads and clients to simulate real-world applications accessing the DB2 pureScale database. At the end of a successful run, the benchmark produces several reports, including the transactions per second of each run. Once the measurements for the bare-metal server were collected, we configured four virtual instances based on the sample configuration described above. The four virtualized DB2 pureScale instances have the same CPU cores, memory, and CPU flags as the physical servers.
As shown in Figure 5, the aggregate transactions per second for two virtual instances running simultaneously show a 1.8x improvement in throughput over one virtualized instance. Based on the throughput measurements, one virtualized DB2 pureScale instance using PCI passthrough has 22 percent less throughput than a bare-metal instance. Key Linux performance characteristics, such as CPU utilization and disk response times, reflect this 22 percent performance loss. In our experience, this type of trade-off is typical of virtualized systems.
Figure 5. Aggregate throughput for virtualized instances
Virtualization of DB2 pureScale offers a flexible way to consolidate multiple DB2 pureScale applications on the same physical hardware and allows you to maximize utilization of this hardware. As with most workloads, the decision whether or not to virtualize is based on a trade-off between optimal performance and flexibility, utilization and cost. With DB2 pureScale, only one instance per OS is permitted, which increases the value of this flexibility. By choosing to virtualize DB2 pureScale on Red Hat Enterprise Linux KVM, you will receive the added benefits of OS isolation, CF isolation, and GPFS and disk isolation, which are not possible today in DB2 pureScale with other methods of multi-tenancy. Virtualization brings an improved infrastructure ROI for your System x servers, and this article has shown the solution template and detailed steps of how one could virtualize IBM DB2 pureScale feature on System x servers.
* InfiniBand was tested in a lab environment but is not supported for production as of this writing.
- Learn about IBM DB2 pureScale Feature usage.
- Explore KVM (Kernel-based Virtual Machine) to learn about its architecture.
- Review how workload performance of IBM DB2 on Red Hat Enterprise Virtualization scaled equally well with increasing numbers of virtual machines/hosts and numbers of vCPUs/guests.
- Access IBM System for detailed information.
Get products and technologies
- Use the IBM x86 Upgrade Advisor to find the right system to support your data center upgrade.
- Download a trial version of DB2 for Linux, UNIX, and Windows to test it for yourself.
- Get involved in the My developerWorks community. Connect with other developerWorks users while exploring the developer-driven blogs, forums, groups, and wikis.
Miso has been with IBM since 2000. Over the years he has worked on a diverse set of projects, with a focus on DB2 in the areas of performance, integration with and exploitation of hardware, and the design of workload-optimized systems. Currently, Miso manages the DB2 Performance Benchmarking team.