To understand the performance of ephemeral storage in a SmartCloud environment, we set up a benchmark test using vdbench. vdbench is an I/O workload generator for verifying data integrity and measuring the performance of direct-attached and network-connected storage. It's free, easy to use, and commonly used for testing and benchmarking.
What we learned is that the cache is an important element in I/O benchmarking. This article walks you through the steps we took to set up our benchmark and shows you the results. But first, let's look briefly at the products and tools we used.
We started with SmartCloud, a public cloud service based on Kernel-based Virtual Machine (KVM), a full virtualization solution for Linux® on x86. SmartCloud is a complete Infrastructure as a Service (IaaS) offering. You can use it as a foundation to build customized Platform as a Service (PaaS) and Software as a Service (SaaS) solutions.
Key features of SmartCloud include:
- Self-service provisioning of virtual server machines and virtual storage space (persistent storage)
- Networking capabilities
- Pay-per-use billing policy
- Automatic provisioning of resources
- Open API to develop scripts and software to enhance automation
- VM sizing spans from one virtual CPU and 2GB of memory up to 16 CPUs and 32GB of RAM.
- Both Windows® and Linux operating systems can be deployed, including Red Hat Enterprise Linux and SUSE Linux.
SmartCloud offers three types of storage:
- Ephemeral storage is associated with a VM when it's provisioned.
- Persistent storage is network-attached storage that can be dynamically attached to and detached from an active instance.
- Object storage, provided in collaboration with Nirvanix, can be configured as a storage-on-demand solution for unstructured data.
In our benchmarking tests, we focused on ephemeral storage.
We used vdbench to test raw disks and file systems. It has a web user interface for detailed performance reports. vdbench was developed by Henk Vandenbergh of Sun Microsystems, formerly StorageTek. vdbench is written in Java®. It's been tested on Solaris, Windows, HP-UX, AIX, Linux, Mac OS X, zLinux, and native VMware.
Ephemeral storage is created when a VM is provisioned. Its lifecycle is directly related to the instance to which it is bound. Ephemeral storage is created from the local disk that resides in the nodes. Throughput varies greatly, based on what else is going on in the shared infrastructure.
The size of ephemeral storage ranges from 60GB up to 2048GB for the largest VM that can be provisioned. If full instance storage isn't necessary, you can provision virtual machines with the minimum amount of ephemeral storage (60GB). Provisioning with minimal storage can reduce provisioning times for large virtual machine types. Virtual machine instance storage is erased when the instance is deleted.
Figure 1. Provisioning a VM with ephemeral storage
In contrast, to store data for longer periods, blocks of persistent storage are available. Unlike ephemeral storage, persistent storage has a lifetime that is not coupled to a VM, and it's billed independently. Persistent storage can be dynamically attached to or detached from a VM because it is a network-attached storage (NAS) raw disk, which must be formatted and mounted from within the guest, as illustrated below.
Figure 2. Persistent storage through a network-attached VM
In our benchmark testing, we created ephemeral storage by provisioning a VM using Red Hat Linux. It looks like this:
Figure 3. Provisioning a VM for ephemeral storage using Red Hat Linux
You should have SmartCloud installed and running. Next, you must allocate a virtual block device and an ext3 file system, and then install vdbench.
Allocating a virtual block device
The following code allocates a virtual block device:
Listing 1. Allocating a virtual block device
[idcuser@vhost1291 /]$ dd if=/dev/urandom of=/home/idcuser/disk1.raw bs=512 count=2097152 oflag=sync,direct
[idcuser@vhost1291 ~]$ sudo parted disk1.raw mklabel msdos
[idcuser@vhost1291 ~]$ sudo losetup -f disk1.raw
[idcuser@vhost1291 ~]$ sudo losetup -a
/dev/loop0: [fc02]:234430 (/home/idcuser/disk1.raw)
[idcuser@vhost1291 ~]$ sudo fdisk /dev/loop0
Command (m for help): n
Command action
e extended
p primary partition (1-4)
p
Partition number (1-4): 1
First cylinder (1-130, default 1):
Using default value 1
Last cylinder, +cylinders or +size{K,M,G} (1-130, default 130):
Using default value 130
Command (m for help): w
[idcuser@vhost1291 ~]$ sudo fdisk -ul /dev/loop0
Disk /dev/loop0: 1073 MB, 1073741824 bytes
255 heads, 63 sectors/track, 130 cylinders, total 2097152 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x0008f4b6
Device Boot Start End Blocks Id System
/dev/loop0p1 63 2088449 1044193+ 83 Linux
[idcuser@vhost1291 ~]$ sudo losetup -o 32256 -f /dev/loop0
[idcuser@vhost1291 ~]$ sudo losetup -a
/dev/loop0: [fc02]:234430 (/home/idcuser/disk1.raw)
/dev/loop1: [0005]:5396 (/dev/loop0), offset 32256
[idcuser@vhost1291 ~]$ sudo mkfs -t ext3 /dev/loop1
mke2fs 1.41.12 (17-May-2010)
Filesystem label=
OS type: Linux
Block size=4096 (log=2)
Fragment size=4096 (log=2)
Stride=0 blocks, Stripe width=0 blocks
65536 inodes, 262136 blocks
13106 blocks (5.00%) reserved for the super user
First data block=0
Maximum filesystem blocks=268435456
8 block groups
32768 blocks per group, 32768 fragments per group
8192 inodes per group
Superblock backups stored on blocks:
32768, 98304, 163840, 229376
Writing inode tables: done
Creating journal (4096 blocks): done
Writing superblocks and filesystem accounting information: done
This filesystem will be automatically checked every 36 mounts or
180 days, whichever comes first. Use tune2fs -c or -i to override.
[idcuser@vhost1291 ~]$ sudo losetup -d /dev/loop1
[idcuser@vhost1291 ~]$ sudo losetup -d /dev/loop0
[idcuser@vhost1291 ~]$ mkdir partition1
[idcuser@vhost1291 ~]$ sudo mount -t ext3 -o sync,loop,offset=32256 disk1.raw partition1/
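The sizes and the offset in Listing 1 follow from simple sector arithmetic. The sketch below uses values copied from the listing (the function and constant names are illustrative only) to check that 2097152 sectors of 512 bytes give the 1073741824-byte image reported by fdisk, and that a partition starting at sector 63 lies 32256 bytes into the file, which is the value passed to losetup -o and to the mount offset= option.

```python
SECTOR_SIZE = 512        # bytes, the bs= value passed to dd
SECTOR_COUNT = 2097152   # the count= value passed to dd
START_SECTOR = 63        # "Start" column reported by fdisk -ul for /dev/loop0p1


def image_size_bytes(sector_count=SECTOR_COUNT, sector_size=SECTOR_SIZE):
    """Total size in bytes of the raw image file created by dd."""
    return sector_count * sector_size


def partition_offset_bytes(start_sector=START_SECTOR, sector_size=SECTOR_SIZE):
    """Byte offset of the first partition inside the image file."""
    return start_sector * sector_size


if __name__ == "__main__":
    print(image_size_bytes())        # 1073741824
    print(partition_offset_bytes())  # 32256
```

If the partition started at a different sector, the same multiplication would give the offset to use.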
Allocating an ext3 file system
The following code allocates an ext3 file system:
Listing 2. Allocating an ext3 file system
[idcuser@vhost1291 ~]$ sudo dd if=/dev/urandom of=fs_1GB.ext3 bs=512 count=2097152 oflag=sync,direct
[idcuser@vhost1291 /]$ sudo chmod 664 fs_1GB.ext3
[idcuser@vhost1291 /]$ sudo mkfs -t ext3 -q fs_1GB.ext3
fs_1GB.ext3 is not a block special device.
Proceed anyway? (y,n) y
[idcuser@vhost1291 /]$ sudo mount -t ext3 -o sync,loop fs_1GB.ext3 /mnt/fs_mount
[idcuser@vhost1291 mnt]$ fdisk -l fs_1GB.ext3
Disk /fs_1GB.ext3: 0 MB, 0 bytes
255 heads, 63 sectors/track, 0 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x00000000
Disk /fs_1GB.ext3 doesn't contain a valid partition table
[idcuser@vhost1291 /]$ sudo umount -l /mnt/fs_mount/
[idcuser@vhost1291 /]$ sudo losetup -d /dev/loop0
Now it's time to download and install vdbench.
- Go to Sourceforge and download the tool.
- Transfer it to the VM in SmartCloud with the WinSCP tool. Unpack it into the /var/www/html directory of the Linux VM.
Listing 3. Unpacking vdbench
[root@vhost4377 idcuser]# mv vdbench502.zip /var/www/html/
[root@vhost4377 idcuser]# cd /var/www/html/
[root@vhost4377 html]# mkdir vdbench502
[root@vhost4377 html]# unzip vdbench502.zip -d vdbench502/
[root@vhost4377 html]# cd vdbench502
- To run the tool, you need to install the Java Runtime Environment, in this case the Oracle JDK.
Listing 4. Installing Java Runtime Environment
[idcuser@vhost4377 ~]$ sudo -s
[root@vhost4377 idcuser]# cd
[root@vhost4377 ~]# wget http://download.oracle.com/otn-pub/java/jdk/7u3-b04/jdk-7u3-linux-i586.rpm
[root@vhost4377 ~]# rpm -ivh jdk-7u3-linux-i586.rpm
[root@vhost4377 ~]# vi .bashrc
JAVA_HOME=/usr/java/jdk1.7.0_03
CLASSPATH=.:$JAVA_HOME/lib/tools.jar
PATH=$JAVA_HOME/bin:$PATH
export JAVA_HOME CLASSPATH PATH
[root@vhost4377 ~]# source .bashrc
- Before you start the benchmark, use default parameters to see if the installation is OK.
[root@vhost4377 vdbench502]# ./vdbench -t
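Because vdbench is written in Java, a failed test run is often just a PATH problem. As a quick sanity check before running the tool, the following sketch looks for the java launcher on the PATH set in .bashrc. It is a convenience written for this article, not part of vdbench, and the function name find_java is hypothetical.

```python
import shutil


def find_java(command="java"):
    """Return the full path of the Java launcher found on PATH, or None."""
    return shutil.which(command)


if __name__ == "__main__":
    path = find_java()
    if path is None:
        print("java not found on PATH; check JAVA_HOME and PATH in .bashrc")
    else:
        print("java found at", path)
```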
To get more control over the tool parameters, write a parmfile where you can specify different options.
For a virtual block device
For a virtual block device, configure these parameters:
HD: Host Definition
- hd= localhost if you want to bench the current host. label if you want to specify a remote host.
- system= IP address or network name.
- clients= # to simulate the running of multiple clients for the server.
SD: Storage Definition
- sd= name to identify the storage.
- host= ID of the host where the storage can be found.
- lun= name of the raw disk or tape or file system. vdbench can also create a disk for you.
- threads= # max of concurrent I/O requests for the SD. Default is 8.
- hitarea= size to tune the read hit percentage. Default 1m.
- openflags= flag_list to use opening a lun or a file.
WD: Workload Definition
- wd= name to identify the workload.
- sd= ID of the storage definition to use.
- host= ID of the host to run this workload on. Default is localhost.
- rdpct= # percentage of read requests over the total.
- rhpct= # read hit percentage. Default is 0.
- whpct= # write hit percentage. Default is 0.
- xfersize= size of data to transfer. Default is 4k.
- seekpct= # percentage of random seeks. Can be random.
- openflags= flag_list to use opening a lun or a file.
- iorate= # fixed I/O rate for this workload.
RD: Run Definition
- rd= name to identify the run.
- wd= ID of the workload to use for this run.
- iorate= (#,#,...) one or more I/O rates.
- curve : performance curve (to be defined).
- max : uncontrolled workload.
- elapsed= time : duration in seconds of the run. Default is 30.
- warmup= time : warmup period, ignored by the final means.
- distribution= of the I/O requests: exponential, uniform, or deterministic.
- pause= time in seconds to sleep before the next run.
- openflags= flag_list to use opening a lun or a file.
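To show how the HD, SD, WD, and RD parameters fit together, here is a minimal sketch that assembles a block-device parmfile as text. The helper names definition and block_device_parmfile are hypothetical and are not part of vdbench; only the parameter names come from the lists above and from the listings in this article, and the default values are illustrative.

```python
def definition(**params):
    """Join keyword parameters into one comma-separated parmfile line."""
    return ",".join(f"{key}={value}" for key, value in params.items())


def block_device_parmfile(lun, xfersize=4096, rdpct=100, iorate=100, elapsed=10):
    """Build a three-line parmfile: one storage, one workload, one run."""
    lines = [
        # SD: which device or file to test
        definition(sd="sd1", lun=lun),
        # WD: what kind of I/O to generate against that storage
        definition(wd="wd1", sd="sd1", xfersize=xfersize, rdpct=rdpct),
        # RD: how fast and for how long to run the workload
        definition(rd="run1", wd="wd1", iorate=iorate, elapsed=elapsed, interval=1),
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(block_device_parmfile("/home/idcuser/disk1.raw"), end="")
```

Each definition refers to the previous one by name (the WD names its SD, the RD names its WD), which is why the order of the lines matters.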
For a file system
For a file system, configure these parameters:
HD: Host Definition. Same as for a virtual block device.
FSD: File System Definition
- fsd= name to identify the file system definition
- anchor= directory where the directory structure will be created
- width= # directory to create under the anchor
- depth= # levels to create under the anchor
- files= # files to create at the lowest level
- sizes= (size,size,...) size of the files that will be created
- distribution= bottom if you want to create files only at the lowest level, all if you want to create files in all directories
- openflags= flag_list to use opening a file system (Solaris)
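The width, depth, files, and distribution parameters together determine how large the generated tree is. The sketch below estimates the number of directories and files, assuming that every directory holds width subdirectories at each of the depth levels and that distribution=all places files files in every directory. The function fsd_counts is hypothetical, and vdbench's own accounting may differ in detail.

```python
def fsd_counts(width, depth, files, distribution="bottom"):
    """Estimate (directories, files) created under the anchor.

    Assumes `width` subdirectories per directory at each of `depth` levels.
    """
    if width < 1 or depth < 1:
        raise ValueError("width and depth must be at least 1")
    # Level 1 has width directories, level 2 has width**2, and so on.
    dirs_per_level = [width ** level for level in range(1, depth + 1)]
    total_dirs = sum(dirs_per_level)
    if distribution == "all":
        total_files = total_dirs * files          # files in every directory
    elif distribution == "bottom":
        total_files = dirs_per_level[-1] * files  # files only in the leaves
    else:
        raise ValueError("distribution must be 'bottom' or 'all'")
    return total_dirs, total_files


if __name__ == "__main__":
    print(fsd_counts(2, 3, 2, "all"))     # (14, 28)
    print(fsd_counts(2, 3, 2, "bottom"))  # (14, 16)
```

Under these assumptions, width=2, depth=3, files=2 yields 14 directories, with 28 files for distribution=all and 16 files for distribution=bottom.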
FWD: File System Workload Definition
- fwd= name to identify the file system workload definition.
- fsd= ID of the file system definition to use.
- host= ID of the host to use with this workload.
- fileio= random or sequential, how file I/O will be done.
- fileselect= random or sequential, how to select files or directory.
- xfersizes= size for data transfer (read and write operations).
- operation= mkdir, rmdir, create, delete, open, close, read, write, getattr and setattr. Choose a single file operation to do.
- rdpct= # percentage of reads (for read and write operations only).
- threads= # concurrent threads for this workload. At least one file is needed for each thread.
RD: Run Definition
- fwd= ID of the file system workload definition to use.
- fwdrate= # file system operations per second.
- format= yes / no / only / restart / clean / directories. Operation to do before starting the run.
- operations= override the fwd operations. Same options.
Output folder files after run
After every run, vdbench creates an output folder containing these files:
- errorlog.html
- When data validation is enabled for the test, it can contain information about errors in some data blocks:
- Invalid key(s) read
- Invalid lba read (logical byte address of a sector)
- Invalid SD or FSD name read
- Data corruption even when using wrong lba or key
- Data corruption
- Bad sectors
- flatfile.html
- Contains vdbench-generated information in a column-by-column ASCII format.
- histogram.html
- A text-formatted file that reports response time histograms.
- logfile.html
- Contains a copy of each line of information that has been written by the Java code to the console window. Logfile.html is primarily used for debugging purposes
- parmfile.html
- Shows the final result of everything included in the parameter files used for the test.
- resourceN-M.html, resourceN.html, resourceN.var_adm_msgs.html
- Summary report
- stdout/stderr report
- Host N summary report(s)
- Last 'nn' lines of the files /var/adm/messages and /var/adm/messages.0 on the target host N, for each M JVM/slave and for host N.
- sdN.histogram.html, sdN.html
- Histogram and storage definition "N" report for each N storage definition.
- summary.html
- The main report file that shows the total workload generated for each run per reporting interval, and the weighted average for all intervals except the first.
- interval: Reporting interval sequence number
- I/O rate: Average observed I/O rate per second
- MB sec: Average number of megabytes of data transferred per second
- bytes I/O: Average data transfer size
- read pct: Average percentage of reads
- resp time: Average response time measured as the duration of the read/write request. All vdbench times are in milliseconds.
- resp max: Maximum response time observed in this interval. The last line contains total max.
- resp stddev: Standard deviation for response time
- cpu% sys+usr: Processor busy = 100 (system + user time) (Solaris, Windows, Linux)
- cpu% sys: Processor utilization: system time
- swat_mon.txt, swat_mon_total.txt
- vdbench, in cooperation with the Sun StorageTek™ Workload Analysis Tool (Swat) Trace Facility (STF), allows you to replay the I/O workload of a trace created using Swat.
- A trace file created and processed by Swat using the Create Replay File option creates file flatfile.bin (flatfile.bin.gz for vdbench403 and up) which contains one record for each I/O operation identified by Swat.
- These files contain a formatted report that can be imported into Swat Performance Monitor (SPM) for the creation of performance charts.
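The I/O rate, bytes I/O, and MB sec columns of summary.html described above are related by simple arithmetic, which is useful when checking a report by hand. The sketch below assumes that one MB in the report means 2**20 bytes; the function names are illustrative and are not part of vdbench.

```python
MIB = 1024 * 1024  # assumption: one "MB" in the report is 2**20 bytes


def mb_per_sec(io_rate, xfersize_bytes):
    """Throughput in MB/sec for a given I/O rate and transfer size."""
    return io_rate * xfersize_bytes / MIB


def io_rate_for(mb_sec, xfersize_bytes):
    """I/O operations per second needed to reach a given MB/sec."""
    return mb_sec * MIB / xfersize_bytes


if __name__ == "__main__":
    print(mb_per_sec(100, 4096))  # 0.390625
    print(io_rate_for(7, 4096))   # 1792.0
```

For example, 100 operations per second with 4KB transfers is about 0.39 MB/sec, and sustaining 7 MB/sec with 4KB transfers takes about 1792 operations per second.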
Benchmarking tests and results
We ran three different tests.
In the first test, a parmfile is used to define a single run test against a single raw disk.
The storage called sd1 points to the device /dev/rdsk/c0t0d0s0 with a size of 4MB.
The workload wd1 is defined to run over storage sd1 with 100% of read operations, each involving a 4KB block.
The run is defined in order to use the workload wd1 at a rate of 100 ops/sec for 10 seconds.
Listing 5. Parmfile defining a single run test against a single raw disk
*SD: Storage Definition
*WD: Workload Definition
*RD: Run Definition
*
sd=sd1,lun=/dev/rdsk/c0t0d0s0,size=4m
wd=wd1,sd=sd1,xfersize=4096,readpct=100
rd=run1,wd=wd1,iorate=100,elapsed=10,interval=1
*Single raw disk, 100% random read of 4KB blocks at I/O rate of 100 for 10 seconds
In the second test, the parmfile defines a test over a file system and against the remote host 129.35.213.249. The HD includes the full path of vdbench installed on the remote host and the shell to use for communication (ssh or vdbench's own rsh).
In this case, shell=vdbench selects vdbench's own rsh. The setting user=root names the user that owns the process on the remote host. The file system is defined to support the creation of a structure with three levels, each containing two directories and two files (thanks to the option distribution=all). The workload is made up of 80% read operations (20% writes), with random file selection and random file I/O. Operations involve 4KB blocks.
The run uses this workload at the maximum rate for 600 seconds, reporting at each interval and formatting the file system at the start.
Listing 6. Parmfile defining a test over a file system and against a remote host
hd=resource1,system=129.35.213.249,vdbench=/var/www/html/vdbench,shell=vdbench,user=root
fsd=fsd1,anchor=/fs,width=2,depth=3,files=2,distribution=all,
size=4m,openflags=(o_dsync,o_rsync)
fwd=fwd1,fsd=fsd1,host=resource1,xfersize=4096,operation=read,rdpct=80,
fileselect=random,fileio=random,threads=1
rd=run1,fwd=fwd1,fwdrate=max,format=yes,elapsed=600,interval=3
In the third test, the parmfile defines a SmartCloud host on which to execute the test. The storage is a purpose-built 1GB virtual block device called disk1.raw.
The settings are all chosen to reproduce the worst possible situation: there is no hit area to speed up the response time, and files or devices are opened with the o_direct open flag to avoid the buffer cache.
The workload definition reinforces this by avoiding both read and write hits (rhpct=0, whpct=0) and by using 100% random seeks.
The test runs 11 times, for 1500 seconds each, following a defined curve. The first run attempts to discover the maximum I/O rate. The other runs step the I/O rate from 10% to 100% of that maximum in increments of 10%. Each run produces a workload with an exponential distribution.
Listing 7. Parmfile defining a SmartCloud host over which to execute the test
hd=resource1,system=129.35.209.189,vdbench=/var/www/html/vdbench503,
shell=vdbench,user=root
sd=sd1,lun=/home/idcuser/disk1.raw,hitarea=0m,offset=32256,
openflags=o_direct
wd=wd1,sd=sd1,host=resource1,xfersize=4096,rdpct=40,rhpct=0,whpct=0,
seekpct=100
rd=run1,wd=wd1,iorate=curve,curve=(10-100,10),format=yes,elapsed=1500,
warmup=18,distribution=exponential,pause=60,interval=6,threads=1
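The curve setting in this parmfile can be pictured as a list of target I/O rates derived from the maximum found in the first, uncontrolled run. The sketch below illustrates that idea only. The function curve_rates is hypothetical, and the maximum rate of 2000 operations per second in the example is an invented figure, not a measured result.

```python
def curve_rates(max_rate, start=10, end=100, step=10):
    """Return (percentage, target I/O rate) pairs for the controlled runs."""
    return [(pct, max_rate * pct / 100) for pct in range(start, end + 1, step)]


if __name__ == "__main__":
    # 2000 ops/sec is a made-up maximum used only for illustration.
    for pct, rate in curve_rates(2000):
        print(f"{pct:3d}% -> {rate:.0f} ops/sec")
```

With curve=(10-100,10) this gives ten controlled runs, which together with the initial uncontrolled run accounts for the 11 runs in the test.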
The VMs used in the benchmark study are:
- 1 x Client VM Red Hat Enterprise Linux 6.1 Silver 32_bit (2 vCPU, RAM: 4GB, Disk: 410 GB)
- 1 x Resource VM Red Hat Enterprise Linux 6.1 Copper 32_bit (1 vCPU, RAM: 2GB, Disk: 60GB)
Figure 4. VMs in the benchmark study
Figure 5 shows the response time of the read operations (in green) and the response time of the write operations (in yellow) with cache enabled (without the o_direct flag). It's clear that reads are quite stable, while writes tend to saturate near 60% of the value averaged by the initial "uncontrolled curve" run. With cache enabled, the first run ("unc") is biased by the first access to blocks, so it's not representative of the real performance of the following runs.
Figure 5. Response times
Figure 6 shows the same measurements but with the buffer cache disabled. The trend is not regular because performance is more affected by the actual underlying utilization of the disk. In any event, it is clear that the order of magnitude is higher than in the case with cache enabled. So, results are better and more stable with cache enabled.
Figure 6. Results without buffer cache enabled
Figure 7 shows the average rate in MB/sec of the total I/O mix of operations, with the cache enabled. You can see that the trend is linear and very regular. It is important to note that the graph shows a mean for each run, starting from the mean of the first uncontrolled run. There are values higher than 7MB/sec, sometimes near 60MB/sec, with 60% write operations: quite good results.
Figure 7. Average rate in MB/sec of the total I/O mix of operations
Figure 8 shows the same measurements but with the buffer cache disabled. As in Figure 6, the trend is not regular. You can see a drop in performance for the 80% and 90% runs. A likely explanation is significant use of the disk by another VM on the same cloud node. Finally, you can see that the order of magnitude of the rate is much lower than in the case with cache enabled.
Figure 8. Results with cache buffer disabled
What issues arose from the tests? And what did we learn?
The cache is a crucial factor in I/O benchmarking. It is a very complicated variable to deal with because it depends on so many factors. Cache influences read/write performance, but only if multiple accesses to a single block occur or when delayed write is enabled.
Sometimes you are interested in the real, mechanical performance of a storage device and want to test the worst situation in which it can be used. To do that, every OS allows opening a file with different rules.
With O_DIRECT the kernel performs DMA directly from/to the physical memory pointed [to] by the userspace buffer passed as [a] parameter to the read/write syscalls. So there is no CPU and memory bandwidth spent in the copies between user space memory and kernel cache, and there is no CPU time spent in kernel in the management of the cache (like cache lookups, per-page locks, and so forth).
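As a rough illustration of what opening a file with O_DIRECT looks like from user space, the sketch below reads one block on Linux. It is not how vdbench is implemented. The function read_block is hypothetical, and the 4096-byte block size is an assumption chosen to match the 4KB transfers used in the tests. O_DIRECT requires an aligned buffer, and some file systems reject the flag altogether.

```python
import mmap
import os


def read_block(path, block_size=4096, direct=True):
    """Read the first block of `path`.

    With direct=True the file is opened with O_DIRECT (Linux only), which
    asks the kernel to bypass the page cache for this file descriptor.
    """
    flags = os.O_RDONLY
    if direct:
        flags |= os.O_DIRECT  # AttributeError on platforms without O_DIRECT
    fd = os.open(path, flags)
    try:
        # O_DIRECT needs an aligned buffer; an anonymous mmap is page-aligned.
        buf = mmap.mmap(-1, block_size)
        try:
            count = os.readv(fd, [buf])  # number of bytes actually read
            return bytes(buf[:count])
        finally:
            buf.close()
    finally:
        os.close(fd)
```

Calling read_block with direct=False performs the same read through the page cache, which makes it easy to compare the two modes on the same file.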
This surely works with a classic infrastructure, as illustrated in Figure 9.
Figure 9. A classic infrastructure
What happens instead with a virtualized infrastructure using SmartCloud, for example, as illustrated in Figure 10?
Figure 10. A virtualized infrastructure
During these tests, we learned that disabling cache support is a poor choice in almost every situation. It makes sense to disable the cache only for benchmarking purposes.
In our study the idea was to discover the behavior of a virtualized shared cloud infrastructure: If the cache is enabled, the performance can be similar to a private virtual infrastructure with exclusive use.
Within a cloud environment, there are many unpredictable variables influenced by the usage by other clients; this can affect the I/O performance in general.
In a shared architecture with the cache disabled, these negative effects are amplified, so it becomes very difficult to predict a response time or I/O rate. The more cache memory reserved for the virtual machine, the better and more stable the performance.
Starting from our preliminary assumption and results, it would be interesting to enlarge the benchmark and to perform multi-I/O-class tests over a single resource station VM. This extension can be done using vdbench.
It would be interesting to expand the benchmark to include different test sessions and compare the three different storage types that SmartCloud offers; in doing so we can create a guide for the SmartCloud user.
Learn
- See the developerWorks article "Understanding ephemeral storage".
Get products and technologies
- Download vdbench, the I/O workload generator.
- Download WinSCP, an SFTP client and FTP client for Windows.
Mariano Ammirabile has extensive experience in IT thanks to several technical and business development positions in IBM and Accenture in Italy, southern Europe, the United States, and France. He has expertise in IBM SmartCloud, with a focus on cloud public services.
Luca Maestri is a student in electronics engineering at University Politecnico of Milano, with a specialty in computer science. During his studies, he spent six months on scholarship at IBM where he focused on SmartCloud Enterprise. Currently he is completing his university studies and is an on-site promoter in a startup company in Italy.




