
Configuring SUSE Linux on POWER5 to maximize performance

Virtual SCSI and virtual Ethernet

Mike Skelton, Performance Analyst, IBM
Mike Skelton was a member of the IBM Systems and Technology Group's Linux Technology Center. He has been a performance analyst at IBM for over 20 years and has numerous patents and publications. He now works in the IBM Software Group.
Yong Cai (ycai@us.ibm.com), Performance Analyst, IBM
Yong Cai is a member of the IBM System and Technology Group. He has been a network, Java, and database performance analyst at IBM for over 8 years. Recently, he worked on virtualization and server consolidation.

Summary:  IBM POWER5™ and POWER5+™ systems provide excellent virtualization capabilities. Understand factors affecting virtualization performance of IBM POWER5 systems using SUSE Linux® Enterprise Server (SLES) 10. Learn how to use system tools that can help diagnose and solve performance problems. See examples of how to test for and improve performance.

Date:  26 Apr 2007
Level:  Intermediate

Introduction

IBM System p™ virtualization features supported by the SUSE Linux Enterprise Server (SLES) 10 operating system include virtual SCSI (VSCSI) and virtual LAN (VLAN). Both VSCSI and VLAN provide configuration and tuning parameters that can improve system performance. This article highlights some tuning recommendations and identifies measurement tool features that provide data for performance monitoring and diagnosis of virtualization performance problems. For information about initial setup and configuration see Resources.

Tuning virtual SCSI

The primary way to tune VSCSI is to select the appropriate I/O scheduler. You can choose the I/O scheduler for either or both the VSCSI server drive and the VSCSI client drive. The selected scheduler should be appropriate for the workload on the drive. The installation default is the anticipatory I/O scheduler. The best I/O scheduler for the VSCSI server drive is the noop scheduler.

The I/O scheduler choices are:

  • noop - FIFO queuing
  • anticipatory - anticipatory scheduling
  • deadline - deadline scheduling
  • cfq - completely fair queuing

To find out which I/O scheduler is being used, query the /sys file system. For example, on SLES 10, use the following:

cat /sys/block/<sd*>/queue/scheduler
                  ^use drive of interest

For example, using drive sda:
cat /sys/block/sda/queue/scheduler
[noop] anticipatory deadline cfq
      

In this example, [noop] is the scheduler being used; the brackets mark the active entry. You can change the I/O scheduler at run time by echoing the name of one of the listed schedulers into the same file. You can also set the scheduler at boot time by adding a line such as the following to the /etc/yaboot.conf file:

append = "elevator=noop"
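The run-time query and change described above can be sketched as a pair of shell functions. This is a minimal sketch, not a definitive tool: the function names active_scheduler and set_scheduler are hypothetical, and set_scheduler assumes root privileges and the sysfs layout shown in the example.

```shell
#!/bin/sh
# Print the active scheduler from one line of /sys/block/<dev>/queue/scheduler.
# The kernel marks the active entry with [brackets].
active_scheduler() {
    printf '%s\n' "$1" | sed -n 's/.*\[\([^]]*\)].*/\1/p'
}

# Hypothetical helper: switch drive $1 to scheduler $2 (needs root).
# It refuses names that the kernel does not list for that drive.
set_scheduler() {
    sched_file="/sys/block/$1/queue/scheduler"
    [ -w "$sched_file" ] || { echo "cannot write $sched_file" >&2; return 1; }
    case " $(tr -d '[]' < "$sched_file") " in
        *" $2 "*) echo "$2" > "$sched_file" ;;
        *) echo "$2 is not offered for $1" >&2; return 1 ;;
    esac
}
```

For example, active_scheduler "$(cat /sys/block/sda/queue/scheduler)" prints the current choice for sda, and set_scheduler sda noop switches sda to the noop scheduler until the next reboot.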

Using VSCSI tools

The generally available version of SLES 10 has a VSCSI functional bug. Get the latest kernel update for SLES 10 (2.6.16.21-025-ppc or newer) from the SUSE Linux portal (see Resources). This update includes the bug fix.

The iostat command is part of the sysstat package, available through the SUSE utility tool called yast.

  1. Start yast.
  2. Select software.
  3. Select software management.
  4. Search for sysstat.
  5. Highlight the sysstat package.
  6. Choose the accept action.
  7. Follow the instructions for which installation disk to load.

Running iostat -x provides a great deal of information about the read and write traffic to the physical and virtual devices. See Resources for the manual pages that define the iostat columns.

Diagnosing VSCSI problems

In order to diagnose a performance problem involving the virtual SCSI disk, you need to understand the VSCSI system configuration. You need to know how virtual devices map to physical hardware.

The sum of the activity of all disk partitions on a physical SCSI disk must be within the capability of the physical SCSI device. First, look at the demand on the physical device. Then, if the demand is too high, look at the demand from each of its partitions. Adjust the virtual-to-physical mapping as needed for the demand to be within the capability of the physical hardware. This adjustment can require changing the virtual-to-physical mapping or changing or adding hardware.

Consider an example in which the VSCSI server has a SCSI drive named sdc that is divided into three partitions (sdc1, sdc2, and sdc3), each of which is virtualized. Each virtualized partition appears as a separate drive to the client that uses it. The iostat tool shows the utilization of a drive, so on the clients it measures the usage of partitions sdc1, sdc2, and sdc3 as drives. The best approach is to run iostat first on the VSCSI server to show the total utilization of sdc, and then to run iostat on each client to get the utilization of the individual partitions.
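One way to pull the utilization figure out of the iostat output on the server and on each client is sketched below. The function name device_util is hypothetical, and the sketch assumes that %util is the last column of the iostat -x report, as it is in the listings in this article.

```shell
#!/bin/sh
# Print the %util value (last column) for device $1 from `iostat -x` text on stdin.
# If the input holds several reports, the value from the last report wins.
device_util() {
    awk -v dev="$1" '$1 == dev { util = $NF } END { if (util != "") print util }'
}
```

On the VSCSI server, iostat -x 5 2 | device_util sdc reports the utilization of the physical drive. The first iostat report covers the time since boot, so the sketch asks for two reports and keeps the second. Running the same pipeline on a client with that client's drive name gives the per-partition view.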


Figure 1. Virtual SCSI example

Understanding a measurement example

Take a look at a measurement example that shows why you need to understand how the physical drives map to the virtual disks. As shown in Figure 1, the VIO server boots from sda. The VIO server then virtualizes sdb and the three partitions on sdc. The VIO client boots from sda, then mounts the three virtualized partitions as sdb, sdc, and sdd.

The workload for this example is provided by the Flexible File System Benchmark (FFSB) (see Resources for where to find the tool), an open source tool that can easily be configured to provide a variety of read, write, sequential, and random patterns with additional threading options. For this example, FFSB is configured to evaluate the performance of the client disks sdb, sdc, and sdd. Large sequential reads are done on sdb. Small random reads are done on sdc. Random writes are done on sdd. The iostat tool measures the resulting behavior. The benchmarks are first run on each disk separately (Listings 1 through 3), then on all three drives concurrently (Listing 4).

Listing 1 shows the output of iostat for large sequential reads on sdb.


Listing 1. iostat output for large sequential reads
                
avg-cpu:  %user    %nice   %system %iowait      %steal    %idle
           0.30     0.00    6.35    93.30       0.00      0.00

Device:   rrqm/s   wrqm/s    r/s     w/s        rsec/s    wsec/s     rkB/s    
sda        0.00     0.45    0.10    0.10        0.80      4.80       0.40     
sdb        1.17     0.80  173.36    0.40    82818.59     11.19   41409.30     
sdc        0.00     0.00    0.00    0.00        0.00      0.00       0.00     
sdd        0.00     0.00    0.00    0.00        0.00      0.00       0.00     

Device:    wkB/s   avgrq-sz   avgqu-sz    await   svctm   %util
sda        2.40      28.00      0.00      5.00     5.00    0.10
sdb        5.60     476.68    102.99    592.19     5.75  100.00
sdc        0.00       0.00      0.00      0.00     0.00    0.00
sdd        0.00       0.00      0.00      0.00     0.00    0.00

Listing 2 shows the output of running iostat for small random reads on sdc.


Listing 2. iostat output for small random reads
                
avg-cpu:  %user    %nice   %system %iowait      %steal    %idle
           0.05     0.00    0.95    99.00       0.00      0.05

Device:   rrqm/s   wrqm/s    r/s     w/s        rsec/s    wsec/s     rkB/s    
sda        0.00     0.45    0.00    0.10        0.00      4.80       0.00     
sdb        0.00     0.00    0.00    0.00        0.00      0.00       0.00     
sdc        0.00     0.80   98.85    0.40     2390.80     11.19    1195.40     
sdd        0.00     0.00    0.00    0.00        0.00      0.00       0.00     
 
Device:    wkB/s   avgrq-sz   avgqu-sz    await   svctm   %util
sda        2.40      48.00      0.00      0.00     0.00    0.00
sdb        0.00       0.00      0.00      0.00     0.00    0.00
sdc        5.60       8.03     31.56    105.50     3.34  100.00
sdd        0.00       0.00      0.00      0.00     0.00    0.00

Finally, Listing 3 shows results from iostat for random writes on sdd.


Listing 3. iostat output for random writes
                
avg-cpu:  %user    %nice   %system %iowait      %steal    %idle
           0.00     0.00    1.00    98.95       0.00      0.00

Device:   rrqm/s   wrqm/s    r/s     w/s        rsec/s    wsec/s     rkB/s
sda        0.00     0.45    0.10    0.55        0.80      8.40       0.40     
sdb        0.00     0.00    0.00    0.00        0.00      0.00       0.00     
sdc        0.00     0.00    0.00    0.00        0.00      0.00       0.00     
sdd        0.00   27.69     0.00  455.97        0.00   3866.87       0.00  

Device:    wkB/s   avgrq-sz   avgqu-sz    await   svctm   %util
sda        4.20     14.15       0.01     15.38     4.62    0.30
sdb        0.00      0.00       0.00      0.00     0.00    0.00
sdc        0.00      0.00       0.00      0.00     0.00    0.00
sdd     1933.43      8.48     143.49    297.71     2.19   99.95


When all three benchmarks run concurrently, the client-side iostat output shows that each drive delivers less throughput than it did when its benchmark ran alone. For example, reads on sdb fall from about 41,409 rkB/s in Listing 1 to about 3,542 rkB/s. Listing 4 shows the concurrent iostat results for all three benchmarks.


Listing 4. Concurrent results for all three benchmarks
                

avg-cpu:  %user    %nice   %system %iowait      %steal    %idle
           0.05     0.00    1.60    98.30       0.00      0.00

Device:   rrqm/s   wrqm/s    r/s     w/s        rsec/s    wsec/s     rkB/s
sda        0.00     0.10    0.05    0.20        0.40      2.80       0.20     
sdb        1.10     0.40   14.84    0.25     7084.86      7.60    3542.43     
sdc        0.00     0.75   52.22    0.45      417.79     11.19     208.90     
sdd        0.00    14.89    0.00  312.29        0.00   2618.29       0.00

Device:    wkB/s   avgrq-sz   avgqu-sz    await   svctm   %util
sda        1.40     12.80       0.00      8.00     6.00    0.15
sdb        3.80    469.93      91.77   4818.38    66.26  100.00
sdc        5.60      8.14      30.72    641.45    18.98  100.00
sdd     1309.15      8.38     142.79    451.13     3.20  100.00
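To put the drop in perspective, the throughput columns of Listing 4 can be compared with those of Listings 1 through 3. The sketch below is only worked arithmetic on the published figures; the function name pct_retained is hypothetical.

```shell
#!/bin/sh
# Print concurrent throughput as a percentage of standalone throughput.
pct_retained() {
    awk -v alone="$1" -v shared="$2" 'BEGIN { printf "%.1f\n", 100 * shared / alone }'
}

pct_retained 41409.30 3542.43   # sdb rkB/s: Listing 1 vs. Listing 4
pct_retained 1195.40 208.90     # sdc rkB/s: Listing 2 vs. Listing 4
pct_retained 1933.43 1309.15    # sdd wkB/s: Listing 3 vs. Listing 4
```

With these inputs, the sequential reads on sdb keep roughly 9% of their standalone throughput, the random reads on sdc roughly 17%, and the writes on sdd roughly 68%.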

Now look at the utilization of the physical disk on the server. Recall from Figure 1 that the server's drive sdc has three partitions that the client uses as drives sdb, sdc, and sdd. The server-side iostat measurements in Listing 5 show that the physical disk drive sdc is 100% utilized.


Listing 5. Server-side iostat measurements showing physical disk drive utilization
                
avg-cpu:  %user    %nice   %system %iowait      %steal    %idle
           0.00     0.00    1.40     0.00       0.00     98.60

Device:   rrqm/s   wrqm/s    r/s     w/s        rsec/s    wsec/s     rkB/s
sda        0.00     0.00    0.00    0.10        0.00      0.80       0.00
sdb        0.00     0.00    0.00    0.00        0.00      0.00       0.00
sdc        0.00     0.00   61.97  322.04     6821.79   2704.65    3410.89

Device:    wkB/s   avgrq-sz   avgqu-sz    await   svctm   %util
sda        0.40      8.00       0.00     15.00    15.00    0.15
sdb        0.00      0.00       0.00      0.00     0.00    0.00
sdc     1352.32     24.81      27.50     70.54     2.60  100.00

This example demonstrates the impact of contention for a physical drive's resources on its throughput and response time. When throughput or response time degrades on VSCSI devices, look at the utilization of the physical device to see if contention might be the cause.
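A quick cross-check of the virtual-to-physical mapping is to add up the client-side figures from Listing 4 and set them next to the server-side figures for sdc from Listing 5. The following sketch does that with the published numbers; the function name sum_values is hypothetical.

```shell
#!/bin/sh
# Add up the numbers passed as arguments, for example one column across drives.
sum_values() {
    printf '%s\n' "$@" | awk '{ total += $1 } END { printf "%.2f\n", total }'
}

# Client drives sdb, sdc, sdd (Listing 4) against server drive sdc (Listing 5).
echo "r/s:   clients $(sum_values 14.84 52.22 0.00)  server 61.97"
echo "w/s:   clients $(sum_values 0.25 0.45 312.29)  server 322.04"
echo "rkB/s: clients $(sum_values 3542.43 208.90 0.00)  server 3410.89"
echo "wkB/s: clients $(sum_values 3.80 5.60 1309.15)  server 1352.32"
```

The client totals track the server figures only approximately. One plausible reason, stated here as an assumption, is that the two listings were captured over different measurement intervals. The totals are close enough to confirm that the three client drives together account for the load on the physical drive.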

Tuning the virtual LAN

Virtual LAN is a function of the POWER Hypervisor™ that enables secure communication between logical partitions without the need for a physical I/O adapter. When TCP/IP communication data flows between LPARs on VLAN, the TCP/IP tuning parameters affect the performance of the data flow. You can use a set of tuning parameters that work well in this environment without a physical I/O adapter. Listing 6 shows the tuning recommendation for VLAN performance.


Listing 6. Tuning recommendation for MTU1500
                
        /sbin/sysctl -w net.ipv4.tcp_timestamps=1
        /sbin/sysctl -w net.ipv4.tcp_sack=1
        /sbin/sysctl -w net.ipv4.tcp_window_scaling=1
        /sbin/sysctl -w net.core.netdev_max_backlog=3000
        /sbin/sysctl -w net.ipv4.tcp_wmem='4096 87380   30000000'
        /sbin/sysctl -w net.ipv4.tcp_rmem='4096 87380   30000000'
        # Port numbers are 16 bits wide, so the range cannot extend past 65535
        /sbin/sysctl -w net.ipv4.ip_local_port_range='8096 65535'
        /sbin/sysctl -w net.core.rmem_max=10485760
        /sbin/sysctl -w net.core.rmem_default=10485760
        /sbin/sysctl -w net.core.wmem_max=10485760
        /sbin/sysctl -w net.core.wmem_default=10485760
        /sbin/sysctl -w net.core.optmem_max=10000000
        echo 128 > /sys/class/net/eth0/weight
        echo 128 > /sys/class/net/eth1/weight
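Settings made with sysctl -w are lost at reboot, so it can be useful to confirm that a partition still carries the recommended values. The sketch below is one way to do that; the function names normalize and check_setting are hypothetical, and the sketch assumes that sysctl is installed in /sbin as in Listing 6.

```shell
#!/bin/sh
# Collapse tabs and repeated spaces so that values can be compared as strings.
normalize() {
    printf '%s\n' "$1" | tr -s '\t' ' ' | sed 's/^ //; s/ $//'
}

# Compare the live value of sysctl key $1 with the expected value $2.
check_setting() {
    current=$(/sbin/sysctl -n "$1" 2>/dev/null) || { echo "UNKNOWN $1"; return 2; }
    if [ "$(normalize "$current")" = "$(normalize "$2")" ]; then
        echo "OK      $1"
    else
        echo "DIFFERS $1 (current: $(normalize "$current"))"
        return 1
    fi
}
```

For example, check_setting net.ipv4.tcp_sack 1 and check_setting net.ipv4.tcp_wmem '4096 87380 30000000' each report whether the running kernel matches Listing 6. The normalization step is there because the kernel separates multi-value settings with tabs.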

Using the VLAN tool

The primary tool for network analysis with VLAN is netstat, which can display a large amount of information about the networking system. Two of the most useful outputs are interface information and network statistics. You can display the network interface information using netstat -i and the TCP/IP protocol statistics using netstat -s.
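As an illustration of reading the interface information, the following sketch scans netstat -i output for interfaces with nonzero error or drop counters. The function name iface_errors is hypothetical, and the sketch assumes that the header row starts with Iface and names the columns RX-ERR, RX-DRP, TX-ERR, and TX-DRP. If those names are absent, it prints nothing.

```shell
#!/bin/sh
# List interfaces whose error or drop counters are nonzero, reading
# `netstat -i` text on stdin. Columns are located by header name.
iface_errors() {
    awk '
        $1 == "Iface" {
            for (i = 1; i <= NF; i++) col[$i] = i
            ready = ("RX-ERR" in col) && ("RX-DRP" in col) && ("TX-ERR" in col) && ("TX-DRP" in col)
            next
        }
        ready && NF > 1 {
            bad = $(col["RX-ERR"]) + $(col["RX-DRP"]) + $(col["TX-ERR"]) + $(col["TX-DRP"])
            if (bad > 0) print $1, bad
        }'
}
```

Running netstat -i | iface_errors prints one line per affected interface with the combined count. Empty output means that no interface reported errors or drops.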

Diagnosing VLAN problems

One tool used to measure maximum TCP bandwidth is iperf (see Resources to go to the National Laboratory for Applied Network Research Web site).

For this example, iperf was used to check VLAN bandwidth on a four-processor POWER5 system with 2 GB of memory and simultaneous multithreading (SMT) turned on. The server and client partitions were each allocated 0.5 physical CPU. The reported throughput was only about 530 Mbits/sec. It should have been around 1000 Mbits/sec for the gigabit adapter used. Listing 7 shows iperf throughput and vmstat output. vmstat is a Linux real-time performance monitoring tool. It reports the CPU idle percentage in the column labeled id, which is next to last in this output. CPU utilization is calculated as 100% - CPU idle.


Listing 7. iperf throughput and vmstat output
                

[root@power] /iperf_202/iperf-2.0.2/src > ./iperf -c en0host2 -w 1024KB -N
------------------------------------------------------------
Client connecting to en0host2, TCP port 5001
TCP window size:   256 KByte (WARNING: requested 1.00 MByte)
------------------------------------------------------------
[  3] local 192.168.1.1 port 55990 connected with 192.168.1.2 port 5001
[  3]  0.0-10.0 sec    632 MBytes    530 Mbits/sec

vmstat output:

[root@power] /root > vmstat 1
procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id wa
 0  0      0 360088  75136 681724    0    0     1     3   13    37  1  6 93  0
 0  0      0 360088  75136 681724    0    0     0     0    6    18  0  0 100  0
 0  0      0 360088  75136 681724    0    0     0     0    9    12  0  0 100  0
 0  0      0 360088  75136 681724    0    0     0     0    6    10  0  0 100  0
 3  0      0 359684  75136 681724    0    0     0     0  358    96  0  2 98  0
 1  0      0 359808  75136 681724    0    0     0     0 14774  1464  0 63 37  0
 1  0      0 359684  75136 681724    0    0     0     8 13913  1452  0 64 36  0
 1  0      0 359808  75136 681724    0    0     0     0 14676  1359  1 65 35  0
 1  0      0 359544  75136 681724    0    0     0     8 14260  1598 12 67 20  0
 1  0      0 359668  75136 681724    0    0     0     0 12198  1882  0 62 38  0
 2  0      0 359544  75136 681724    0    0     0     0 13844  1435  1 63 37  0
 1  0      0 359544  75136 681724    0    0     0     0 14808  1372  0 64 37  0
 1  0      0 359668  75136 681724    0    0     0     0 13934  1454  0 62 37  0
 1  0      0 359700  75136 681724    0    0     0     0 11327  1886  0 64 35  0
 0  0      0 359576  75136 681724    0    0     0     0 14650  1343  0 60 40  0
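The utilization calculation (100% minus the idle percentage) can be applied to every vmstat sample with a short filter. This is a sketch under the assumption that the vmstat header row labels the idle column id; the function name vmstat_util is hypothetical. Locating the column by name avoids depending on its position.

```shell
#!/bin/sh
# Print CPU utilization (100 - id) for each sample in vmstat text on stdin.
vmstat_util() {
    awk '
        # Header row: remember which column is labeled "id".
        { for (i = 1; i <= NF; i++) if ($i == "id") { idcol = i; next } }
        # Sample rows start with a number (the r column).
        idcol && $1 ~ /^[0-9]+$/ { print 100 - $idcol }'
}
```

For example, vmstat 1 10 | vmstat_util prints one utilization figure per sample. Applied to the samples in Listing 7 taken while iperf was running, it yields values mostly between 60 and 65.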

The partition running the test was allocated only 0.5 physical CPU. While iperf runs, the id column in Listing 7 drops to mostly 35% to 40%, so vmstat reports CPU utilization of roughly 60% to 65%, which is very high for so small an allocation. This suggests that iperf is CPU bound. The system was reconfigured for one physical CPU for each server and client partition. Running the test again on the newly configured system gave the improved result shown in Listing 8.


Listing 8. Improved iperf test results
                

[root@power] /iperf_202/iperf-2.0.2/src > ./iperf -c en0host2 -w 1024KB -N
------------------------------------------------------------
Client connecting to en0host2, TCP port 5001
TCP window size:   256 KByte (WARNING: requested 1.00 MByte)
------------------------------------------------------------
[  3] local 192.168.1.1 port 39856 connected with 192.168.1.2 port 5001
[  3]  0.0-10.0 sec  1.22 GBytes  1.05 Gbits/sec

With the additional CPU power, the benchmark can drive the Ethernet link at full speed. With this knowledge, the system administrator can choose the appropriate CPU resource allocation.
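The bandwidth figures that iperf prints can be reproduced from the transfer size and the duration. The sketch below assumes that iperf counts MBytes in units of 2^20 bytes and Mbits in units of 10^6 bits, which is consistent with the numbers in Listings 7 and 8. The function name mbits_per_sec is hypothetical.

```shell
#!/bin/sh
# Convert a transfer of $1 MBytes (2^20 bytes each) over $2 seconds
# into Mbits/sec (10^6 bits each).
mbits_per_sec() {
    awk -v mib="$1" -v secs="$2" 'BEGIN { printf "%.0f\n", mib * 1048576 * 8 / secs / 1000000 }'
}

mbits_per_sec 632 10       # Listing 7: 632 MBytes in 10 seconds
mbits_per_sec 1249.28 10   # Listing 8: 1.22 GBytes = 1.22 * 1024 MBytes
```

The two calls print 530 and 1048, in line with the 530 Mbits/sec and 1.05 Gbits/sec that iperf reported.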

Conclusion

Understanding the virtual SCSI and virtual LAN features of IBM System p as supported by SLES 10 can help the administrator tune the system for better performance. This article showed that contention for a physical disk can degrade the throughput or response time of VSCSI devices. Similarly, a CPU constraint can limit the performance of VLAN. Both situations can be relieved by adding physical resources to back the virtual device.


Resources

Learn

Get products and technologies

Discuss

