This three-part series (see Resources) on the AIX® disk and I/O subsystem focuses on the challenges of optimizing disk I/O performance. While disk tuning is arguably less exciting than CPU or memory tuning, it is a crucial component in optimizing server performance. In fact, partly because disk I/O is your weakest subsystem link, there is more you can do to improve disk I/O performance than on any other subsystem.
The first and second installments of this series discussed the importance of architecting your systems, the impact that architecture can have on overall system performance, and a new I/O tuning tool, lvmo, which you can use to tune logical volumes. In this installment, you will examine how to tune your systems using the ioo command, which configures most I/O tuning parameters and displays the current or next-boot values for all of them. You will also learn how and when to use the filemon and fileplace tools. Because the enhanced journaled file system (JFS2) is the default file system in AIX, tuning your file systems and getting the best out of JFS2 are important parts of your tuning toolkit. You'll even examine some file system attributes, such as sequential and random access, which can affect performance.
This section discusses JFS2, file system performance, and specific performance improvements over JFS. As you know, there are two types of kernels in AIX: a 32-bit kernel and a 64-bit kernel. While they share some common libraries and most commands and utilities, it is important to understand their differences and how the kernels relate to overall performance tuning. JFS2 has been optimized for the 64-bit kernel, while JFS is optimized for the 32-bit kernel. Journaling file systems, while much more robust, have historically been associated with performance overhead. In a shop where performance rules (at the expense of availability), you would disable metadata logging to increase performance with JFS. With JFS2, you can also disable logging (in AIX 6.1 and higher) to help increase performance. You can disable logging when you mount the file system, which means that you don't need to change or reconfigure the file system itself; you just modify your mount options. For example, to disable logging on a file system, you would use the following: mount -o log=NULL /database.
Although JFS2 was optimized to improve the performance of metadata operations (that is, those normally handled by the logging framework), switching logging off can have a significant performance benefit for file systems with a high proportion of file changes and newly created or deleted files. For example, file systems on development systems may see an increase in performance. For databases where the files used are static, the performance improvement may be less significant.
You should also be careful when using compression. Although compression can save disk space (and disk reads and writes, since less data is physically read from or written to the disk), the overhead on systems with a heavy CPU load can actually slow performance down.
JFS2 uses a binary tree representation while performing inode searches, which is a much better method than the linear method used by JFS. Furthermore, you no longer need to assign inodes when creating file systems, because JFS2 allocates them dynamically (meaning you won't run out of them).
While concurrent I/O was covered in the first installment of the series, it's worth another mention here. Concurrent I/O allows multiple threads to read and write data concurrently to the same file. This is possible because of the way JFS2 is implemented with write-exclusive inode locks, which allow multiple users to read the same file simultaneously. Performance increases dramatically when multiple users read from the same data file. To turn concurrent I/O on, you just need to mount the file system with the appropriate flag (see Listing 1). We recommend that you consider concurrent I/O when using databases such as Oracle.
Listing 1. Turning on concurrent I/O
root@lpar29p682e_pub[/] > mount -o cio /test
root@lpar29p682e_pub[/] > df -k /test
Filesystem 1024-blocks Free %Used Iused %Iused Mounted on
/dev/fslv00 131072 130724 1% 4 1% /test
Table 1 illustrates the various enhancements of JFS2 and how they relate to systems performance. It's also important to understand that when tuning your I/O systems, many of the tunables themselves (you'll get into that later) differ, depending on whether you are using JFS or JFS2.
Table 1. Enhancements of JFS2
| Function | JFS | JFS2 |
|---|---|---|
| Compression | Yes | No |
| Quotas | Yes | Yes |
| Deferred update | Yes | No |
| Direct I/O support | Yes | Yes |
| Optimization | 32-bit | 64-bit |
| Max file system size | 1 terabyte | 4 petabytes |
| Max file size | 64 gigabytes | 4 petabytes |
| Number of inodes | Fixed when creating f/s | Dynamic |
| Large file support | As mount option | Default |
| On-line defragmentation | Yes | Yes |
| Namefs | Yes | Yes |
| DMAPI | No | Yes |
This section introduces two important I/O tools, filemon and fileplace, and discusses how you can use them during systems administration each day.
Filemon uses a trace facility to report on the I/O activity of physical and logical storage, including your actual files. The I/O activity monitored is based on the time interval that is specified when running the trace. It reports on all layers of file system utilization, including the Logical Volume Manager (LVM), virtual memory, and physical disk layers. Without any flags, it runs in the background while application programs or system commands are being run and monitored. The trace starts automatically and continues until it is stopped with the trcstop command. At that time, the command generates an I/O activity report and exits. It can also process a trace file that has been recorded by the trace facility and generate reports from that file. Because reports written to standard output usually scroll past your screen, it's recommended that you use the -o option to write the output to a file (see Listing 2).
Listing 2. Using filemon with the -o option
l488pp065_pub[/] > filemon -o dbmon.out -O all
Run trcstop command to signal end of trace.
Thu Aug 12 09:07:06 2010
System: AIX 7.1 Node: l488pp065_pub Machine: 00F604884C00
l488pp065_pub[/] > trcstop
l488pp065_pub[/] > cat dbmon.out
Thu Aug 12 09:10:09 2010
System: AIX 7.1 Node: l488pp065_pub Machine: 00F604884C00
Cpu utilization: 72.8%
Cpu allocation: 100.0%
21947755 events were lost. Reported data may have inconsistencies or errors.
Most Active Files
------------------------------------------------------------------------
#MBs #opns #rds #wrs file volume:inode
------------------------------------------------------------------------
0.4 1 101 0 unix /dev/hd2:82241
0.0 9 10 0 vfs /dev/hd4:9641
0.0 4 6 1 db.sql
0.0 3 6 2 ksh.cat /dev/hd2:111192
0.0 1 2 0 cmdtrace.cat /dev/hd2:110757
0.0 45 1 0 null
0.0 1 1 0 dd.cat /dev/hd2:110827
0.0 9 2 0 SWservAt /dev/hd4:9156
0.0 1 0 3 db2.sql
0.0 9 2 0 SWservAt.vc /dev/hd4:9157
Most Active Segments
------------------------------------------------------------------------
#MBs #rpgs #wpgs segid segtype volume:inode
------------------------------------------------------------------------
0.1 2 13 8359ba client
Most Active Logical Volumes
------------------------------------------------------------------------
util #rblk #wblk KB/s volume description
------------------------------------------------------------------------
0.04 0 32 0.3 /dev/hd9var /var
0.00 0 48 0.5 /dev/hd8 jfs2log
0.00 0 8 0.1 /dev/hd4 /
Most Active Physical Volumes
------------------------------------------------------------------------
util #rblk #wblk KB/s volume description
------------------------------------------------------------------------
0.00 0 72 0.7 /dev/hdisk0 N/A
Most Active Files Process-Wise
------------------------------------------------------------------------
#MBs #opns #rds #wrs file PID(Process:TID)
------------------------------------------------------------------------
0.0 3 6 0 db.sql 7667828(ksh:9437345)
0.0 1 2 0 ksh.cat 7667828(ksh:9437345)
0.0 1 0 3 db2.sql 7667828(ksh:9437345)
0.0 1 0 1 db.sql 7733344(ksh:7405633)
0.4 1 101 0 unix 7667830(ksh:9437347)
0.0 1 2 0 cmdtrace.cat 7667830(ksh:9437347)
0.0 1 2 0 ksh.cat 7667830(ksh:9437347)
0.0 9 2 0 SWservAt 7667830(ksh:9437347)
0.0 9 2 0 SWservAt.vc 7667830(ksh:9437347)
0.0 1 0 0 systrctl 7667830(ksh:9437347)
0.0 44 0 44 null 4325546(slp_srvreg:8585241)
0.0 1 2 2 ksh.cat 7667826(ksh:23527615)
0.0 1 1 0 dd.cat 7667826(ksh:23527615)
0.0 1 1 0 null 7667826(ksh:23527615)
0.0 1 0 0 test 7667826(ksh:23527615)
0.0 8 8 0 vfs 3473482(topasrec:13566119)
0.0 1 0 0 CuAt.vc 3473482(topasrec:13566119)
0.0 1 0 0 CuAt 3473482(topasrec:13566119)
0.0 1 2 0 vfs 2097252(syncd:2490503)
0.0 1 0 0 installable 4260046(java:15073489)
Most Active Files Thread-Wise
------------------------------------------------------------------------
#MBs #opns #rds #wrs file TID(Process:PID)
------------------------------------------------------------------------
0.0 3 6 0 db.sql 9437345(ksh:7667828)
0.0 1 2 0 ksh.cat 9437345(ksh:7667828)
0.0 1 0 3 db2.sql 9437345(ksh:7667828)
0.0 1 0 1 db.sql 7405633(ksh:7733344)
0.4 1 101 0 unix 9437347(ksh:7667830)
0.0 1 2 0 cmdtrace.cat 9437347(ksh:7667830)
0.0 1 2 0 ksh.cat 9437347(ksh:7667830)
0.0 9 2 0 SWservAt 9437347(ksh:7667830)
0.0 9 2 0 SWservAt.vc 9437347(ksh:7667830)
0.0 1 0 0 systrctl 9437347(ksh:7667830)
0.0 44 0 44 null 8585241(slp_srvreg:4325546)
0.0 1 2 2 ksh.cat 23527615(ksh:7667826)
0.0 1 1 0 dd.cat 23527615(ksh:7667826)
0.0 1 1 0 null 23527615(ksh:7667826)
0.0 1 0 0 test 23527615(ksh:7667826)
0.0 8 8 0 vfs 13566119(topasrec:3473482)
0.0 1 0 0 CuAt.vc 13566119(topasrec:3473482)
0.0 1 0 0 CuAt 13566119(topasrec:3473482)
0.0 1 2 0 vfs 2490503(syncd:2097252)
0.0 1 0 0 installable 15073489(java:4260046)
dbmon.out: END
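The report in Listing 2 carries a warning that more than 21 million trace events were lost, which means the figures that follow it may be incomplete. Before drawing conclusions from a saved report, it can be worth checking for that warning. The following is a minimal sketch, not part of the original listing: report_is_complete is a hypothetical helper, the sample text is copied from Listing 2, and the only thing it checks is the warning string.

```shell
#!/bin/sh
# Sketch: flag a filemon report whose trace data is incomplete.
# report_is_complete is a hypothetical helper; it only looks for the
# "events were lost" warning that appears in Listing 2.

report_is_complete() {
    # Reads a report on stdin; succeeds only if no lost-event warning is found.
    ! grep -q 'events were lost'
}

# A few lines copied from Listing 2 stand in for a real report file.
sample='Cpu utilization: 72.8%
21947755 events were lost. Reported data may have inconsistencies or errors.
Most Active Files'

if printf '%s\n' "$sample" | report_is_complete; then
    echo "report looks complete"
else
    echo "events were lost: treat the figures with caution"
fi
```

On a real system you would feed it the saved report, for example report_is_complete < dbmon.out. If events were lost, the filemon documentation describes a -T option for enlarging the trace buffer; verify the details on your AIX level before relying on it.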
Look for long seek times, as they can result in decreased application performance. By looking at the read and write sequence counts in detail, you can further determine if the access is sequential or random. This helps you when it is time to do your I/O tuning. This output clearly illustrates that there is no I/O bottleneck visible. Filemon provides a tremendous amount of information and, truthfully, we've found there is too much information at times. Further, there can be a performance hit using filemon, depending on how much general file activity there is while filemon is running. Let's look at the topas results while running filemon (see Figure 1).
Figure 1. topas results while running filemon
In the figure above, filemon is taking up almost 60 percent of the CPU! This is actually less than in previous AIX versions but still a significant impact on your overall system performance. We don't typically like to recommend performance tools that have such a substantial overhead, so we'll reiterate that while filemon certainly has a purpose, you need to be very careful when using it.
What about fileplace? Fileplace reports the placement of a file's blocks within a file system. It is commonly used to examine and assess the efficiency of a file's placement on disk. For what purposes do you use it? One reason would be to help determine if some of the heavily utilized files are substantially fragmented. It can also help you determine the physical volume with the highest utilization and whether or not the drive or I/O adapter is causing the bottleneck.
Let's look at an example of a frequently accessed file in Listing 3.
Listing 3. Frequently accessed file
fileplace -pv /tmp/logfile
File: /tmp/logfile Size: 63801540 bytes Vol: /dev/hd3
Blk Size: 4096 Frag Size: 4096 Nfrags: 15604
Inode: 7 Mode: -rw-rw-rw- Owner: root Group: system
Physical Addresses (mirror copy 1) Logical Extent
---------------------------------- ----------------
02884352-02884511 hdisk0 160 frags 655360 Bytes, 1.0% 00000224-00000383
02884544-02899987 hdisk0 15444 frags 63258624 Bytes, 99.0% 00000416-00015859
unallocated -27 frags -110592 Bytes 0.0%
15604 frags over space of 15636 frags: space efficiency = 99.8%
2 extents out of 15604 possible: sequentiality = 100.0%
You should be interested in space efficiency and sequentiality here. Higher space efficiency means files are less fragmented and provide better sequential file access. A higher sequentiality tells you that the files are more contiguously allocated, which is also better for sequential file access. In the case here, both values are high: space efficiency is 99.8 percent and sequentiality is 100 percent. If space efficiency and sequentiality are too low, you might want to consider file system reorganization. You would do this with the reorgvg command, which can improve logical volume utilization and efficiency. You may also want to consider using the defragfs command, which can help ensure that the free space on your file system is contiguous, which helps with future writes and file creates. Defragmentation can occur in the background while you are using your system.
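The two summary figures can be recomputed by hand, which helps when interpreting them. The sketch below assumes the formulas given in the AIX performance documentation: space efficiency is the number of fragments divided by the span of physical addresses they cover, and sequentiality is (fragments - extents + 1) divided by fragments. The function names are hypothetical helpers, and the input numbers are taken from Listing 3.

```shell
#!/bin/sh
# Sketch: recompute the two summary figures that fileplace -pv prints.
# The formulas are an assumption taken from the AIX performance
# documentation; the function names are hypothetical helpers.

# space_efficiency NFRAGS FIRST_ADDR LAST_ADDR
# Fragments used, divided by the span of physical addresses they occupy.
space_efficiency() {
    awk -v n="$1" -v lo="$2" -v hi="$3" \
        'BEGIN { printf "%.1f\n", n / (hi - lo + 1) * 100 }'
}

# sequentiality NFRAGS NEXTENTS
# Share of fragments that directly follow their predecessor on disk.
sequentiality() {
    awk -v n="$1" -v e="$2" \
        'BEGIN { printf "%.1f\n", (n - e + 1) / n * 100 }'
}

# Values taken from Listing 3 (/tmp/logfile): 15604 fragments in 2 extents,
# spanning physical addresses 02884352 through 02899987.
echo "space efficiency: $(space_efficiency 15604 2884352 2899987)%"
echo "sequentiality:    $(sequentiality 15604 2)%"
```

With the Listing 3 values this gives 99.8 and 100.0, matching the last two lines of the fileplace output.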
This section discusses the use of the ioo command, which is used for virtually all I/O-related tuning parameters. As with vmo, you need to be extremely careful when changing ioo parameters, as changing parameters on the fly can cause severe performance degradation. Table 2 details specific tuning parameters that you often use for JFS and Enhanced JFS file systems. As you can see, the majority of the tuning commands for I/O utilize the ioo utility.
Table 2. Specific tuning parameters
| Function | JFS tuning parameter | Enhanced JFS tuning parameter |
|---|---|---|
| Sets max amount of memory for caching files | vmo -o maxperm=value | vmo -o maxclient=value (< or = maxperm) |
| Sets min amount of memory for caching | vmo -o minperm=value | n/a |
| Sets a limit (hard) on memory for caching | vmo -o strict_maxperm=value | vmo -o maxclient=value (hard limit) |
| Sets max pages used for sequential read ahead | ioo -o maxpgahead=value | ioo -o j2_maxPageReadAhead=value |
| Sets min pages used for sequential read ahead | ioo -o minpgahead=value | ioo -o j2_minPageReadAhead=value |
| Sets max number of pending write I/Os to a file | chdev -l sys0 -a maxpout=value | chdev -l sys0 -a maxpout=value |
| Sets min number of pending write I/Os to a file at which programs blocked by maxpout might proceed | chdev -l sys0 -a minpout=value | chdev -l sys0 -a minpout=value |
| Sets the amount of modified data cache for a file with random writes | ioo -o maxrandwrt=value | ioo -o j2_maxRandomWrite=value and ioo -o j2_nRandomCluster=value |
| Controls gathering of I/Os for sequential write behind | ioo -o numclust=value | ioo -o j2_nPagesPerWriteBehindCluster=value |
| Sets the number of f/s bufstructs | ioo -o numfsbufs=value | ioo -o j2_nBufferPerPagerDevice=value |
Let's further discuss some of the more important parameters below, as we've already discussed all the vmo tuning parameters in the memory tuning series (see Resources).
There are several ways you can determine the existing ioo values on your system. The long display listing for ioo clearly gives you the most information (see Listing 4). It lists the current value, default, reboot value, range, unit, type, and dependencies of all tunable parameters managed by ioo.
Listing 4. Display for ioo
root@lpar29p682e_pub[/] > ioo -L
NAME CUR DEF BOOT MIN MAX UNIT TYPE
DEPENDENCIES
j2_atimeUpdateSymlink 0 0 0 0 1 boolean D
j2_dynamicBufferPreallo 16 16 16 0 256 16K slabs D
j2_inodeCacheSize 400 400 400 1 1000 D
j2_maxPageReadAhead 128 128 128 0 64K 4KB pages D
j2_maxRandomWrite 0 0 0 0 64K 4KB pages D
j2_maxUsableMaxTransfer 512 512 512 1 4K pages M
j2_metadataCacheSize 400 400 400 1 1000 D
j2_minPageReadAhead 2 2 2 0 64K 4KB pages D
j2_nBufferPerPagerDevice 512 512 512 512 256K M
j2_nPagesPerWriteBehindC 32 32 32 0 64K D
j2_nRandomCluster 0 0 0 0 64K 16KB clusters D
j2_nonFatalCrashesSystem 0 0 0 0 1 boolean D
j2_syncModifiedMapped 1 1 1 0 1 boolean D
j2_syncdLogSyncInterval 1 1 1 0 4K iterations D
jfs_clread_enabled 0 0 0 0 1 boolean D
jfs_use_read_lock 1 1 1 0 1 boolean D
lvm_bufcnt 9 9 9 1 64 128KB/buffer D
maxpgahead 8 8 8 0 4K 4KB pages D
     minpgahead
maxrandwrt 0 0 0 0 512K 4KB pages D
memory_frames 512K 512K 4KB pages S
minpgahead 2 2 2 0 4K 4KB pages D
     maxpgahead
numclust 1 1 1 0 2G-1 16KB/cluster D
numfsbufs 196 196 196 1 2G-1 M
pd_npages 64K 64K 64K 1 512K 4KB pages D
pgahd_scale_thresh 0 0 0 0 419430 4KB pages D
pv_min_pbuf 512 512 512 512 2G-1 D
sync_release_ilock 0 0 0 0 1 boolean D
n/a means parameter not supported by the current platform or kernel
Parameter types:
S = Static: cannot be changed
D = Dynamic: can be freely changed
B = Bosboot: can only be changed using bosboot and reboot
R = Reboot: can only be changed during reboot
C = Connect: changes are only effective for future socket connections
M = Mount: changes are only effective for future mountings
I = Incremental: can only be incremented
d = deprecated: deprecated and cannot be changed
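When reading a long ioo -L listing, the TYPE column is worth filtering on, because it tells you whether a change applies immediately (D) or only to file systems mounted afterwards (M). The following sketch is one way to do that with awk. It assumes each tunable sits on one line with the type letter as the last field, as in Listing 4; tunables_of_type is a hypothetical helper, and the sample rows are copied from the listing rather than read from a live system.

```shell
#!/bin/sh
# Sketch: list tunables of a given TYPE from saved `ioo -L` style output.
# Assumes one tunable per line with the TYPE letter as the last field,
# as in Listing 4; tunables_of_type is a hypothetical helper.

tunables_of_type() {
    # $1 = type letter (for example M or D); input is read from stdin.
    # Rows with fewer than six fields (headers, dependency lines) are skipped.
    awk -v t="$1" 'NF >= 6 && $NF == t { print $1 }'
}

# A few rows copied from Listing 4 stand in for real command output.
sample='j2_maxPageReadAhead 128 128 128 0 64K 4KB pages D
j2_maxUsableMaxTransfer 512 512 512 1 4K pages M
j2_nBufferPerPagerDevice 512 512 512 512 256K M
numfsbufs 196 196 196 1 2G-1 M
lvm_bufcnt 9 9 9 1 64 128KB/buffer D'

# Mount-type tunables only take effect for file systems mounted afterwards.
printf '%s\n' "$sample" | tunables_of_type M
```

On an AIX system, the same filter could be applied to live output with ioo -L | tunables_of_type M.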
Listing 5 below shows you how to change a tunable.
Listing 5. Changing a tunable
root@lpar29p682e_pub[/] > ioo -o maxpgahead=32
Setting maxpgahead to 32
root@lpar29p682e_pub[/] >
This parameter is used for JFS only. For JFS2, there are additional file system performance enhancements, including sequential page read ahead and sequential and random write behind. The Virtual Memory Manager (VMM) of AIX anticipates page requirements by observing the patterns in which files are accessed. When a program accesses two successive pages of a file, VMM assumes that the program will keep accessing the file sequentially. The number of pages to be read ahead can be configured using VMM thresholds. With JFS2, make note of these two important parameters:
- j2_minPageReadAhead: Determines the number of pages read ahead when VMM initially detects a sequential pattern.
- j2_maxPageReadAhead: Determines the maximum number of pages that VMM can read ahead in a sequential file.
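To get a feel for how the two read-ahead thresholds interact, the sketch below prints the read-ahead window on successive sequential accesses. It rests on an assumption that is not stated in this article: that the window starts at the minimum and doubles on each further sequential access until it reaches the maximum, as the AIX documentation describes for JFS read ahead. readahead_steps is a hypothetical helper, and the values passed in are the defaults shown in Listing 4.

```shell
#!/bin/sh
# Sketch: print how many pages would be read ahead on successive sequential
# accesses, ASSUMING the window starts at the minimum and doubles until it
# reaches the maximum. readahead_steps is a hypothetical helper.

readahead_steps() {
    min=$1
    max=$2
    # A minimum of 0 means read ahead is effectively off; avoid looping on 0.
    if [ "$min" -le 0 ]; then
        echo 0
        return
    fi
    pages=$min
    steps=""
    while [ "$pages" -lt "$max" ]; do
        steps="$steps$pages "
        pages=$(( pages * 2 ))
    done
    # Cap the final step at the configured maximum.
    echo "$steps$max"
}

# Defaults from Listing 4: j2_minPageReadAhead=2, j2_maxPageReadAhead=128.
readahead_steps 2 128
```

Under that assumption the defaults produce the sequence 2 4 8 16 32 64 128, so a larger maximum mainly benefits long sequential scans that stay sequential for many accesses.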
Sequential and random write behind relate to writing modified pages in memory to disk after a certain threshold is reached, rather than waiting for syncd to flush the pages to disk. The reason for this is to limit the number of dirty pages in memory, which further reduces I/O overhead and disk fragmentation. The two types of write behind are sequential and random. With sequential write behind, pages do not stay in memory until the syncd daemon runs, because waiting for syncd can cause real bottlenecks. With random write behind, once the number of pages in memory exceeds a specified amount, all subsequent pages are written to disk.
For sequential write behind, you specify the number of pages to be scheduled for writing with the j2_nPagesPerWriteBehindCluster parameter. By default the value is 32 (that is, 128KB). For modern disks and high-write environments, such as databases, you may want to increase this parameter so that more data is written in a single block when the data needs to be synced to disk.
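The arithmetic behind the 128KB figure is simply pages multiplied by the 4KB page size. The short sketch below works it through for the default and two larger candidate values; pages_to_kb is a hypothetical helper, and the candidate values 64 and 128 are illustrations rather than recommendations.

```shell
#!/bin/sh
# Sketch: convert a write-behind cluster size from 4KB pages to kilobytes.
# pages_to_kb is a hypothetical helper; 4KB is the page size used by the
# tunables in Listing 4.

PAGE_KB=4

pages_to_kb() {
    echo $(( $1 * PAGE_KB ))
}

# Default (32) and two illustrative larger values for
# j2_nPagesPerWriteBehindCluster.
for pages in 32 64 128; do
    echo "$pages pages = $(pages_to_kb "$pages")KB per write-behind cluster"
done
```

Once you have settled on a value, it would be applied in the form shown in Table 2, for example ioo -o j2_nPagesPerWriteBehindCluster=64.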
Random write behind can be configured by changing the values of j2_nRandomCluster and j2_maxRandomWrite.
The j2_maxRandomWrite parameter specifies the number of pages of a file that are allowed to stay in memory. The default is 0 (meaning that information is written out as quickly as possible), which helps ensure data integrity. If you are willing to sacrifice some integrity in the event of a system failure in exchange for better write performance, you can increase this value. Doing so keeps more modified pages in cache, so data may not have been written to disk if the system fails. The j2_nRandomCluster parameter defines how many clusters apart two writes must be to be considered random. Increasing this value can lower the write frequency if you have a high number of files being modified at the same time.
Another important area worth mentioning is large sequential I/O processing. When there is too much simultaneous I/O to your file systems, the I/O can bottleneck at the file system level. In this case, you should increase the j2_nBufferPerPagerDevice parameter (numfsbufs with JFS). If you use raw I/O as opposed to file systems, this same type of bottleneck can occur through LVM. Here is where you might want to tune the lvm_bufcnt parameter.
This article focused on file system performance. You examined the enhancements in JFS2 and why it would be the preferred file system. Further, you used tools, such as filemon and fileplace, to gather more detailed information about the actual file structures and how they relate to I/O performance. Finally, you tuned your I/O subsystem by using the ioo command. You learned about the j2_minPageReadAhead and j2_maxPageReadAhead parameters in an effort to increase performance when encountering sequential I/O.
During this three-part series on I/O you learned that, perhaps more so than any other subsystem, your tuning must start prior to stress testing your systems. Architecting the systems properly can do more to increase performance than anything you can do with tuning I/O parameters. This includes strategic disk placement and making sure you have enough adapters to handle the throughput of your disks. Further, while this series focused on I/O, understand that the VMM is also very tightly linked with I/O performance and must also be tuned to receive optimum I/O performance.
Learn
- Improving database performance with AIX concurrent I/O: Read this white paper for more information on how to improve database performance.
- IBM Redbooks: Database Performance Tuning on AIX is designed to help system designers, system administrators, and database administrators design, size, implement, maintain, monitor, and tune a Relational Database Management System (RDBMS) for optimal performance on AIX.
- "Power to the people: A history of chip making at IBM" (developerWorks, December 2005): This article covers the IBM power architecture.
- "Processor affinity on AIX" (developerWorks, November 2006): Using process affinity settings to bind or unbind threads can help you find the root cause of troublesome hang or deadlock problems. Read this article to learn how to use processor affinity to restrict a process and run it only on a specified central processing unit (CPU).
- "CPU Monitoring and Tuning" (developerWorks, March 2002): Learn how standard AIX tools can help you determine CPU bottlenecks.
- The AIX 7.1 Information Center is your source for technical information about the AIX operating system.
- The IBM AIX Version 6.1 Differences Guide can be a useful resource for understanding changes in AIX 6.1.
- Operating System and Device Management: This document from IBM provides users and system administrators with complete information that can affect your selection of options when performing such tasks as backing up and restoring the system, managing physical and logical storage, and sizing appropriate paging space.
- IBM Redbooks: For help in obtaining IBM certification for AIX 5L and the eServer® pSeries®, read IBM Certification Study Guide for eServer p5 and pSeries Administration and Support for AIX 5L Version 5.3.
- AIX and UNIX: The AIX and UNIX developerWorks zone provides a wealth of information relating to all aspects of AIX systems administration and expanding your UNIX skills.
- New to AIX and UNIX?: Visit the New to AIX and UNIX page to learn more about AIX and UNIX.
- Safari bookstore: Visit this e-reference library to find specific technical resources.
- developerWorks technical events and webcasts: Stay current with developerWorks technical events and webcasts.
- Podcasts: Tune in and catch up with IBM technical experts.
- Future Tech: Visit Future Tech's site to learn more about their latest offerings.
Get products and technologies
- Download the nmon analyzer.
- IBM trial software: Build your next development project with software for download directly from developerWorks.
Discuss
- Follow developerWorks on Twitter.
- Get involved in the My developerWorks community.
- Participate in the AIX and UNIX® forums:
  - AIX Forum
  - AIX Forum for developers
  - Cluster Systems Management
  - Performance Tools Forum
  - Virtualization Forum
  - More AIX and UNIX Forums
- Participate in the developerWorks blogs and get involved in the developerWorks community.
- AIX 7 Open Beta: This forum is for technical discussions supporting the AIX 7 Open Beta Program.
Martin Brown has been a professional writer for over eight years. He is the author of numerous books and articles across a range of topics. His expertise spans myriad development languages and platforms - Perl, Python, Java, JavaScript, Basic, Pascal, Modula-2, C, C++, Rebol, Gawk, Shellscript, Windows, Solaris, Linux, BeOS, Mac OS/X and more - as well as web programming, systems management and integration. Martin is a regular contributor to ServerWatch.com, LinuxToday.com, IBM developerWorks and a regular blogger at Computerworld, The Apple Blog, and other sites. He is also a Subject Matter Expert (SME) for Microsoft. He can be contacted through his website at http://www.mcslp.com.
Ken Milberg is a technology writer and site expert for techtarget.com and provides Linux technical information and support at searchopensource.com. He is also a writer and technical editor for IBM Systems Magazine, Open Edition. Ken holds a bachelor's degree in computer and information science and a master's degree in technology management from the University of Maryland. He is the founder and group leader of the NY Metro POWER-AIX/Linux Users Group. Through the years, he has worked for both large and small organizations and has held diverse positions from CIO to Senior AIX Engineer. Today, he works for Future Tech, a Long Island-based IBM Business Partner. Ken is a PMI certified Project Management Professional (PMP), an IBM Certified Advanced Technical Expert (CATE, IBM System p5 2006), and a Solaris Certified Network Administrator (SCNA). You can contact him at kmilberg@gmail.com.



