Performance Monitoring Tips and Techniques
These are the personal notes of Nigel Griffiths
From my personal experience, (your mileage may vary):
Perfpmr is the official AIX Support performance data gathering tool. If you report a performance problem then, after the usual checks on your software levels, AIX Support will ask you to use this tool to gather loads of information that you must return as soon as possible. It is actually a set of shell scripts that use standard tools to gather information and write report summaries. This means it is very useful for you too, and why reinvent the wheel when the IBM Austin experts are maintaining this excellent tool for you? Get the latest perfpmr version
It comes with an excellent README file with all the details on running it, but briefly: find at least 100 Mbytes of free disk space and, as the root user, untar the file and run the master perfpmr.sh shell script. This needs to run during the problem (of course) and I suggest for 10 minutes = 600 seconds. For example, ./perfpmr.sh 600 It actually takes longer than this, as there are several sections and the final phase gathers a lot of configuration details, which takes a long time on a machine with lots of disks. Once finished, take a look at the summary (.sum) files as these are readable. You may be able to sort out the performance problem yourself from this data alone.
But perfpmr is more than this because:
- it has precisely documented your system configuration at the hardware and AIX levels - this saves you doing this.
- you should take regular perfpmr data of the system running happily on a normal workload day. Then, when there is a problem, you or the AIX support team can take a look at the differences between the good and bad performance captures and this will make diagnosis ten times simpler.
- you should take perfpmr data of the system running happily before and after any minor or major changes to the system. For example software upgrades, AIX upgrades, changes to disk subsystems and adding software. Again, you will be able to determine what changed - for good or bad - ten times faster with before and after data captures.
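The perfpmr run described above can be wrapped in a small pre-flight check. This is only a sketch: the helper names has_enough_space and perfpmr_command are made up for illustration, 1 Mbyte is assumed to mean 1024 * 1024 bytes, and the script only builds the command line rather than running it.

```python
import shutil

REQUIRED_MB = 100  # free space suggested in the notes above


def has_enough_space(free_bytes, required_mb=REQUIRED_MB):
    """True if free_bytes covers required_mb (assuming 1 MB = 1024 * 1024 bytes)."""
    return free_bytes >= required_mb * 1024 * 1024


def perfpmr_command(seconds=600):
    """Build the perfpmr.sh command line for a capture of the given length."""
    return ["./perfpmr.sh", str(seconds)]


if __name__ == "__main__":
    # Check the current directory, assumed to be where perfpmr was untarred.
    free = shutil.disk_usage(".").free
    if has_enough_space(free):
        print("OK to run as root:", " ".join(perfpmr_command(600)))
    else:
        print("Less than", REQUIRED_MB, "MB free here - pick another filesystem")
```

Keeping the space check separate from the command builder makes it easy to reuse the same check for regular "good day" captures.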
My favorite Desktop tools of the job
I have a Windows-based Thinkpad and AIX servers and this is what I use every day:
- Virtual Network Computing VNC from UltraVNC - so I can have X Windows at zero cost and it stays running for tomorrow.
- Putty - so I don't use the horrid Windows telnet and terminal emulation is flawless.
- gVIM - on AIX (for colourising code) and Windows so I have a powerful editor and graphics too.
- Firefox on Windows and Mozilla on AIX - for browsing.
- Filezilla on Windows as a graphical FTP client between Windows and AIX.
- Keepass2 to save all my hundreds of passwords in a friendly cut'n'paste tool with an encrypted database.
And all of the above are freely available.
- nmon - well, I did write it
- nmon analyser to graph the data to work out what the machine is doing longer term
- lsconf - to document the machine
- vmstat and iostat for a quick look but the output can drive you mad after ten minutes
- On vmstat, consider using the -Iwt flags.
- On iostat consider using the -alDRT flags. If hdisks are accessed using AIX MPIO, then -a will cause each hdisk to be listed multiple times (once per path) in each interval.
- The -m flag can be considered instead of -a for AIX MPIO, but -m causes voluminous output to be generated, as well.
- filemon - to find out the busy filesystems and files (see How to use AIX V5.3 filemon to determine where I/O requests originate for more info)
- fileplace - to check for scrambled files (see How to use AIX V5.3 fileplace to determine the location on disk of a given file block for more info)
- rmss - using the old trick to see if memory is really used (see the AIX V5.3 memory use article for more info)
- lparstat to check up on shared processor LPARs
- perfpmr and snap for reporting performance issues
- lvmstat to check up on high use disks
- pGraph to display graphs of the performance statistics files automatically recorded in /etc/perf/daily in recent levels of AIX (see How to display performance statistics recorded in /etc/perf/daily for more info)
Performance Bottleneck Definitions
- CPU bound when %user + %sys greater than 80% (vmstat)
- Application disk bound when %tm_act greater than 70% (iostat)
- Paging space low when paging space greater than 70% active (lsps -a)
- Paging bound when paging volumes %busy greater than 30% of the I/O (iostat)
- Thrashing when page outs, CPU wait and run queue are all rising.
- Disk I/O bound when %iowait greater than 40% (iostat)
- Although in recent years this might not be true as CPUs are much faster and disks only a bit faster.
- This means many workloads are not disk bound and the CPUs deal with the data faster than the disks deliver it.
- So high I/O wait is perfectly normal on many systems.
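The simple numeric thresholds above can be expressed as a quick check. This is a rough sketch only: the function name and parameter names are invented here, the values are assumed to be percentages you have already read off vmstat, iostat and lsps -a, and the paging bound and thrashing rules are left out because they need trends rather than a single sample.

```python
def classify_bottlenecks(user_pct, sys_pct, iowait_pct, tm_act_pct, paging_space_used_pct):
    """Apply the rule-of-thumb thresholds from the list above to one sample."""
    findings = []
    if user_pct + sys_pct > 80:
        findings.append("CPU bound")
    if tm_act_pct > 70:
        findings.append("application disk bound")
    if paging_space_used_pct > 70:
        findings.append("paging space low")
    if iowait_pct > 40:
        # High I/O wait can be perfectly normal on modern systems, so hedge it.
        findings.append("possibly disk I/O bound")
    return findings


if __name__ == "__main__":
    print(classify_bottlenecks(user_pct=65, sys_pct=20, iowait_pct=10,
                               tm_act_pct=35, paging_space_used_pct=50))
```

Treat the result as a prompt for where to look next, not as a diagnosis.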
Setting minfree and maxfree on an AIX V5.3 system
minfree = (maximum of either 960 or # of logical CPUs * 120) divided by # of memory pools
maxfree = minfree + (# of logical CPUs * maximum read ahead) divided by # of memory pools
# of logical CPUs from bindprocessor -q (count number of available processors)
# of memory pools from vmstat -v
(note: If the number is 0 use 1 as a default)
Maximum read ahead is the greater of maxpgahead or j2_maxPageReadAhead from ioo -a
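As a worked example, the two formulas above can be computed as follows. This is a sketch under assumptions: the function name is made up, results are rounded down with integer division, and in the maxfree formula only the (logical CPUs * maximum read ahead) term is divided by the number of memory pools.

```python
def minfree_maxfree(logical_cpus, memory_pools, max_read_ahead):
    """Return (minfree, maxfree) using the rules of thumb above.

    max_read_ahead is the greater of maxpgahead and j2_maxPageReadAhead.
    """
    # If vmstat -v reports 0 memory pools, use 1 as the default.
    pools = memory_pools if memory_pools > 0 else 1
    minfree = max(960, logical_cpus * 120) // pools
    # Assumption: only the read-ahead term is divided by the pool count.
    maxfree = minfree + (logical_cpus * max_read_ahead) // pools
    return minfree, maxfree


if __name__ == "__main__":
    # 16 logical CPUs, 2 memory pools, maximum read ahead of 128 pages
    print(minfree_maxfree(16, 2, 128))  # (960, 1984)
```

For 16 logical CPUs, 2 memory pools and a maximum read ahead of 128: minfree = max(960, 1920) / 2 = 960 and maxfree = 960 + (16 * 128) / 2 = 1984.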
Tuning minfree and maxfree
You need to increase the value of minfree when you see 'Free Frame Waits' increasing over time.
Use 'vmstat -s' to display the current value of 'Free Frame Waits'.
Remember to calculate a new value for maxfree whenever you change minfree.
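To see whether 'Free Frame Waits' is increasing over time, two saved 'vmstat -s' outputs can be compared. This is a minimal sketch that assumes the counter appears on its own line as a number followed by the words "free frame waits"; the function names and the sample text are invented for illustration and are not real captures.

```python
import re


def free_frame_waits(vmstat_s_output):
    """Return the 'free frame waits' counter from 'vmstat -s' text, or None."""
    for line in vmstat_s_output.splitlines():
        # Assumed line layout: "<count> free frame waits"
        match = re.match(r"\s*(\d+)\s+free frame waits\s*$", line, re.IGNORECASE)
        if match:
            return int(match.group(1))
    return None


def waits_increasing(earlier_output, later_output):
    """True if the counter grew between two captures."""
    before = free_frame_waits(earlier_output)
    after = free_frame_waits(later_output)
    if before is None or after is None:
        raise ValueError("no 'free frame waits' line found")
    return after > before


if __name__ == "__main__":
    # Illustrative text only, not a real capture.
    earlier = "      1200 paging space page outs\n       350 free frame waits\n"
    later = "      1250 paging space page outs\n       910 free frame waits\n"
    print(waits_increasing(earlier, later))
```

If the counter keeps growing between captures, that is the cue to raise minfree and then recalculate maxfree.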
Large Disk Subsystem Setup
Setting up Disks? - Want some advice? Here are my (Nigel Griffiths) top tips:
- More disks are goodness - increasingly hard to justify with bigger disks but spindles count for performance
- Use ALL of the disks - ALL of the time. But make the computer do this i.e. not manually move data around.
- Hose it all about (especially for systems with less than 8 disks = one RAID5 is best)
- Using RAID5 - 7+1 parity for maximum disk use
- Aim for 16 to 32 LUNs - make them big enough to reduce the number of LUNs - no one can manage a thousand LUNs
- Use 4 paths = 64 to 128 vpaths - two paths also OK. Never use more than 4 paths.
- All LUNs same size
- Clear map of layout - you must know where LUNs are placed and where they overlap.
- 4 to 16 filesystems (never just one filesystem) to avoid free space allocation bottlenecks.
- With AIX 5.2 onwards, the 64 bit Kernel and JFS2 should be thought of as the default.
- LVM striped at 64KB or 128KB stripe width.
- Database should have 8KB or better yet 16KB block size minimum.
- Mix data, index, logs up on all the disks - to avoid hot spots (don't have disks for specific data types)
- Direct attach, if you are the only ESS/FAStT user
- Sequential I/O - big block = max throughput above 64KB blocks, also 256KB or 1MB works well (4KB can kill your throughput)
- Random I/O - many files, equally spread across filesystems and hence disks
- Two Fibre Channel adapters per TB
- Expect 70% adapter bandwidth max (2 Gbit FC = max 200MB/s and 140MB/s real)
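The sizing arithmetic in the tips above (paths per LUN, adapters per TB, 70% of adapter bandwidth) can be checked with a few lines. This is a sketch only; the function names are invented here and the figures are the rules of thumb from the list, not measured values.

```python
import math


def vpath_count(luns, paths_per_lun):
    """Number of vpaths for a given LUN count and paths per LUN."""
    return luns * paths_per_lun


def adapters_needed(terabytes, adapters_per_tb=2):
    """Fibre Channel adapters for the given capacity, rounded up."""
    return math.ceil(terabytes * adapters_per_tb)


def realistic_mb_per_s(nominal_mb_per_s, percent=70):
    """Expected real throughput at the given percentage of nominal bandwidth."""
    return nominal_mb_per_s * percent // 100


if __name__ == "__main__":
    print(vpath_count(16, 4), vpath_count(32, 4))   # 64 128
    print(adapters_needed(3))                        # 6
    print(realistic_mb_per_s(200))                   # 140
```

So 16 to 32 LUNs with 4 paths gives the 64 to 128 vpaths quoted above, and a 2 Gbit adapter at 200MB/s nominal works out to 140MB/s real.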
For ESS/FAST/DS in particular (these numbers are 2 years old and probably need updating):
- 6 to 8 Fibre Channel adapters per ESS
- Disk Size: 36 GB=OK, 72 GB OK for Random, 146 GB only for archive
- Use ESS bid against EMC and HDS
- Typically 30+ hosts
- Disk qdepth ESSF20=60 and ESS800=90 FAST900=1024/disks/hosts FAST500+600= 212/hosts/disks
- Fibre Channel memory DMA=1MB, CMD's 248