| | | h1. Performance Monitoring Tips and Techniques |
| | |
| | |
| | | h2. What are the most common performance problem causes? |
| | | !hints.gif! | \\ |
| | From my personal experience (your mileage may vary): |
| | | * 50% poor disk layout and management - some disks 90%\+ busy while 50% are not used at all - do you have a clear document that maps the files to actual disks? |
| | * 10% poor setup of RDBMS tuning parameters relating to memory use |
| | | * 10% single threaded batch applications (and we have been using SMP for 9 years\!\!) |
| | * 10% poorly written customer extensions to standard applications |
| | | * 5% system running with errors in the errpt log file (including CPU failures\!\!) |
| | * 5% paging on large RAM (>2 GB) systems & vmtune not used to set minperm/maxperm |
| | * 5% AIX problems already discovered and fixed but AIX was not up to date. |
| | * 4% badly ported app = not compiled with optimization or on old AIX versions |
| | | * 1% genuine bugs in AIX or commands\\ |
| | and every single one of these was reported as a problem with the hardware\!\!\! | |
| | |
| | | h2. Perfpmr - the performance guru's secret weapon |
| | |
| | Perfpmr is the official AIX Support performance data gathering tool. If you report a performance problem then, after the usual checks on your software levels, AIX Support will ask you to use this tool to gather a large amount of information, which you must return as soon as possible. It is actually a set of shell scripts that use standard tools to gather information and write report summaries. This makes it very useful for you too: why reinvent the wheel when the IBM Austin experts maintain this excellent tool for you? [Get the latest AIX version|ftp://ftp.software.ibm.com/aix/tools/perftools/perfpmr/] |
| | |
| | It comes with an excellent README file with all the details on running it, but briefly: find at least 100 MB of free disk space and, as the root user, untar the file and run the master perfpmr.sh shell script. It needs to run while the problem is happening (of course), and I suggest running it for 10 minutes = 600 seconds, for example {{./perfpmr.sh 600}}. It actually takes longer than this, as there are several sections and the final phase gathers a lot of configuration details, which takes a long time on a machine with lots of disks. Once it has finished, take a look at the summary (.sum) files, as these are readable. You may be able to sort out the performance problem yourself from this data alone. |
| | |
| | But perfpmr is more than this because: |
| | * it precisely documents your system configuration at the hardware and AIX levels - this saves you doing it yourself. |
| | * if you take regular perfpmr data of the system running happily on a normal workload day, then when there is a problem you or the AIX support team can look at the differences between the good and bad performance captures, which makes diagnosis ten times simpler. |
| | | * you should also take perfpmr data of the system running happily before and after any minor or major change to the system, for example software upgrades, AIX upgrades, changes to disk subsystems and adding software. Again, you will be able to determine what changed - for good or bad - ten times faster with before and after data captures. |
| | |
| | | h2. My favorite Desktop tools of the job |
| | |
| | I have a Windows based Thinkpad and AIX servers and this is what I use every day: |
| | # Virtual Network Computing *VNC* from TightVNC - so I can have X Windows at zero cost and it stays running for tomorrow |
| | | # *WebSM Remote Client* \- so I can remotely manage my AIX machines and HMCs. You can download this from your HMC (if you have it set up correctly and have allowed the protocol) from http://<yourhmc>/remote_client.html |
| | # *Putty* \- so I don't have to use the horrid Windows telnet, and the terminal emulation is flawless |
| | # *VIM* \- on AIX (for colourising code) and Windows so I have a powerful editor and graphics too. |
| | # *Firefox* on Windows and Mozilla on AIX - for browsing |
| | # *Filezilla* on Windows as a graphical FTP client for transfers between Windows and AIX |
| | |
| | And all of the above is freely available |
| | |
| | | h2. My favourite AIX Performance tools of the job |
| | |
| | # *nmon* \- well I did write it :) |
| | # *nmon analyser* to graph the data and work out what the machine is doing over the longer term |
| | | # *lsconf* \- to document the machine |
| | # *vmstat* and *iostat* for a quick look but the output can drive you mad after ten minutes (consider using the {{\-Iwt}} flags on {{vmstat}} and the {{\-malT}} flags with MPIO hdisks or the {{\-alDT}} flags with non-MPIO hdisks on {{iostat}}, but if using the {{\-a}} flag on {{iostat}}, apply the fix for [APAR IZ08753|http://www.ibm.com/support/docview.wss?uid=isg1IZ08753] \- IOSTAT \-A FLAG CAUSES BAD VALUES FOR AVGSERV, RPS, WPS) |
| | # *filemon* \- to find out the busy filesystems and files (see [How to use AIX V5.3 filemon to determine where I/O requests originate|AIXV53filemon] for more info) |
| | # *fileplace* \- to check for scrambled files (see [How to use AIX V5.3 fileplace to determine the location on disk of a given file block|AIXV53fileplace] for more info) |
| | # *rmss* \- using the old trick to see if memory is really used |
| | # *lparstat* to check up on shared processor LPARs |
| | # *perfpmr* and snap for reporting performance issues |
| | # *lvmstat* to check up on high use disks |
| | |
 |  | |
| | h2. Performance Bottleneck Definitions |
 |  | |
| | # CPU bound when %user + %sys greater than 80% (vmstat) |
| | # Application disk bound when %tm_act greater than 70% (iostat) |
 |  | # Paging space low when paging space greater than 70% active (lsps -a) |
| | | # Paging space low when paging space greater than 70% active (lsps \-a) |
| | # Paging bound when the paging volumes are busy with more than 30% of the I/O (iostat) |
| | # Thrashing when page outs, CPU wait and the run queue are all rising. |
| | # Disk I/O bound when %iowait greater than 40% (iostat) |
| | | |
| | * Although in recent years this might not be true as CPUs are much faster and disks only a bit faster. |
| | * This means many workloads are not disk bound and the CPUs deal with the data faster than the disks deliver it. |
| | * So high I/O wait is perfectly normal on many systems. |
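The rules of thumb above can be written down as a small checker. This is only a sketch: the function name and argument names are made up for illustration, the numbers have to be read off vmstat, iostat and lsps \-a by hand (or by your own script), and only the four definitions that have a clear numeric threshold are covered.

```python
def classify_bottlenecks(user_pct, sys_pct, iowait_pct, tm_act_pct, paging_space_pct):
    """Apply the rule-of-thumb thresholds to one set of readings.

    All arguments are percentages (0-100). The names are hypothetical;
    they are not fields of any AIX tool.
    """
    findings = []
    if user_pct + sys_pct > 80:      # %user + %sys from vmstat
        findings.append("CPU bound")
    if tm_act_pct > 70:              # %tm_act of the busiest disk from iostat
        findings.append("application disk bound")
    if paging_space_pct > 70:        # paging space in use from lsps -a
        findings.append("paging space low")
    if iowait_pct > 40:              # %iowait from iostat
        # High I/O wait is normal on many modern systems, so treat this as a hint only.
        findings.append("possibly disk I/O bound")
    return findings


if __name__ == "__main__":
    print(classify_bottlenecks(user_pct=60, sys_pct=25, iowait_pct=10,
                               tm_act_pct=50, paging_space_pct=20))
```

Treat the result as a starting point for investigation, not a diagnosis, particularly the I/O wait entry.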
| | |
| | | {anchor:minmaxfree} |
| | |
| | | h2. Setting minfree and maxfree on an AIX V5.3 system |
| | |
| | | minfree = (maximum of either 960 or # of logical CPUs * 120) divided by # of memory pools |
| | |
| | | maxfree = minfree + (# of logical CPUs * maximum read ahead) divided by # of memory pools |
| | |
| | | Where: |
| | * the number of logical CPUs comes from {{bindprocessor \-q}} (count the number of available processors) |
| | * the number of memory pools comes from {{vmstat \-v}} (note: if the number is 0, use 1 as a default) |
| | * maximum read ahead is the greater of maxpgahead and j2_maxPageReadAhead from {{ioo \-a}} |
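As a worked example, the two formulas can be sketched in a few lines of Python. Two things here are assumptions rather than anything stated above: the results are rounded down to whole pages, and in the maxfree formula only the read ahead term is divided by the number of memory pools (the literal reading of the formula). The function name and the sample figures are made up for illustration.

```python
def free_list_targets(logical_cpus, memory_pools, maxpgahead, j2_max_page_read_ahead):
    """Return (minfree, maxfree) from the rule-of-thumb formulas.

    Assumptions: integer (floor) division, and only the read ahead term
    of maxfree is divided by the number of memory pools.
    """
    pools = memory_pools if memory_pools > 0 else 1   # vmstat -v may report 0
    max_read_ahead = max(maxpgahead, j2_max_page_read_ahead)
    minfree = max(960, logical_cpus * 120) // pools
    maxfree = minfree + (logical_cpus * max_read_ahead) // pools
    return minfree, maxfree


if __name__ == "__main__":
    # Hypothetical machine: 8 logical CPUs, 2 memory pools,
    # maxpgahead = 8 and j2_maxPageReadAhead = 128.
    print(free_list_targets(8, 2, 8, 128))   # (480, 992)
```

For that hypothetical machine, minfree is max(960, 8 * 120) / 2 = 480 and maxfree is 480 + (8 * 128) / 2 = 992.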
| | |
| | | h2. Tuning minfree and maxfree |
| | |
| | | You need to increase the value of minfree when you see 'Free Frame Waits' increasing over time. |
| | Use 'vmstat \-s' to display the current value of 'Free Frame Waits'. |
| | Remember to calculate a new value for maxfree whenever you change minfree. |
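One way to watch the counter over time is to save the vmstat \-s output at two points and compare the 'free frame waits' lines. The sketch below assumes each counter is printed as a number followed by its label on one line; check that against your own output. The function names and the sample text are invented for illustration.

```python
import re


def free_frame_waits(vmstat_s_output):
    """Return the 'free frame waits' counter from saved vmstat -s text, or None."""
    for line in vmstat_s_output.splitlines():
        # Assumed line layout: "<count> free frame waits"
        match = re.match(r"\s*(\d+)\s+free frame waits\s*$", line, re.IGNORECASE)
        if match:
            return int(match.group(1))
    return None


def waits_are_increasing(earlier_output, later_output):
    """True if the counter grew between two captures of vmstat -s."""
    before = free_frame_waits(earlier_output)
    after = free_frame_waits(later_output)
    if before is None or after is None:
        raise ValueError("no 'free frame waits' line found")
    return after > before


if __name__ == "__main__":
    # Made-up captures, for illustration only.
    monday = "      1200 free frame waits\n"
    friday = "      4800 free frame waits\n"
    print(waits_are_increasing(monday, friday))   # True
```

The counter is cumulative, so it is the growth between captures that matters, not the absolute value.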
| | |
| | | h2. AIX V5.3 memory use |
| | |
 |  | See [AIX V5.3 memory use article|AIXV53memory]. |
| | |
| | h2. Large Disk Subsystem Setup |
| | |
| | Setting up Disks? - Want some advice? Here are my (Nigel Griffiths) top tips: |
| | # *More disks* are goodness - increasingly hard to justify with bigger disks but spindles count for performance |
| | # *Use ALL of the disks - ALL of the time*. But make the computer do this i.e. not manually move data around. |
| | # *Hose it all about* (especially for systems with less than 8 disks = one RAID5 is best) |
| | | # Using *RAID5* \- 7+1 parity for maximum disk use |
| | # Aim for *16 to 32 LUNs* \- make them big enough to reduce the number of LUNs - no one can manage a thousand LUNs |
| | # Use *4 paths* = 64 to 128 vpaths - two paths also OK. Never use more than 4 paths. |
| | # All *LUNs same size* |
| | | # Clear *map of layout* \- you must know where LUNs are placed and where they overlap. |
| | # *4 to 16 filesystems* (never just one filesystem) to avoid free space allocation bottlenecks. |
| | # With AIX 5.2 onwards, the *64 bit Kernel and JFS2* should be thought of as the default. |
| | # LVM *striped at 64KB* or 128KB stripe width. |
| | # *Database* should have 8KB or better yet 16KB block size minimum. |
| | # *Mix data, index, logs* up on all the disks - to avoid hot spots (don't have disks for specific data types) |
| | # *Direct attach*, if you are the only ESS/FAStT user |
| | # Sequential I/O - *big block* = max throughput above 64KB blocks, also 256KB or 1MB works well (4KB can kill your throughput) |
| | # Random I/O - *many files*, equally spread across filesystems and hence disks |
| | # Two Fibre Channel adapters per TB |
| | | # Expect 70% adapter bandwidth max (2 Gbit FC = max 200MB/s and 140MB/s real) |
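The arithmetic behind the path and adapter tips can be sketched in a few lines. The helper names are made up for illustration, and the 70% figure is the rule of thumb from the list above, not a measured value.

```python
def vpath_count(luns, paths_per_lun):
    """Each LUN shows up once per path, so vpaths = LUNs * paths."""
    return luns * paths_per_lun


def realistic_adapter_mb_per_s(nominal_mb_per_s, usable_percent=70):
    """Expect about 70% of the nominal adapter bandwidth in practice."""
    return nominal_mb_per_s * usable_percent // 100


def adapters_for_capacity(capacity_tb, adapters_per_tb=2):
    """Two Fibre Channel adapters per TB, as suggested in the list."""
    return capacity_tb * adapters_per_tb


if __name__ == "__main__":
    print(vpath_count(16, 4), vpath_count(32, 4))   # 64 128
    print(realistic_adapter_mb_per_s(200))          # 140
    print(adapters_for_capacity(3))                 # 6
```

So 16 to 32 LUNs over 4 paths gives the 64 to 128 vpaths quoted above, and a 2 Gbit adapter with a 200MB/s ceiling should be planned at about 140MB/s.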
| | |
| | For ESS/FAST/DS in particular (these numbers are 2 years old and probably need updating): |
| | # 6 to 8 Fibre Channel adapters per ESS |
| | # Disk Size: 36 GB=OK, 72 GB OK for Random, 146 GB only for archive |
| | # Use ESS bid against EMC and HDS |
| | | # Typically 30\+ hosts |
| | # Disk qdepth ESSF20=60 and ESS800=90 FAST900=1024/disks/hosts FAST500+600= 212/hosts/disks |
| | | # Fibre Channel memory DMA=1MB, CMD's 248 |
| | |
| | (!) The postings on this site solely reflect the personal views of the authors and do not necessarily represent the views, positions, strategies or opinions of IBM or IBM management. |