How to use AIX V5.3 filemon to determine where I/O requests originate
This web page is meant to supplement, not replace, other documentation which has been published:
The filemon command can produce reports on the following:
| Physical volume (hdisk) activity |
Read and write activity to AIX hdisks during filemon data collection, including rates, block sizes, response times, sequentiality of I/O read and write accesses, throughput, and hdisk utilization.
Most of these metrics are also reported by nmon.
Some of these metrics are reported by the iostat command. Others can be estimated from iostat output, but those estimates are inaccurate in various ways. See AIX V5.3 iostat awk script for more information.
All AIX tools measure response time as seen by a System p server. If AIX does not process interrupts promptly, the response time it reports can be considerably longer than the response time a disk storage subsystem is delivering. AIX lrud activity can sometimes delay interrupt processing:
|
| Logical volume activity |
Read and write activity to AIX logical volumes during filemon data collection, including rates, block sizes, response times, sequentiality of I/O read and write accesses, throughput, and logical volume utilization.
Please note that logical volume utilization does not mean much, since I/O operations to LVs do not block one another and an LV which is spread across many hdisks can be 100% busy when the hdisks on which the LV resides are not particularly busy. |
| Virtual memory segment activity |
Read and write activity to AIX virtual memory segments, including rates, block sizes, response times, and sequentiality of I/O read and write accesses.
Please note that all AIX JFS and JFS2 files (except those opened with Direct or Concurrent I/O (DIO or CIO)) are cached in real memory. Reads occur when a process makes a read file system call or when a process accesses a page in a file mapped to memory when the page is not resident in memory. Writes occur when a process makes a write file system call or when the AIX LRU daemon steals a dirty file page to make room in real memory for other pages.
If a segment for a given volume:inode does not appear in this report, it is because nothing was transferred between memory and disk for the segment during filemon data collection. |
| File activity |
Read and write activity to files generated by read , write , open , and lseek system calls during filemon data collection.
Please note that read and write requests can be satisfied from cached pages in real memory, so system calls might not result in physical I/O to hdisks.
Please note that lio_listio system calls (requesting asynchronous I/O) are not monitored by filemon, so the file activity report does not include file activity generated using asynchronous I/O (AIO) requests. And filemon does not monitor page faults, so the file activity report does not include file activity caused when a process accesses or updates a page in a file mapped to memory. The Virtual memory segment activity report does report on activity generated in these ways. |
The remainder of this web page will focus on the filemon Most Active Files report. See below for a discussion of that report.
How to collect a filemon report
Because filemon uses AIX trace for data collection, root access is required.
Here is an example of use of filemon which will capture 10 seconds of trace information for a workload which is already running:
Notice that a list of processes is captured before, during, and after the trace. With a little experience using filemon in a particular environment, it might be possible to eliminate two of the three captures. But, as we shall see, it is very important to capture information about each process to help identify files in the filemon report.
Here is another example which captures trace information for a workload launched within the shell script:
When the second example above is run, we should see something like:
Here is /tmp/filemon.out produced by the second example above:
It is also possible to capture a trace file, format it, and then run filemon against the formatted trace. See below for more information.
How to interpret the Most Active Files report
Let's focus on the four busiest files:
The Most active files
article describes the columes in the report as follows:
| #MBs |
Total number of megabytes transferred for the file. Rows are sorted by this field in decreasing order. |
| #opns |
Number of open system calls for the file. |
| #rds |
Number of read system calls for the file. |
| #wrs |
Number of write system calls for the file. |
| file |
File name (full path name is in detailed report). |
| volume:inode |
The logical volume that the file resides in and the i-node number of the file in the associated file system. |
Please examine the #opns column and note that information is provided in the volume:inode column only when the trace captures at least one open system call (#opns), although volume:inode is not provided for null since it isn't a real file. If no open system calls are captured, filemon does not get (and can not report) a file's inode number and the file column will show only the ID of the process which is accessing the file and the file descriptor number through which I/Os are being driven.
4.1 megabytes were processed in all four of these files while trace information was being captured.
The first file is pid=30996_fd=1. Since we know stdin is fd=0, stdout is fd=1, and stderr is fd=2 (see the Standard input, standard output, and standard error files
article), we realize pid=30996_fd=1 is not a file at all, but rather stdout for process 30996:
Process 30996 is an AIX cat command. Given the test case used in the example shell script, that is no surprise. Writes to stdout frequently are written to a terminal or to stdin of another process, so they often don't result in any writes to hdisks. But stdout can, of course, be redirected to a file.
The second file is unix_mp, but what directory is the file in? The Detailed File Stats report has the file's full path name. Files appear in the same order in the Detailed File Stats report as they appear in the Most active files report.
Or the volume:inode information (/dev/lv00:102403) can be used to get the answer from the running system:
The full path to file unix_mp is /home/pittman/fmontest/unix_mp. Again, no surprise given the test case used in the example shell script. Reads from this file will result in I/O to hdisks only if the file pages are not already cached in memory. To determine which AIX processes are currently accessing the file, use:
The third file is pid=35890_fd=0, which we recognize as stdin for process 35890:
Process 35890 is an AIX dd command. Again, no surprise given the test case used in the example shell script. Reads from stdin are frequently satisfied by stdout of another process or stdin can be redirected to come from a file, so reads from stdin often don't result in any reads from hdisks. Hmmm, it looks like the dd forks a child and we are seeing stdin on the parent process, not the child.
The fourth most active file is null. Even though #opns is not zero for the file, no volume:inode information is provided. The Detailed File Stats report shows that the fourth file is /dev/null.
Suppose we capture a filemon trace for a few seconds on a running system. How likely are we to capture an open request for each file that is in use? Probably not very likely. If we know the application very well, we can probably guess what files are being accessed based on the process executable and the file descriptor number. But suppose we don't know the application very well. What then?
How about using truss to trace an application on a test system, invoking the application with the same flags used when the application was started on the running system, as captured in /tmp/psef.out?
In /tmp/truss.out, we find:
The file descriptor follows the = sign on the far right of each line. Had we seen pid=35890_fd=3 rather than null in /tmp/filemon.out, the contents of /tmp/truss.out show that the AIX dd command gets fd=3 when it opens the file /dev/null specified by the of= parameter.
Running filemon against an AIX V5.3 trace file
It is certainly possible to capture a trace file, format it, and then run filemon against the formatted trace. To capture a trace while collecting only those trace hooks required for filemon, specify -J filemon on the trace command, as described in the Debug and Performance Tracing
article. (Any event group can be specified on the trace -J flag. Use the command trcevgrp -l to list all event groups.)
Here is an example in a shell script:
Please note that if the trace will be captured on one server and formatted on another, the output of trcnm and the /etc/trcfmt file must be collected from the server on which the trace is captured, as described in the Formatting a report on a different system
article.
Issues, and how to handle them
 | Note
If the AIX trcon command generates the message "/dev/systrctl: There is a request to a device or address that does not exist." when used to start a trace following a filemon -d, it is because the filemon command has not gotten the trace started when trcon is invoked. Try sleeping for a second after filemon and before trcon. |
 | Note As described in the filemon Command article, while monitoring I/O-intensive applications, the filemon command may not be able to consume trace events as fast as they are produced in real time. When that happens, the message:
will appear in near the top of the filemon report, indicating how many trace events were lost while trace buffers were full. The accuracy of the report will be diminished to some unknown degree. One way to prevent overflow is to monitor fewer levels of the file and I/O subsystems: the number of trace events generated is proportional to the number of levels monitored. Additionally, trace buffer size can be increased using the -T option, to accommodate larger bursts of trace events before overflow. But remember that increasing the trace buffer size will result in more pinned memory, and therefore may effect I/O and paging behavior. And as stated in the trace Daemon article:
When trace uses memory from the kernel heap which is the case for the -B option (32-bit kernel only), this memory remains part of kernel memory until the next reboot of the system. Thus, care should be taken when using large buffers.
and:
Also note that trace buffers use pinned memory, which means they are not pageable. Therefore, the larger the trace buffers, the less physical memory is available to applications. Unless the -b or -B flags are specified, the system attempts to allocate the buffer space from the kernel heap. If this request can not be satisfied, the system then attempts to allocate the buffers as separate segments.
Please note that the first caveat applies only when AIX V5.3 is running a 32-bit kernel (pretty rare nowadays). The command bootinfo -K will display 64 or 32 to indicate which kernel is currently being run. |
 | Note
Please note that while filemon reports 1.211 secs in measured interval in console output, that information does not appear in /tmp/filemon.out. It can, however, be calculated from information regarding a busy hdisk in the Detailed Physical Volume Stats report. For example, using /dev/hdisk8, multiply reads (141) times average read block size (60.7) and divide by throughput (3531.3) and blocks/KB (2) to get number of seconds analyzed by filemon: 141*60.7/(3531.3*2) = 1.212 seconds. If the hdisk handles writes as well as reads, those writes must be included with reads in the calculation. |
 | Note
AIX V5.3 filemon might report hdisk utilizations which are too high. That is, hdisk utilizations displayed by filemon may be incorrect, if so, and significantly higher than hdisk utilizations displayed by iostat. The problem is described by APAR IY66667 (DISK UTILIZATION VALUES GIVEN BY FILEMON ARE NOT ACCURATE). Before depending on hdisk utilization reported by filemon, confirm that the fix for APAR IY66667 is installed:
Please note that hdisk utilizations reported by filemon and by iostat will never be exactly the same. The iostat command's estimates of hdisk utilization are derived from kernel counters updated by AIX as it samples hdisk busy status. The iostat estimates are not as accurate as hdisk utilizations reported by (the fixed version of) filemon. |
AIX V5.3 iostat awk script
The following awk script can be used (with, for example, iostat 1 | awk -f resptimes.awk) to add I/O response time and block size to iostat output. With this method, you can see response times immediately and averaged over the one-second reporting interval specified on the iostat command in the example.
But please note that the response time estimate produced by the awk script is flawed in two ways:
- The response time estimate calculation assumes that there is no I/O overlap (that is, no two I/O operations are active simultaneously). As I/O load increases, some I/Os will inevitably overlap and the response time estimate derived from iostat will be too small.
- The response time estimate includes both read and write response times. If one is willing to make the bold assumption that read and write block sizes are the same, it is possible to use the iostat Kbps, Kb_read, and Kb_wrtn metrics to derive separate estimates of average read response time and average write response time. But that assumption is perhaps too bold (too likely to be wrong).
And please note that the iostat extended disk report includes service time and queue time metrics for each hdisk. (Response time = queue time + service time). Use the -D flag
on the AIX iostat command to request an extended disk report. See the Disk/Tape Utilization Report section of the AIX iostat command article
for documentation of metrics displayed in an extended disk report.
(
According to the AIX "From Strength to Strength" flyer
, iostat disk performance metrics were introduced in the AIX V5.3 update which became available on 15-April-05. According to the AIX V5.3 Fix packs web site
, AIX V5.3 Technology Level 2 was released in May 2005. So it seems likely that the iostat -D flag is available only in AIX V5.3 TL2 or above.)
In any case, please remember that hdisk response times measured on a server include latencies external to the disk storage subsystem (eg, the time taken to transit the SAN), so average response time delivered, measured, and reported by the disk storage subsystem will be lower than the average response time seen, measured, and reported by a server. And see the comment above regarding the LRU daemon, which can adversely affect I/O response times seen by a System p server.
The contents of this web page solely reflect the personal views of the authors and do not necessarily represent the views, positions, strategies or opinions of IBM or IBM management. Please use the
Add Comment link at the bottom of the page to provide feedback. Note: Until you sign in (using the link in the upper right corner of this web page), you will not see the
Add Comment link and you can not add a comment. If you do not already have an IBM ID, use the Register Now link on the sign in page to obtain one. Registration is quick and easy.