Exploring the Linux Storage Path - Tracing block I/O kernel events
Rizzio 270002CDTJ Visits (28363)
Typically, when the operating system is deployed it uses a generic configuration in order to provide a fair performance for any kind of application. The term application, in this paper, is used to reference any software that is executed over the operating system. Therefore, databases, web servers, e-mail servers, in-house made softwares, etc, all of them are referred here by the generic term “application”.
The behavior of each application depends on how it was designed and developed, but also on how it is used. At the end, the behavior of the application is reflected in the behavior of the o.s. and, considering the limitations of the underlying hardware to fulfill the o.s. requests, the application behavior is influenced back.
If we consider the data flow, from the application down to the hardware, there are several layers of software and hardware working together, which we can call the I/O path.
The objective of this work is to present a method to analyze the I/O path, and how to tune the operating system to improve the performance of the block I/O operations.
Enabling the Linux kernel to trace block I/O events
When the kernel is enabled for tracing, the compiler inserts a small No-Operation instruction of 5 bytes at the beginning of every kernel function. That instruction is used in the tracer calls, when tracing is enabled, to gather the timestamp of each function entry.
When tracing is disabled, the overhead of the instructions is very small.
Specifying an alternate directory to hold the new kernel, load the Linux kernel configuration interface:
# cd /usr/src/linux # make O=/fs1/newkernel menuconfig
Press <exit> until you return to the prompt. A message like this will appear:
# # configuration written to .config # *** End of Linux kernel configuration. *** Execute 'make' to build the kernel or try 'make help'.
You can confirm your choices by doing something like this:
#/usr/src/linux> grep -i tracer /fs1
Compile the kernel
# make O=/fs1/newkernel
Install the new kernel and modules
# sudo make O=/fs1/newkernel modules_install install sh /usr
As states the Release Notes for SUSE Linux Enterprise Server 11, every kernel module has a flag 'supported' which may assume the values "yes", "external", "" (empty, not set, "unsupported"), and all modules of a recompiled kernel are by default marked as unsupported. Therefore, if the machine is rebooted now in order to activate the ftrace kernel, all the recompiled modules will not be loaded, failing with "unsupported modules" messages.
To surpass this problem, edit the file /etc
Rename the initrd created during the compilation process:
# mv vmli
Create a new initrd for the ftrace kernel
# mkinitrd -k vmli
Now, reboot the machine:
# shutdown -r now
The Ftrace uses the file system debugfs to hold its control files and the files used for output. The debugfs is automatically configured into the kernel when ftrace is configured and its mount point directory /sys/kernel/debug is also created.
You can include debugfs or perhaps it is already included in /etc/fstab and it is mounted automatically at boot time. Verify whether it is already mounted by checking the file /etc/mtab. If it is not mounted, mount it:
# mount -t debugfs nodev /sys/kernel/debug
Go to the ftrace working directory.
# cd /sys
You can list all the functions that ftrace is able to trace by looking into the file avai
The file named "trace" holds the output of what is being traced, in human readable format. It is also possible to control the level of information in this file. You can do this by changing the contents of the file trace_options. In order to see which options are available, just list its contents. The words beginning with the string "no" represent trace options that are disabled. To enable an option, just echo in the option without the string "no"; or vice-versa. For example:
To enable the option stacktrace:
# echo stacktrace > trace_options
To disable the option stacktrace:
# echo nostacktrace > trace_options
The start and stop of the tracing activity is controlled by the file tracing_enabled, by just echoing 1 and 0 into it, respectivelly. The recording of the ring buffer, which is what we see at the "trace" file, is controlled by echoing 1 and 0 in the file tracing_on. When tracing_on holds 0 but tracing_enabled holds 1, the calls made by the tracers still happen, which imply some overhead, but these calls notice that the ring buffer is not recording and therefore they will not write any data into it.
The amount of data recorded by the tracer depends on the size of its buffer, which is controlled by the file buffer_size_kb. The number in it represents the number of records that the tracer can capture per CPU. In order to modify this file, the tracing activity must be stopped and no tracer must have been chosen. For example:
# echo 0 > tracing_enabled # echo nop > current_tracer # echo 1000 > buffer_size_kb
A very interesting feature of ftrace is its ability to trace events, which are grouped by subsystem. They are listed in the file available_events in the format <sub
To enable the tracing of the specific event block_bio_complete:
# echo 1 > even
To enable the tracing of all events of the block subsystem:
# echo 1 > events/block/enable
To enable the tracing of all the available events:
# echo 1 > events/enable
The events enabled are listed in the file set_event. It is also possible to enable events by echoing them directly into this file.
Monitoring the block I/O
Now, we proceed with the monitoring of the block I/O activity.
Verify if the tracing activity is started:
# cat tracing_enabled 1
Specify that all the block I/O events are to be traced:
# echo 0 > tracing_on # echo blk > current_tracer # echo 1 > events/block/enable # cat set_event bloc
Now, start the capture of the tracing information, execute the application that you want to analyze and stop the capture:
# echo 1 > tracing_on <<< execute the application >>> # echo 0 > tracing_on
See the captured tracing information:
# cat trace tracer: blk flush-8:0-1117  1677.505600: block_remap: 8,0 W 1831633 + 8 <- (8,2) 787408 flush-8:0-1117  1677.505603: block_bio_queue: 8,0 W 1831633 + 8 [flush-8:0] flush-8:0-1117  1677.505607: block_getrq: 8,0 W 1831633 + 8 [flush-8:0] flush-8:0-1117  1677.505610: block_plug: [flush-8:0] flush-8:0-1117  1677.505611: block_rq_insert: 8,0 W 0 () 1831633 + 8 [flush-8:0] <idle>-0  1677.509490: block_unplug_timer: [swapper] 1 kblockd/0-18  1677.509502: block_unplug_io: [kblockd/0] 1 kblockd/0-18  1677.509507: block_rq_issue: 8,0 W 0 () 1831633 + 8 [kblockd/0] <idle>-0  1677.510832: block_rq_complete: 8,0 W () 1831633 + 8  dd-7382  1678.034199: block_bio_queue: 8,16 R 267280 + 8 [dd] dd-7382  1678.034212: block_getrq: 8,16 R 267280 + 8 [dd] dd-7382  1678.034215: block_plug: [dd] dd-7382  1678.034216: block_rq_insert: 8,16 R 0 () 267280 + 8 [dd] dd-7382  1678.034220: block_unplug_io: [dd] 1 dd-7382  1678.034222: block_rq_issue: 8,16 R 0 () 267280 + 8 [dd] <idle>-0  1678.277551: block_rq_complete: 8,16 R () 267280 + 8  dd-7382  1678.277763: block_bio_queue: 8,16 R 267272 + 8 [dd] dd-7382  1678.277767: block_getrq: 8,16 R 267272 + 8 [dd] dd-7382  1678.277769: block_plug: [dd] dd-7382  1678.277770: block_rq_insert: 8,16 R 0 () 267272 + 8 [dd] dd-7382  1678.277773: block_unplug_io: [dd] 1 dd-7382  1678.277775: block_rq_issue: 8,16 R 0 () 267272 + 8 [dd] <idle>-0  1678.288475: block_rq_complete: 8,16 R () 267272 + 8 
To clean up the trace output before a new execution
# echo 0 > trace
Let's use the latency format of the trace options:
# echo latency-format > trace_options # echo 1 > tracing_on <<< execute the application >>> # echo 0 > tracing_on # cat trace flush-8:-12819 0..... 115357738us+: block_bio_queue: 8,16 W 267280 + 8 [flush-8:16] flush-8:-12819 0..... 115357753us+: block_getrq: 8,16 W 267280 + 8 [flush-8:16] flush-8:-12819 0d.... 115357755us : block_plug: [flush-8:16] flush-8:-12819 0d.... 115357756us+: block_rq_insert: 8,16 W 0 () 267280 + 8 [flush-8:16] flush-8:-12819 0..... 115357759us : block_bio_queue: 8,16 W 267288 + 8 [flush-8:16] flush-8:-12819 0d.... 115357761us+: bloc
Now, the question is: what are the meanings of the columns ?
At the first column, we have kernel thread and pid.
At the second column, we have several spots:
When using the latency trace option, the timestamp is relative to the start of the trace, in microseconds.
The field after the timestamp and before the colon is set to either '!' or '+' to call attention to especially long delays. The meanings of the symbols are:
Next, we have the block I/O events. The meanings of all of them are given ahead. Let's describe the structure of two of them:
block_bio_queue: 8,16 W 267288 + 8 [flush-8:16] <event name>: <device major number>,<device minor number> <type of I/O operation> <sector number> + <amount of sectors> [kernel thread] block_rq_complete: 8,16 W () 267280 + 16  <event name>: <device major number>,<device minor number> <type of I/O operation> __get_str(cmd) <sector number> + <amount of sectors> [errors]
What to look for ?
The most obvious observation is the elapsed time to complete the I/O operation. It is possible to infer this time by subtracting the timestamp of the block_rq_complete from the timestamp of the previous block_rq_issue. This operation reveals the time taken by the device driver to process the I/O request, which in most cases includes the SAN and storage unit processing time. Depending on the vendor/model of the storage unit and the configuration of its LUNs, an elapsed time between 5ms and 10ms is adequate.
The second observation, would be any errors in the block_rq_complete events, which can be revealed by the string between brackets [ ]. A further investigation on this errors is recommended.
If events of the type block_rq_requeue are observed, that could means the SAN or storage unit are not being able to handle the I/O operations, due to a high traffic for example.
The events bloc
The number of in-flight I/O operations can be modified, considering sometimes is better to have the device queue depth smaller than the scheduler depth:
# echo 256 > /sys
The Kernel I/O scheduler may also affect these events. You can change it for each specific disk:
# echo cfq > /sys
If the application has an I/O profile in which read operations are more important than writes, it could be recommended to increase the prefetch size, for example for the disk sdc:
# echo 2048 > /sys
The prefetch size can and must also be increased in the storage unit.
The event block_split may reveal that exists a problem in terms of I/O alignment between the layers: application, filesystem, LVM, partitioning and the RAID configuration.
In this case, when creating filesystems using mkfs, use the extended options to inform the ext2,3,4 driver that the underlying disk is actually a RAID array:
-E extended-options stride=stride-size stri
Also, create the physical volume in LVM aligned with the underlying RAID disk:
pvcreate -M2 --dataalignment <chunk size>
In summary, there are many layers to be investigated, whenever they exist: application, filesystem, LVM, partitioning, software raid, multipath driver, HBA driver, SAN and Storage Unit.
Description of the block I/O events
block_rq_abort - Abort Block Operation Request
block_rq_insert - Insert Block Operation Request into Queue
block_rq_issue - Issue Pending Block IO Request Operation to Device Driver
block_rq_requeue - Place Block IO Request Back on a Queue
block_rq_complete - Block IO Operation Completed by Device Driver
block_bio_bounce - Used Bounce Buffer When Processing Block Operation
block_bio_complete - Completed All Work on the Block Operation
block_bio_backmerge - Merging Block Operation to the End of an Existing Operation
block_bio_queue - Putting New Block IO Operation in Queue
block_getrq - Get a Free Request Entry in Queue for Block IO Operations
block_sleeprq - Waiting to Get a Free Request Entry in Queue for Block IO Operation
block_plug - Keep Operations Requests in Request Queue
block_unplug_timer - Timed Release of Operations Requests in Queue to Device Driver
block_unplug_io - Release of Operations Requests in Request Queue
block_split - Split a Single bio struct into Two bio structs
block_remap - Map Request for a Partition to the Raw Device