Using Fio to Tell Whether Your Storage is Fast Enough for Etcd
April 18, 2019 | Written by: Matteo Olivi and Mike Spreitzer
Categorized: Storage
The short story: fio and etcd
The performance of your etcd cluster depends strongly on the performance of the storage backing it. To help you understand the relevant storage performance, etcd
exports some Prometheus metrics. One of them is wal_fsync_duration_seconds. etcd docs suggest that the 99th percentile of this metric should be less than 10ms for storage to be considered fast enough. If you’re thinking about running an etcd
cluster on Linux machines and need to assess whether your storage (e.g., SSDs) is fast enough, one option is using fio, a popular I/O tester. To do that, you can run the following command where test-data is a directory under the mount point of the storage device you’re testing:
fio --rw=write --ioengine=sync --fdatasync=1 --directory=test-data --size=22m --bs=2300 --name=mytest
All you have to do then is look at the output and check if the 99th percentile of fdatasync durations is less than 10ms. If that is the case, then your storage is fast enough. Here is an example output:
  fsync/fdatasync/sync_file_range:
    sync (usec): min=534, max=15766, avg=1273.08, stdev=1084.70
    sync percentiles (usec):
     |  1.00th=[  553],  5.00th=[  578], 10.00th=[  594], 20.00th=[  627],
     | 30.00th=[  709], 40.00th=[  750], 50.00th=[  783], 60.00th=[ 1549],
     | 70.00th=[ 1729], 80.00th=[ 1991], 90.00th=[ 2180], 95.00th=[ 2278],
     | 99.00th=[ 2376], 99.50th=[ 9634], 99.90th=[15795], 99.95th=[15795],
     | 99.99th=[15795]
A few notes:
- We tuned the values of the --size and --bs parameters in the above example for our specific scenario. To get meaningful insights from fio, you should use the values that best apply to your case. To learn how to derive them, read the section below on why we used fio and how we found out how to configure it.
- During the test, the I/O load that fio generates is the only I/O activity. In a realistic scenario, there would likely be other writes to storage besides those associated with wal_fsync_duration_seconds, and such additional load can make wal_fsync_duration_seconds bigger. So, if the 99th percentile you observe with fio is only slightly below 10ms, your storage is probably not fast enough.
- You need a fio version at least as new as 3.5 because older versions don't report fdatasync duration percentiles (a quick way to check your installed version is shown right after this list).
- The output above is only a small excerpt of the whole fio output.
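To check which fio version you have installed before running the test:

fio --version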
The long story: fio and etcd
A bit of background on etcd WALs
Databases commonly use write-ahead logging; etcd
uses it too. Details about write-ahead logging are beyond the scope of this post, but what we need to know for our purpose is this—each etcd
cluster member keeps a write-ahead log (WAL) on persistent storage. etcd
writes certain operations on the key-value store (e.g., updates) to the WAL before applying them. If a member crashes and restarts between snapshots, it can locally recover transactions done since the last snapshot by looking at the content of the WAL.
So, any time a client adds a key to the key-value store or updates the value of an existing key, etcd
appends an entry recording the operation to the WAL—which is a normal file on persistent storage. Before it can proceed further, etcd
MUST be 100% confident that the WAL entry has actually been persisted. To achieve this on Linux, it is not enough to use the write system call, because the actual writing to physical storage might be delayed. For example, Linux might keep the written WAL entry in a kernel in-memory cache (e.g., the page cache) for some time. To ensure that the data has been written to persistent storage, you have to invoke the fdatasync
system call after the write
—that’s exactly what etcd
does (as shown in the following strace output, where 8 is the file descriptor of the WAL file):
21:23:09.894875 lseek(8, 0, SEEK_CUR)   = 12808 <0.000012>
21:23:09.894911 write(8, ".\0\0\0\0\0\0\202\10\2\20\361\223\255\266\6\32$\10\0\20\10\30\26\"\34\"\r\n\3fo"..., 2296) = 2296 <0.000130>
21:23:09.895041 fdatasync(8)            = 0 <0.008314>
Unfortunately, writing to persistent storage takes time. If fdatasync
takes too long, etcd
system performance degrades. etcd documentation suggests that for storage to be fast enough, the 99th percentile of fdatasync
invocations when writing to the WAL file must be less than 10ms. There are other metrics related to storage that are relevant, but this one is the main focus of this post.
Using fio to assess storage
If you have some storage and want to assess whether it is suitable to back etcd
, you can use fio—a very popular I/O tester. Remember that disk I/O can happen in a lot of different ways (sync vs. async, many different classes of system calls, etc.), and fio can generate pretty much all of them. The flip side of that flexibility is that fio is complex to use: it has a lot of parameters, and different combinations of their values yield completely different I/O workloads. To get meaningful numbers with respect to etcd
, you have to make sure that the write load generated by fio
is as similar as possible to that generated by etcd
when writing to WAL files.
This means that, at the very least, the load generated by fio
must be a series of sequential writes to a file, where each write is made up of a write system call followed by a fdatasync
system call. To get the sequential writes, you have to provide fio
with the flag --rw=write
. To make sure that fio
writes using the write
system call—as opposed to other system calls (e.g. pwrite)—use --ioengine=sync
. Finally, to make sure that each write
fio
invokes is followed by a fdatasync
, use --fdatasync=1
. The two other parameters in the example, --size
and --bs
, might vary depending on your specific scenario. Read the next section to learn how to tune them.
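Putting those flags together: note that the directory passed to --directory needs to exist already (fio creates its test files inside it). For example, if the device you want to test were mounted at /mnt/ssd (a placeholder path for illustration), the invocation would look like this, with --size and --bs still to be tuned as described in the next section:

mkdir -p /mnt/ssd/test-data
fio --rw=write --ioengine=sync --fdatasync=1 --directory=/mnt/ssd/test-data --size=22m --bs=2300 --name=mytest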
Why we used fio and how we found out how to configure it
This blog post stems from a real scenario we faced. We had a Kubernetes v1.13 cluster monitored with Prometheus. etcd
v3.2.24 was backed by SSDs. The metrics concerning etcd
showed fdatasync
latencies were too high, even when the cluster was idling. We found those metrics a little hard to believe, and we were not sure exactly what they really represented. Also, the cluster was made up of VMs; how could we tell whether the physical SSDs were really too slow or whether virtualization was introducing a delay? We also had various possible changes to hardware and software configuration that we could make, and we needed a way to evaluate them. We could run etcd
in each configuration and look at its Prometheus metrics, but that takes more than a little work. We wanted a simple way to evaluate a given configuration. We wanted to validate our understanding of the Prometheus metrics from etcd
.
To do that, we had to solve two problems, though. First, what does the I/O workload generated by etcd
when writing to WALs look like? Which system calls are used? What’s the size of the writes? Second, assuming you have the answers to those questions, how do you reproduce a similar workload with fio
? Remember, fio
is extremely flexible—there are a lot of parameters. We solved both problems with the same approach, centered on the lsof and strace commands. lsof
can be used to display all the file descriptors used by a process and the files they’re associated with. strace
can be used to examine an already-running process or to launch a process and examine it. strace
outputs all the system call invocations by the examined process—and, optionally, its children. The latter is important for processes that fork, and etcd
is such a process.
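As a concrete illustration of the lsof part (the process lookup shown here is just one possibility and assumes a single etcd process on the machine; etcd WAL files end in .wal by default), you can list the file descriptors etcd has open for its WAL files with something like:

lsof -p $(pidof etcd) | grep '\.wal'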
The first thing we did was to use strace
to examine the etcd
server backing Kubernetes while the cluster was idling. This showed us that the WAL write sizes were very tightly grouped, almost all in the range 2200–2400 bytes. That’s why in the command at the top of this post there’s the flag --bs=2300
(bs is the size in bytes of each write fio
makes). Notice that the size of the etcd
writes might vary depending on etcd
version, deployment, parameter values, etc., and it affects fdatasync
duration. If you have a similar use case, you should analyze your own etcd
processes with strace
to get meaningful numbers.
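One way to take a similar look yourself (again assuming a single etcd process on the machine) is to attach strace to it and trace only the write and fdatasync calls:

strace -f -p $(pidof etcd) -e trace=write,fdatasync

Keep in mind that strace slows down the traced process, so you probably don't want to do this on a production cluster.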
Next, to get a clear and comprehensive view of the filesystem activities of etcd
, we launched it under strace
with the -ffttT
flags, which tell strace to follow forked processes, write each one’s output to a separate file, and report the start time and duration of each system call. We also used lsof
to confirm our parsing of the strace
output regarding which file descriptor was being used for which purpose. This yielded the sort of strace
output shown above. Doing statistics on the sync times confirmed that the wal_fsync_duration_seconds metric from etcd
corresponds to the fdatasync
calls on the WAL file descriptors.
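Concretely, launching etcd under strace might look something like this (etcd's own flags are omitted here as a placeholder; with -ff and -o, strace writes one trace file per process, named etcd-trace.<pid>):

strace -ffttT -o etcd-trace etcd <your-usual-etcd-flags>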
To get fio
to generate a workload similar to etcd
’s, we read the fio
docs and developed parameters to serve our purpose. We confirmed the system calls and their timing by running fio
under strace
, as we did for etcd
.
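For example, running the fio command from the top of this post under strace (with the same strace flags as before) produces per-process trace files whose write and fdatasync lines can be compared against etcd's:

strace -ffttT -o fio-trace fio --rw=write --ioengine=sync --fdatasync=1 --directory=test-data --size=22m --bs=2300 --name=mytest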
We took care when setting the value of the --size
parameter, which represents the total I/O fio
generates. In our case, that’s the total number of bytes written to storage—which is directly proportional to the number of write
(and fdatasync
) system call invocations. For a given bs
, the number of fdatasync
invocations is size/bs
. Since we were interested in a high percentile, we wanted the number of samples to be large enough to be statistically meaningful, and we found that roughly 10^4 samples (which corresponds to a size of 22 MiB at bs=2300) served our purpose. Lower values of --size
have more pronounced noise (e.g., a few fdatasync invocations that take much longer than usual and skew the 99th percentile).
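As a quick sanity check of that arithmetic: with --bs=2300 and --size=22m (fio interprets the m suffix as MiB here), the number of write/fdatasync pairs is about 10^4:

echo $((22 * 1024 * 1024 / 2300))    # prints 10029, i.e., roughly 10^4 fdatasync samples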
Try it out
We have shown how to use fio
to evaluate whether your intended storage for etcd
is fast enough to support good etcd
performance. Go forth and test! For example, you can get VMs with SSD storage on the IBM Cloud.
