The short story: fio and etcd
The performance of your etcd cluster depends strongly on the performance of the storage backing it. To help you understand the relevant storage performance,
etcd exports some Prometheus metrics. One of them is wal_fsync_duration_seconds. etcd docs suggest that the 99th percentile of this metric should be less than 10ms for storage to be considered fast enough. If you’re thinking about running an
etcd cluster on Linux machines and need to assess whether your storage (e.g., SSDs) is fast enough, one option is using fio, a popular I/O tester. To do that, you can run the following command where test-data is a directory under the mount point of the storage device you’re testing:
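A minimal Python sketch that assembles such a command from the flags discussed in this post. The helper `build_fio_command` and the job name `mytest` are illustrative assumptions, and the `--size` and `--bs` values are the ones tuned for our scenario, not universal defaults.

```python
import shlex


def build_fio_command(directory="test-data", size="22m", bs=2300, name="mytest"):
    """Return a fio argument list for an etcd-WAL-like write workload."""
    return [
        "fio",
        "--rw=write",                 # sequential writes
        "--ioengine=sync",            # plain write system calls
        "--fdatasync=1",              # fdatasync after every write
        f"--directory={directory}",   # directory on the storage under test
        f"--size={size}",             # total amount of data written
        f"--bs={bs}",                 # size in bytes of each write
        f"--name={name}",             # hypothetical job name; fio requires one
    ]


# Print the command line instead of running it, so the sketch works without fio.
print(shlex.join(build_fio_command()))
```

On a machine with fio installed, the same list can be passed to `subprocess.run` to execute the test.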
All you have to do then is look at the output and check if the 99th percentile of fdatasync durations is less than 10ms. If that is the case, then your storage is fast enough. Here is an example output:
A few notes:
We tuned the values for the --size and --bs parameters in the above example for our specific scenario. To get meaningful insights from fio, you should use the values that best apply to your case. To learn how to derive them, read about how we found out how to configure fio.
During the test, the I/O load that fio generates is the only I/O activity. In a realistic scenario, it is likely that there would be other writes to storage besides those associated with wal_fsync_duration_seconds. Such additional load can increase wal_fsync_duration_seconds. So, if the 99th percentile you observe with fio is only slightly below 10ms, it’s likely that your storage is not fast enough.
You need a fio version at least as new as 3.5 because older versions don’t report fdatasync duration percentiles.
The output above is only a small excerpt from the whole output from fio.
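The 10ms check, including the advice about leaving some margin, can be sketched in Python as follows. This assumes you already have a list of individual sync durations in milliseconds. The nearest-rank percentile here is a simple stand-in and not necessarily how fio computes its percentiles, and the `headroom` parameter is a hypothetical knob for the margin, whose value is yours to choose.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: smallest sample with at least pct% of samples <= it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def storage_fast_enough(sync_durations_ms, threshold_ms=10.0, headroom=1.0):
    """Return True if the 99th percentile is below threshold_ms * headroom.

    headroom < 1.0 leaves margin for the extra I/O load that a real
    deployment adds on top of the WAL writes (hypothetical parameter).
    """
    return percentile(sync_durations_ms, 99) < threshold_ms * headroom
```

For example, `storage_fast_enough(durations, headroom=0.8)` only accepts storage whose 99th percentile stays under 8ms.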
The long story: fio and etcd
A bit of background on etcd WALs
Databases commonly use write-ahead logging;
etcd uses it too. Details about write-ahead logging are beyond the scope of this post, but what we need to know for our purpose is this—each
etcd cluster member keeps a write-ahead log (WAL) on persistent storage.
etcd writes certain operations on the key-value store (e.g., updates) to the WAL before applying them. If a member crashes and restarts between snapshots, it can locally recover transactions done since the last snapshot by looking at the content of the WAL.
So, any time a client adds a key to the key-value store or updates the value of an existing key,
etcd appends an entry recording the operation to the WAL—which is a normal file on persistent storage. Before it can proceed further,
etcd MUST be 100% confident that the WAL entry has been actually persisted. To achieve this on Linux, it is not enough to use the write system call because the actual writing to physical storage might be delayed. For example, Linux might keep the written WAL entry for some time in a kernel in-memory cache (e.g., page cache). To ensure that the data has been written to persistent storage, you have to invoke the
fdatasync system call after the
write—that’s exactly what
etcd does (as shown in the following strace output, where 8 is the file descriptor of the WAL file):
Unfortunately, writing to persistent storage takes time. If
fdatasync takes too long,
etcd system performance degrades. etcd documentation suggests that for storage to be fast enough, the 99th percentile of
fdatasync invocations when writing to the WAL file must be less than 10ms. There are other metrics related to storage that are relevant, but this one is the main focus of this post.
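To make the write-then-fdatasync pattern concrete, here is a minimal Python sketch that appends entries to a file and times each sync. It only illustrates the pattern described above and is not etcd's actual code. The function name and its parameters are assumptions made for this example.

```python
import os
import time


def timed_wal_style_appends(path, entry, count):
    """Append `entry` to `path` `count` times, syncing after each write.

    Returns the duration of each sync call in seconds.
    """
    # os.fdatasync is not available on every platform; fall back to os.fsync.
    sync = getattr(os, "fdatasync", os.fsync)
    durations = []
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o600)
    try:
        for _ in range(count):
            os.write(fd, entry)      # may only reach the kernel page cache
            start = time.perf_counter()
            sync(fd)                 # blocks until the data is on persistent storage
            durations.append(time.perf_counter() - start)
    finally:
        os.close(fd)
    return durations
```

The time spent inside the sync call is what the 10ms guideline is about; the write call itself usually returns quickly because it does not wait for the storage device.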
Using fio to assess storage
If you have some storage and want to assess whether it is suitable to back
etcd, you can use fio—a very popular I/O tester. Remember that disk I/O can happen in a lot of different ways—sync vs. async, many different classes of system calls, etc. The flip side of the coin is that
fio is extremely complex to use. It has a lot of parameters, and different combinations of their values yield completely different I/O workloads. To get meaningful numbers with respect to
etcd, you have to make sure that the write load generated by
fio is as similar as possible to that generated by
etcd when writing to WAL files.
This means that, at the very least, the load generated by
fio must be a series of sequential writes to a file, where each write is made up of a write system call followed by an
fdatasync system call. To get the sequential writes, you have to provide
fio with the flag
--rw=write. To make sure that
fio writes using the
write system call—as opposed to other system calls (e.g. pwrite)—use
--ioengine=sync. Finally, to make sure that each write that fio invokes is followed by an fdatasync, use --fdatasync=1. The two other parameters in the example, --size and --bs, might vary depending on your specific scenario. Read the next section to learn how to tune them.
Why we used fio and how we found out how to configure it
This blog post stems from a real scenario we faced. We had a Kubernetes v1.13 cluster monitored with Prometheus.
etcd v3.2.24 was backed by SSDs. The metrics concerning
fdatasync latencies were too high, even when the cluster was idling. We found those metrics a little hard to believe, and we were not sure exactly what they really represent. Also, the cluster was made up of VMs; how could we tell if the physical SSDs were indeed too slow or if virtualization was introducing a delay? We also had various possible changes to hardware and software configuration that we could make and would need a way to evaluate them. We could run
etcd in each configuration and look at its Prometheus metrics, but that takes more than a little work. We wanted a simple way to evaluate a given configuration. We also wanted to validate our understanding of the Prometheus metrics from etcd.
To do that, we had to solve two problems, though. First, what does the I/O workload generated by etcd when writing to WALs look like? Which system calls are used? What’s the size of the writes? Second, assuming one has the answer to the aforementioned questions, how do you reproduce a similar workload with fio? Remember that fio is extremely flexible, with a lot of parameters. We solved both problems with the same approach, centered on the lsof and strace commands.
lsof can be used to display all the file descriptors used by a process and the files they’re associated with.
strace can be used to examine an already-running process or to launch a process and examine it.
strace outputs all the system call invocations by the examined process—and, optionally, its children. The latter is important for processes that fork, and
etcd is such a process.
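As a rough illustration of the descriptor-to-file mapping that lsof provides, the following Python sketch reads the same information from the Linux /proc filesystem. It is a stand-in for lsof, not what we ran. The function names are hypothetical, and filtering on a `.wal` suffix assumes that is how the WAL files are named in your deployment.

```python
import os


def open_files(pid):
    """Map each file descriptor of `pid` to the path it refers to (Linux /proc only)."""
    fd_dir = f"/proc/{pid}/fd"
    result = {}
    for name in os.listdir(fd_dir):
        try:
            result[int(name)] = os.readlink(os.path.join(fd_dir, name))
        except OSError:
            continue  # the descriptor was closed while we were listing
    return result


def wal_descriptors(pid):
    """Keep only descriptors whose target looks like a WAL file (assumed .wal suffix)."""
    return {fd: path for fd, path in open_files(pid).items() if path.endswith(".wal")}
```

Knowing which descriptor number belongs to the WAL file is what lets you pick the relevant write and fdatasync calls out of the strace output.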
The first thing we did was to use
strace to examine the
etcd server backing Kubernetes while the cluster was idling. This showed us that the WAL write sizes were very tightly grouped, almost all in the range 2200–2400 bytes. That’s why in the command at the top of this post there’s the flag
--bs=2300 (bs is the size in bytes of each write
fio makes). Notice that the size of the
etcd writes might vary depending on the etcd version, deployment, parameter values, etc., and it affects
fdatasync duration. If you have a similar use case, you should analyze your own
etcd processes with
strace to get meaningful numbers.
Next, to get a clear and comprehensive view of the filesystem activities of
etcd, we launched it under strace with the -ffttT flags, which tell strace to follow forked processes, write each one’s output to a separate file, and report in detail the start and elapsed time of each system call. We also used
lsof to confirm our parsing of the
strace output regarding which file descriptor was being used for which purpose. These led to the sort of
strace output shown above. Doing statistics on the sync times confirmed that the wal_fsync_duration_seconds metric from
etcd corresponds with the
fdatasync calls with the WAL file descriptors.
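The kind of statistics described above can be gathered with a small parser. The Python sketch below assumes strace lines in the usual -T layout, where the elapsed time appears in angle brackets at the end of each line, and it assumes you already know the WAL file descriptor. The regular expressions and the function name are illustrative, not the scripts we actually used.

```python
import re

# write(fd, "...", n) = n <elapsed>; the \b keeps pwrite64 and similar calls out.
WRITE_RE = re.compile(r"\bwrite\((\d+),.*\)\s+=\s+(\d+)\s+<([\d.]+)>")
# fdatasync(fd) = 0 <elapsed>
SYNC_RE = re.compile(r"\bfdatasync\((\d+)\)\s+=\s+0\s+<([\d.]+)>")


def parse_wal_activity(lines, wal_fd):
    """Return (write sizes in bytes, fdatasync durations in seconds) for wal_fd."""
    write_sizes = []
    sync_seconds = []
    for line in lines:
        match = WRITE_RE.search(line)
        if match and int(match.group(1)) == wal_fd:
            write_sizes.append(int(match.group(2)))
            continue
        match = SYNC_RE.search(line)
        if match and int(match.group(1)) == wal_fd:
            sync_seconds.append(float(match.group(2)))
    return write_sizes, sync_seconds
```

The write sizes suggest a value for --bs, and the fdatasync durations can be compared with what wal_fsync_duration_seconds reports.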
To get fio to generate a workload similar to etcd’s, we read the fio docs and developed parameters to serve our purpose. We confirmed the system calls and their timing by running fio under strace, as we did for etcd.
We took care when setting the value of the
--size parameter, which represents the total I/O
fio generates. In our case, that’s the total number of bytes written to storage, which is directly proportional to the number of write (and fdatasync) system call invocations. For a given bs, the number of fdatasync invocations is size/bs. Since we were interested in a percentile, we wanted the number of samples to be high enough to be statistically relevant, and we found 10^4 (which corresponds to a size of about 22 MiB) to serve our purpose. Lower values of --size have more pronounced noise (e.g., a few fdatasync invocations that take way longer than usual and affect the 99th percentile).
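The size/bs arithmetic can be checked with a few lines of Python. This sketch assumes that a size of 22m means 22 MiB and ignores any final partial block.

```python
MIB = 1024 * 1024


def sample_count(size_bytes, bs_bytes):
    """Number of full-sized writes (and therefore fdatasync calls) for a given size and bs."""
    return size_bytes // bs_bytes


n = sample_count(22 * MIB, 2300)
print(n)         # 10029 samples, i.e. roughly 10^4
print(n // 100)  # about 100 samples lie above the 99th percentile
```

With only about a hundred samples above the 99th percentile, a much smaller --size would leave the percentile resting on a handful of outliers.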
Try it out
We have shown how to use
fio to evaluate whether your intended storage for
etcd is fast enough to support good
etcd performance. Go forth and test! For example, you can get VMs with SSD storage on the IBM Cloud.