Using Fio to Tell Whether Your Storage is Fast Enough for Etcd

By Matteo Olivi and Mike Spreitzer

The short story: fio and etcd

The performance of your etcd cluster depends strongly on the performance of the storage backing it. To help you understand the relevant storage performance, etcd exports some Prometheus metrics. One of them is wal_fsync_duration_seconds. etcd docs suggest that the 99th percentile of this metric should be less than 10ms for storage to be considered fast enough. If you’re thinking about running an etcd cluster on Linux machines and need to assess whether your storage (e.g., SSDs) is fast enough, one option is using fio, a popular I/O tester. To do that, you can run the following command where test-data is a directory under the mount point of the storage device you’re testing:

fio --rw=write --ioengine=sync --fdatasync=1 --directory=test-data --size=22m --bs=2300 --name=mytest

All you have to do then is look at the output and check if the 99th percentile of fdatasync durations is less than 10ms. If that is the case, then your storage is fast enough.  Here is an example output:

fsync/fdatasync/sync_file_range:
  sync (usec): min=534, max=15766, avg=1273.08, stdev=1084.70
  sync percentiles (usec):
   | 1.00th=[ 553], 5.00th=[ 578], 10.00th=[ 594], 20.00th=[ 627],
   | 30.00th=[ 709], 40.00th=[ 750], 50.00th=[ 783], 60.00th=[ 1549],
   | 70.00th=[ 1729], 80.00th=[ 1991], 90.00th=[ 2180], 95.00th=[ 2278],
   | 99.00th=[ 2376], 99.50th=[ 9634], 99.90th=[15795], 99.95th=[15795],
   | 99.99th=[15795]
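
If you would rather check the threshold programmatically than eyeball the output, here is a minimal Python sketch. It assumes you saved fio's human-readable output to a file named fio-output.txt (a made-up name) and that the sync percentiles block looks like the excerpt above; depending on the magnitudes involved, your fio version may print that block in nsec, usec, or msec, which the script accounts for.

# Hypothetical helper: compare the 99th percentile of fdatasync latency
# reported by fio (text output saved as fio-output.txt) against the 10ms
# threshold suggested by the etcd docs.
import re
import sys

UNIT_TO_MS = {"nsec": 1e-6, "usec": 1e-3, "msec": 1.0, "sec": 1000.0}

text = open("fio-output.txt").read()
# Keep only the fsync/fdatasync section so we don't pick up other latency blocks.
sync_section = text[text.find("fsync/fdatasync/sync_file_range"):]

unit_match = re.search(r"sync percentiles \((\w+)\):", sync_section)
pct_match = re.search(r"99\.00th=\[\s*(\d+)\]", sync_section)
if not unit_match or not pct_match:
    sys.exit("could not find the sync percentiles in the fio output")

p99_ms = int(pct_match.group(1)) * UNIT_TO_MS[unit_match.group(1)]
verdict = "looks fast enough" if p99_ms < 10 else "is probably too slow"
print(f"99th percentile of fdatasync: {p99_ms:.2f} ms -> storage {verdict} for etcd")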


A few notes:

  • We tuned the values for the --size and --bs parameters in the above example for our specific scenario. To get meaningful insights from fio, you should use the values that best apply to your case. To learn how to derive them, read the section below about why we used fio and how we found out how to configure it.

  • During the test, the I/O load that fio generates is the only I/O activity. In a realistic scenario, it is likely that there would be other writes to storage besides those associated with wal_fsync_duration_seconds. Such additional load can make wal_fsync_duration_seconds bigger. So, if the 99th percentile you observe with fio is only slightly below 10ms, it’s likely that your storage is not fast enough.

  • You need a fio version at least as new as 3.5 because older versions don’t report fdatasync duration percentiles.

  • The output above is only a small excerpt from the whole output of fio.

The long story: fio and etcd

A bit of background on etcd WALs

Databases commonly use write-ahead logging; etcd uses it too. Details about write-ahead logging are beyond the scope of this post, but what we need to know for our purpose is this: each etcd cluster member keeps a write-ahead log (WAL) on persistent storage. etcd writes certain operations on the key-value store (e.g., updates) to the WAL before applying them. If a member crashes and restarts between snapshots, it can locally recover transactions done since the last snapshot by looking at the content of the WAL.

So, any time a client adds a key to the key-value store or updates the value of an existing key, etcd appends an entry recording the operation to the WAL, which is a normal file on persistent storage. Before it can proceed further, etcd MUST be 100% confident that the WAL entry has actually been persisted. To achieve this on Linux, it is not enough to use the write system call, because the actual writing to physical storage might be delayed. For example, Linux might keep the written WAL entry for some time in an in-kernel memory cache (e.g., the page cache). To ensure that the data has been written to persistent storage, you have to invoke the fdatasync system call after the write; that's exactly what etcd does, as shown in the following strace output (where 8 is the file descriptor of the WAL file):

21:23:09.894875 lseek(8, 0, SEEK_CUR)   = 12808 <0.000012>
21:23:09.894911 write(8, ".\0\0\0\0\0\0\202\10\2\20\361\223\255\266\6\32$\10\0\20\10\30\26\"\34\"\r\n\3fo"..., 2296) = 2296 <0.000130>
21:23:09.895041 fdatasync(8)            = 0 <0.008314>
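
That write-then-fdatasync pattern is easy to reproduce outside etcd. Below is a minimal, purely illustrative Python sketch of a single WAL-style append (etcd itself is written in Go, and the file name and entry size here are made up): a write that may land only in the page cache, followed by the fdatasync that forces it onto the device.

# Illustrative sketch of the append-then-fdatasync pattern described above.
import os
import time

fd = os.open("wal.log", os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o600)
entry = b"\x00" * 2300        # fake WAL entry, roughly the size etcd writes

os.write(fd, entry)           # may only reach the kernel page cache...
start = time.perf_counter()
os.fdatasync(fd)              # ...this call forces it onto the device
elapsed_ms = (time.perf_counter() - start) * 1000
print(f"fdatasync took {elapsed_ms:.3f} ms")
os.close(fd)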


Unfortunately, writing to persistent storage takes time. If fdatasync takes too long, etcd performance degrades. The etcd documentation suggests that, for storage to be fast enough, the 99th percentile of the duration of fdatasync invocations when writing to the WAL file must be less than 10ms. There are other storage-related metrics that are relevant, but this one is the main focus of this post.

Using fio to assess storage

If you have some storage and want to assess whether it is suitable to back etcd, you can use fio, a very popular I/O tester. Remember that disk I/O can happen in a lot of different ways: sync vs. async, many different classes of system calls, and so on. The flip side of that flexibility is that fio is complex to use: it has a lot of parameters, and different combinations of their values yield completely different I/O workloads. To get meaningful numbers with respect to etcd, you have to make sure that the write load generated by fio is as similar as possible to the one generated by etcd when writing to WAL files.

This means that, at the very least, the load generated by fio must be a series of sequential writes to a file, where each write is made up of a write system call followed by an fdatasync system call. To get the sequential writes, provide fio with the flag --rw=write. To make sure that fio writes using the write system call (as opposed to other system calls, e.g., pwrite), use --ioengine=sync. Finally, to make sure that each write fio issues is followed by an fdatasync, use --fdatasync=1. The two other parameters in the example, --size and --bs, might vary depending on your specific scenario. Read the next section to learn how to tune them.

Why we used fio and how we found out how to configure it

This blog post stems from a real scenario we faced. We had a Kubernetes v1.13 cluster monitored with Prometheus. etcd v3.2.24 was backed by SSDs. The etcd metrics showed that fdatasync latencies were too high, even when the cluster was idling. We found those metrics a little hard to believe, and we were not sure exactly what they really represented. Also, the cluster was made up of VMs; how could we tell whether the physical SSDs were indeed too slow or whether virtualization was introducing a delay? We were also considering various possible changes to the hardware and software configuration and needed a way to evaluate them. We could have run etcd in each configuration and looked at its Prometheus metrics, but that takes more than a little work. We wanted a simpler way to evaluate a given configuration, and we wanted to validate our understanding of the Prometheus metrics from etcd.

To do that, we had to solve two problems, though. First, what does the I/O workload generated by etcd when writing to WALs look like? Which system calls are used? What’s the size of the writes? Second, assuming one has the answer to the aforementioned questions, how do you reproduce a similar workload with fio? Remember, fio is extremely flexible—there are a lot of parameters. We solved both problems with the same approach, centered on the lsof and strace commands. lsof can be used to display all the file descriptors used by a process and the files they’re associated with. strace can be used to examine an already-running process or to launch a process and examine it. strace outputs all the system call invocations by the examined process—and, optionally, its children. The latter is important for processes that fork, and etcd is such a process.

The first thing we did was use strace to examine the etcd server backing Kubernetes while the cluster was idling. This showed us that the WAL write sizes were very tightly grouped, almost all in the range 2200–2400 bytes. That's why the command at the top of this post has the flag --bs=2300 (bs is the size in bytes of each write fio makes). Notice that the size of the etcd writes might vary depending on the etcd version, deployment, parameter values, etc., and it affects fdatasync duration. If you have a similar use case, you should analyze your own etcd processes with strace to get meaningful numbers.
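
If you want to run the same analysis on your own etcd, a rough sketch of how to pull the WAL write sizes out of an strace log is shown below. It assumes you traced the etcd process with something like strace -f -tt -T -e trace=write -o etcd.strace -p <pid>, that the WAL file descriptor is 8 (check yours with lsof), and that the line format matches the excerpt shown earlier; adjust the regex if your strace prints something different.

# Rough sketch: extract the byte counts of completed write() calls on the
# WAL file descriptor from an strace log and summarize them.
import re
import statistics

WAL_FD = 8                    # find the right descriptor with lsof
sizes = []
with open("etcd.strace") as log:
    for line in log:
        m = re.search(r"write\((\d+),.*\)\s*=\s*(\d+)", line)
        if m and int(m.group(1)) == WAL_FD:
            sizes.append(int(m.group(2)))

if sizes:
    print(f"{len(sizes)} WAL writes: min={min(sizes)} "
          f"median={statistics.median(sizes)} max={max(sizes)} bytes")
else:
    print(f"no completed writes on fd {WAL_FD} found")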

Next, to get a clear and comprehensive view of the filesystem activity of etcd, we launched it under strace with the -ffttT flags, meaning: examine the forked processes, write each one's output to a separate file, and give detailed reports of the start time and elapsed time of each system call. We also used lsof to confirm our parsing of the strace output regarding which file descriptor was being used for which purpose. This led to the sort of strace output shown above. Doing statistics on the sync times confirmed that the wal_fsync_duration_seconds metric from etcd corresponds to the fdatasync calls on the WAL file descriptors.
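
The statistics step can be done with a few more lines of the same kind of parsing. Here is a sketch under the same assumptions as above (trace captured with -T so each call carries its elapsed time in angle brackets, WAL file descriptor 8, made-up file name):

# Sketch: collect the elapsed time of every fdatasync() on the WAL file
# descriptor (the <...> suffix added by strace -T, in seconds) and report
# the 99th percentile in milliseconds.
import re
import statistics

WAL_FD = 8
durations_ms = []
with open("etcd.strace") as log:
    for line in log:
        m = re.search(r"fdatasync\((\d+)\)\s*=\s*0\s*<([\d.]+)>", line)
        if m and int(m.group(1)) == WAL_FD:
            durations_ms.append(float(m.group(2)) * 1000)

if durations_ms:
    p99 = statistics.quantiles(durations_ms, n=100)[98]
    print(f"{len(durations_ms)} fdatasync calls, 99th percentile = {p99:.2f} ms")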

To get fio to generate a workload similar to etcd's, we read the fio docs and worked out parameter values that serve our purpose. We confirmed the system calls and their timing by running fio under strace, as we did for etcd.

We took care when setting the value of the --size parameter, which represents the total amount of I/O fio generates. In our case, that is the total number of bytes written to storage, which is directly proportional to the number of write (and fdatasync) system call invocations. For a given bs, the number of fdatasync invocations is size/bs. Since we were interested in a percentile, we wanted the number of samples to be high enough to be statistically relevant, and we found that 10^4 samples (which, with --bs=2300, works out to a --size of about 22 MiB) served our purpose. With lower values of --size, the noise is more pronounced: a few fdatasync invocations that take much longer than usual can disproportionately affect the 99th percentile.
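
For concreteness, here is the back-of-the-envelope arithmetic behind those numbers, written out as a tiny Python snippet:

# --size follows from the number of fdatasync samples you want and the write size.
bs = 2300                     # bytes per write, from the strace analysis above
samples = 10_000              # enough fdatasync calls for a stable 99th percentile
print(f"--size should be about {bs * samples / 2**20:.1f} MiB")   # ~21.9 MiB -> 22m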

Try it out

We have shown how to use fio to evaluate whether your intended storage for etcd is fast enough to support good etcd performance. Go forth and test! For example, you can get VMs with SSD storage on the IBM Cloud.
