Does Intel® Optane™ Persistent Memory Improve etcd Performance or Scalability?

Looking at Intel® Optane™ Persistent Memory and optimizing etcd to enable Kubernetes cluster performance to scale well for larger cluster sizes.

As described in a previous blog post, the performance of etcd is strongly dependent on the performance of your backing storage. Given this, it is reasonable to ask the question: “If fast storage is required for etcd to have good performance, will even faster storage further improve etcd performance or enable etcd to scale to higher loads (i.e., improve scalability)?”

Enter Intel® Optane™ Persistent Memory, a new persistent memory technology that sits on the memory bus—resulting in higher bandwidth than traditional SSD or NVMe-based storage—and comes with much lower latency than traditional storage. As in the previous post, we are mainly interested in optimizing etcd to enable Kubernetes cluster performance to scale well for larger cluster sizes.

How should we test etcd scalability?

First, we need to decide how to test etcd performance. The write (or put) test of the etcd benchmark tool is the obvious place to begin, because read and watch commands are served from memory. The etcd performance docs provide variations for the read and write tests. The table below from the etcd docs is for the write (or put) test. We can vary the key and value sizes, the number of clients and connections per client, and whether the clients send write requests to just the leader or to both the leader and its followers. It is also possible to have the requests ordered sequentially (--sequential-keys) or randomly (default).
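
The parameter sweep described above can be sketched as a small script that generates one `benchmark put` invocation per combination. The flag names follow the etcd benchmark tool's documentation; the endpoint and `--total` value here are placeholders, not the values from our runs.

```python
# Sketch of the benchmark test matrix: vary key/value sizes, client counts,
# and connections per client, producing one `benchmark put` command each.
from itertools import product

def benchmark_commands(key_sizes, val_sizes, clients, conns,
                       endpoint="http://127.0.0.1:2379", total=100000):
    """Yield one `benchmark put` command line per parameter combination."""
    for ks, vs, cl, co in product(key_sizes, val_sizes, clients, conns):
        yield (f"benchmark --endpoints={endpoint} --clients={cl} --conns={co} "
               f"put --key-size={ks} --val-size={vs} --total={total} "
               f"--sequential-keys")

cmds = list(benchmark_commands([8, 256], [256, 1024, 2048], [1000], [100]))
```

Dropping `--sequential-keys` switches the tool back to its default of randomly ordered keys, and adding `--target-leader` directs all writes at the leader only.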

Since we are trying to improve scaling of Kubernetes clusters, the next question is—what does the request traffic to etcd look like from a Kubernetes cluster? Generally, Kubernetes uses etcd to store configuration information, state, and metadata. The size of the key and value (in bytes) impacts the performance of the etcd cluster.

We started with a single node etcd server running with an SSD backing store. We ran a scaling test by changing the size of the key and value, using 1,000 clients that each had 100 connections. The charts below show that the performance of etcd is sensitive to the size of the values stored, but not as much to the size of the keys. When the value size is 2KiB, the throughput is nearly 30% lower, and the latency is nearly 40% higher than with 256B values. However, even 1KiB values still produce reasonable throughput and latency.

Scaling clients on a server with Intel® Optane™ Persistent Memory

We had access to a single 2-socket Cascade Lake system with PMem, and the system contained 1.5 TB (128 GB per DIMM) of Intel® Optane™ Persistent Memory distributed evenly among the 12 memory channels. We configured the system for app-direct mode and mounted a folder in the pmem0 device: sudo mount -t ext4 /dev/pmem0 /path/to/pmemDir/. The pmem0 device is accessible to cores on both sockets but is directly attached to only one of the sockets. So, we had to test to make sure we were running the etcd server on the socket with the lowest latency and highest bandwidth to the pmem0 device (otherwise we might be limited by the UPI bus). We used fio and the taskset utility for this testing.
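
The socket-locality check can be sketched as follows: pin a short fio write test to each socket's cores in turn and compare bandwidth. The core ranges, mount path, and fio job parameters here are illustrative assumptions, not the exact job file we used.

```python
# Sketch: run the same fio sync-write job pinned to each socket's cores
# (via taskset) and compare bandwidth to find the socket local to pmem0.
import subprocess

def fio_cmd(cpu_list, directory="/path/to/pmemDir"):
    """Build a taskset+fio command for a 4KiB sync-write test on `cpu_list`."""
    return ["taskset", "-c", cpu_list,
            "fio", "--name=pmem-locality", f"--directory={directory}",
            "--rw=write", "--bs=4k", "--size=256m", "--fdatasync=1",
            "--runtime=30", "--time_based", "--output-format=json"]

# Socket 0 holds cores 0-27 and socket 1 holds cores 28-55 on this
# 2x28-core system (an assumption about the core numbering):
for cpus in ("0-27", "28-55"):
    cmd = fio_cmd(cpus)
    # subprocess.run(cmd, check=True)  # commented out: needs fio + PMem
```

`--fdatasync=1` makes fio sync after every write, which mirrors etcd's WAL behavior more closely than a pure streaming write would.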

We had to make do with just one etcd node (as opposed to a cluster) because we only had access to one system with PMem. The system had 56 physical cores spread across the two sockets (28 each), far more resources (both memory and cores) than are typically dedicated to an etcd cluster node. We attempted to compensate by using the Go environment variable GOMAXPROCS to limit the number of CPUs used by the etcd process. We also used a version of etcd modified to use the PMDK (Persistent Memory Development Kit) for the WAL file, which should have let us take advantage of the performance benefits of PMem.

Using our Cascade Lake system and limiting the etcd server to 16 cores, we ran a test scaling the number of clients from 100 to 180,000. We also placed the backing store in different locations, either in the pmem0 directory mounted as above, in an Optane NVMe drive, or on a traditional flash-based SSD. We also used two values for the number of connections per client—5 and 50.

The results shown above demonstrate that the current version of etcd (derived from the master branch on GitHub around release 3.3) is not easily able to take advantage of the bandwidth and latency improvements available from PMem. It is likely that other bottlenecks in the software prevent this new hardware from really shining, possibly in the boltdb backend or in the WAL file management code.

Is resource utilization preventing etcd from scaling on PMem?

We started a small investigation into this by examining the resource utilization as the client load scales. The graph below shows how the CPU utilization of the etcd server (again, one node instead of a cluster) changed as the load increased. We used pidstat -C etcd -dul 5 to report the etcd CPU utilization and disk IO utilization. The graph shows that the CPU utilization was always below 1200%, showing that more than 16 CPUs didn’t help, and that it declined as the load increased. This indicates that some non-CPU bottleneck was slowing things down. The breakdown from pidstat also showed that more than 90% of the total CPU utilization was taken up in user code when the load was above 8,000 clients.
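
Aggregating the pidstat samples can be sketched as below. pidstat's column layout varies across sysstat versions, so this sketch locates columns by the header line rather than by fixed position; the sample output shown is fabricated for illustration.

```python
# Sketch: pull %usr, %system, and %CPU for the etcd process out of
# `pidstat -C etcd -u` output, locating columns via the header line.
SAMPLE = """\
Linux 5.4.0 (host)  01/01/2020  _x86_64_  (56 CPU)

10:00:01      UID       PID    %usr %system  %guest    %CPU   CPU  Command
10:00:06     1000      4242  812.40   71.20    0.00  883.60     3  etcd
"""

def etcd_cpu(report):
    """Return (%usr, %system, %CPU) for the etcd row of a pidstat report."""
    header = row = None
    for line in report.splitlines():
        cols = line.split()
        if "%CPU" in cols:
            header = cols
        elif header and cols and cols[-1] == "etcd":
            row = cols
    usr = float(row[header.index("%usr")])
    system = float(row[header.index("%system")])
    total = float(row[header.index("%CPU")])
    return usr, system, total

usr, system, total = etcd_cpu(SAMPLE)
user_share = usr / total  # fraction of CPU time spent in user code
```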

We first assumed that some sort of disk IO was the culprit; but again, the pidstat disk IO results did not support that. The graph below shows how disk IO scaled with the client load.

Finally, we also logged the network traffic from the client machine. The total network traffic was gathered directly from /proc/net/dev, and the Python call time.time() was used to measure the elapsed time and derive the throughput. The graph below shows the total data traffic (both send and receive) from the client machine. Again, the throughput decreased as the load increased.
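
The measurement amounts to sampling /proc/net/dev before and after a run and dividing the byte deltas by the elapsed time. A minimal sketch (the interface name is an assumption):

```python
# Sketch: derive total network throughput from two /proc/net/dev samples.
import time

def read_iface_bytes(proc_net_dev, iface="eth0"):
    """Return (rx_bytes, tx_bytes) for `iface` from /proc/net/dev text."""
    for line in proc_net_dev.splitlines():
        name, _, rest = line.partition(":")
        if name.strip() == iface:
            fields = rest.split()
            # field 0 is receive bytes, field 8 is transmit bytes
            return int(fields[0]), int(fields[8])
    raise ValueError(f"interface {iface} not found")

def throughput_bps(before, after, t0, t1):
    """Total (send + receive) bytes per second between two samples."""
    rx0, tx0 = before
    rx1, tx1 = after
    return ((rx1 - rx0) + (tx1 - tx0)) / (t1 - t0)

# Usage on a live system:
#   t0 = time.time(); s0 = read_iface_bytes(open("/proc/net/dev").read())
#   ... run the benchmark ...
#   t1 = time.time(); s1 = read_iface_bytes(open("/proc/net/dev").read())
#   print(throughput_bps(s0, s1, t0, t1))
```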

We next tried to verify that the disks were performing as expected according to etcd. We obtained the write-ahead-log average fsync duration (etcd_disk_wal_fsync_duration_seconds_sum divided by etcd_disk_wal_fsync_duration_seconds_count) as load scales for the different storage locations, as reported in the Prometheus metrics obtained by curl -L http://127.0.0.1:2379/metrics.
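
The average follows directly from the Prometheus counters: the `_sum` series divided by the `_count` series. A minimal sketch (in practice the text would come from the curl command above; here a fabricated two-line sample stands in):

```python
# Sketch: compute the average of a Prometheus histogram from its text
# exposition, as the _sum counter divided by the _count counter.
def histogram_avg(metrics_text, name):
    """Average observed value of histogram `name`: _sum / _count."""
    total = count = None
    for line in metrics_text.splitlines():
        if line.startswith("#"):
            continue  # skip HELP/TYPE comment lines
        if line.startswith(name + "_sum"):
            total = float(line.split()[-1])
        elif line.startswith(name + "_count"):
            count = float(line.split()[-1])
    return total / count

METRICS = """\
# TYPE etcd_disk_wal_fsync_duration_seconds histogram
etcd_disk_wal_fsync_duration_seconds_sum 0.42
etcd_disk_wal_fsync_duration_seconds_count 1000
"""
avg_ms = histogram_avg(METRICS, "etcd_disk_wal_fsync_duration_seconds") * 1000
```

The same division works for any of etcd's histogram metrics, including the peer round-trip-time metric used later in this post.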

The results showed that the various storage locations were performing as expected, with SSD as the slowest, NVMe much faster, and PMem faster still. However, even the SSD showed fsync durations below 0.5 ms, even at the heaviest load (etcd documentation suggests 99th-percentile fsync duration should be below 10 ms). This means that all storage options for this server, including the SSD, are more than enough for etcd.

Could other factors prevent etcd from scaling with PMem?

We’ve established that single-node performance is bottlenecking on something within the software stack, even when disk performance is very good. However, we are mostly interested in the performance of clusters for their reliability. Clusters are not only limited by single-node performance, but also by the time necessary to obtain consensus. Even if we make single nodes very fast, the cluster performance may still be bottlenecked by the network connecting the nodes.

We set up a cluster in the IBM cloud for a look at cluster performance. We gathered the average peer-to-peer round trip time (RTT)—(etcd_network_peer_round_trip_time_seconds_sum divided by etcd_network_peer_round_trip_time_seconds_count), again from the Prometheus metrics—for a run with the backing store on ramdisk, which should have eliminated any disk IO issues. The data showed that at lower loads (below 8,000 clients), the RTT is below 0.5 ms—about the same order of magnitude as the SSD WAL fsync latency. At a higher load, the RTT quickly rose above 1 ms.

Conclusions

So, in the end, faster storage didn't make single nodes perform any better, and even if it had, network latency would likely have limited cluster performance anyway. A good SSD is enough to get good performance out of your etcd cluster, and NVMe is probably reasonable if you want to be sure that storage is not a bottleneck. We'll save our Intel® Optane™ Persistent Memory for other applications that benefit from its higher bandwidth and lower latency.

Acknowledgements

The authors would like to acknowledge valuable feedback and help from Surya Duggirala from IBM and Jantz Tran and Raghu Moorthy from Intel.
