Untangling memory access measurements - memory latency
Untangling memory access measurements - lat_mem_rd
By: Jenifer Hopper
In this article, we discuss an example benchmark called lat_mem_rd, from the lmbench3 suite. lat_mem_rd is often used in the community to measure memory read latency across a variety of systems. This article explores how this benchmark can be used to measure performance of the entire memory hierarchy, from caches to local and remote memory. This article takes a practical approach to measuring memory latency performance on Power Systems™, from understanding initial results to moving on to examining the effects of more in-depth system settings and tunings.
Memory access performance is an important factor of overall system performance. When we think about this critical aspect, many questions arise about the terms often used to describe the performance and the methodologies available to actually measure the performance. Sometimes memory latency is described using theoretical values to represent the fastest latency the hardware buses, DIMMs, memory controllers, etc. can possibly achieve when delivering the data over the wire. We know that this theoretical hardware performance value does not always reflect actual application performance due to many factors, including caching effects, data locality, and instruction sequences, among other things.
In this article we are not looking at these theoretical values. Instead, we are using a real application to measure a certain memory access pattern, focusing on out-of-the-box performance. Then, we are moving on to some additional system attributes and tuning options that have varying effects on the memory read latency performance.
The lmbench suite (http://www.bitmover.com/lmbench/) is a set of portable microbenchmarks that measure different types of bandwidth and latency. This article focuses on the lat_mem_rd test from lmbench3, which can be downloaded here: http://www.bitmover.com/lmbench/get_lmbench.html.
The lat_mem_rd test (http://www.bitmover.com/lmbench/lat_mem_rd.8.html) takes two arguments, an array size in MB and a stride size. The benchmark uses two loops to traverse through the array, using the stride as the increment by creating a ring of pointers that point forward one stride. The test measures memory read latency in nanoseconds for the range of memory sizes. The output consists of two columns: the first is the array size in MB (the floating point value) and the second is the load latency over all the points of the array. When the results are graphed, you can clearly see the relative latencies of the entire memory hierarchy, including the faster latency of each cache level, and the main memory latency. Note that this benchmark measures data accesses, not instruction cache accesses.
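The ring-of-pointers idea can be illustrated with a small sketch. This is not the lmbench source; it is a toy model in awk, with hypothetical names (ring_steps, size, stride, nxt), that builds a ring in which every element points one stride forward and then follows the ring once around. It assumes the size is a multiple of the stride. Each step depends on the value loaded in the previous step, which is why this kind of access pattern exposes load latency.

```shell
# Toy model of a pointer ring: element i holds the index of element i+stride,
# and the last element wraps back to 0. Names and sizes are illustrative only.
ring_steps() {
    awk -v size="$1" -v stride="$2" 'BEGIN {
        for (i = 0; i < size; i += stride)
            nxt[i] = (i + stride) % size    # point forward one stride
        p = 0; steps = 0
        do {                                # each load depends on the previous one
            p = nxt[p]
            steps++
        } while (p != 0)
        print steps                         # dependent loads in one lap of the ring
    }'
}

ring_steps 1024 128    # prints 8
```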
There are many different definitions of memory latency. Memory latency can sometimes refer to the memory chip, bus, or DIMM cycle latency. However, actual memory latency can depend on the instruction sequence. Some programs may use non-blocking loads, while others may have instructions after the load instruction that are dependent on that data, therefore increasing latency and possibly causing a stall in the pipeline. Many theoretical hardware latency numbers may not take into account these types of delays.
The mix of read (load) and write (store) operations that the application makes can have an impact on latency, because the buses that connect the processors to the memory subsystem have a certain mix of read to write operations defined in the hardware.
Another characteristic to consider is how often the data is updated, or whether the data stays "clean", meaning the data is unmodified. If the data is modified, a store operation is issued so that the modification is written back to storage, ensuring that all future copies contain the updated data.
The lat_mem_rd benchmark focuses on "pure" memory data read latency. This means that in some cases you could see a latency of "0". The benchmark assumes that all processors can complete a load instruction in one clock cycle. Therefore, a latency of "0" would mean that the read took only one clock cycle, and the nanosecond latency value reported is based on how long the load takes after the first clock cycle. In other words, the latency reported does not include the time it took to execute the instructions; it focuses purely on the memory read latency. The benchmark measures "clean read" latency, meaning that it is unlikely that any store operations were issued, so the data is unmodified and the latency does not include any write-back costs.
To build lmbench3 for these experiments, we downloaded the source, extracted the files from the tar archive, and changed into the lmbench3 directory using cd. The lmbench3 README explains that you can run the full suite of tests with the make results command. However, we chose to simply run make from the lmbench3/src directory to build the benchmarks, because we wanted to run lat_mem_rd separately. The executable files are created in the lmbench3/bin/powerpc64-linux-gnu directory on IBM Power systems.
Tip: There is a bug in the lmbench3 code (as noted in this kerneltrap.org thread for example: http://kerneltrap.org/node/14792). We encountered this error when running the make command:
gmake: *** No rule to make target `../SCCS/s.ChangeSet', needed by `bk.ver'. Stop.
gmake: Leaving directory '<dir>/lmbench3/src'
make: *** [lmbench] Error 2
We used the touch command to create the missing file to get the make to complete smoothly. From the lmbench3 directory, we entered the following:
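Assuming the SCCS directory does not already exist, a sketch of the workaround (the path is taken from the error message, which is relative to lmbench3/src) is:

```shell
# Run from the top-level lmbench3 directory.
# Create the empty file that the bk.ver rule in the Makefile expects.
mkdir -p SCCS
touch SCCS/s.ChangeSet
```

After that, rerunning make from lmbench3/src should get past the bk.ver rule.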
Note that the default Makefile options were used for this article. Adding flags such as compiler optimization levels can cause some problems with the benchmark's accounting code, which can lead to inaccurate measurements.
We will discuss the details of the underlying memory hierarchy on POWER7+ systems so that you have an understanding of the system layout to compare to lat_mem_rd results in the next section.
Caches, Cache Lines, and Main Memory
Modern processors are capable of executing multiple instructions in one clock cycle. In order to achieve these execution speeds, processors can't sit around waiting on main memory. This is where caches play a critical role by temporarily holding data, providing much faster access times.
POWER7+ systems have multiple levels of cache:
- the L1 cache provides a split 32 KB data cache and 32 KB instruction cache
- the L2 cache size is 256 KB
- the L3 cache size is 10 MB
There is a private L1, L2, and L3 cache available to each processor (core) on a POWER7 chip. Note that the on-chip cache latencies are tied closely to the processor frequency (clock speed), since that determines how fast the processor can access the cache. The experiments in this article were run on a POWER7+ system with a processor frequency of 3.4 GHz.
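Because on-chip latencies scale with the clock, it can help to convert between cycles and nanoseconds when reading lat_mem_rd output. The sketch below does that arithmetic for the 3.4 GHz frequency used here. The helper name cycles_to_ns is hypothetical, and the cycle counts passed in are just inputs, not measured values.

```shell
# ns = cycles / frequency_in_GHz   (at 1 GHz, one cycle takes one nanosecond)
cycles_to_ns() {
    awk -v cycles="$1" -v ghz="$2" 'BEGIN { printf "%.3f\n", cycles / ghz }'
}

cycles_to_ns 1 3.4    # one cycle at 3.4 GHz: prints 0.294
```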
When a program needs to access a byte of data, the processor first looks in the L1 cache to see if it can quickly access the data at processor frequency speeds. If the data is not in the L1, it moves on to the L2, where the data can be available in a few cycles. If the data is again not found, the next step is looking in the L3, where data can be available in tens of cycles. If the data is not found in any of the cache levels, a block of memory that contains the data is read into the cache instead of a single byte. This block of memory, called the cache line, is the smallest amount of memory that a cache can access at one time, and it fills a single "line" in the cache. This structure increases efficiency since the next byte of data needed can often be found in the same cache line. POWER7 cache lines are 128 bytes in size and are aligned on a 128-byte boundary.
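As a rough sense of scale, the cache sizes and the 128-byte line size determine how many lines each level can hold. The sketch below works that out, assuming the sizes quoted above are binary units (1 KB = 1024 bytes); the helper name lines_in_cache is hypothetical.

```shell
# Number of 128-byte cache lines that fit in a cache of the given size in KB.
lines_in_cache() {
    echo $(( $1 * 1024 / 128 ))
}

lines_in_cache 32       # L1 data cache (32 KB): prints 256
lines_in_cache 256      # L2 (256 KB): prints 2048
lines_in_cache 10240    # L3 (10 MB): prints 81920
```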
POWER7 systems also have a concept called "L3 lateral cast-out", which means that if the data is not found in that core's L3, it can look in another core's L3 for the data. When there are inactive cores on the system, this essentially provides an extra level of cache. The inactive cores would not be placing new data in the cache, and therefore data is likely to "hang around" longer than normal in these extra L3 caches.
If the data is not found in any of the cache levels, main memory must be accessed, which is an order of magnitude slower. This means going off-chip, through the memory controller to the memory buffer chip and memory DIMMs to find the data. One important performance aspect of main memory is DIMM placement and density. Having less than fully populated DIMM slots may limit the resources required to access memory at top speeds, thereby causing performance impacts, depending on the application memory access pattern. Once the data is found in main memory, a 128-byte aligned, 128-byte block of main storage that contains the data is pulled into the processor's cache for quick access.
So now we know how a cache line gets placed into a cache. But how do the caches manage all these cache lines? They are organized as an array, with one dimension containing the cache line "aging" info and another dimension containing an index that helps determine the actual address. As each cache level fills up with cache lines, the maximum space is eventually reached and the oldest cache lines are "cast-out" to the higher level of cache, or eventually evicted altogether.
Note that store-back cache operations, where data in the cache is changed and must be written back to main memory, and the POWER cache coherency methodology are beyond the scope of this article, but are also important factors in overall cache operation and performance.
When thinking about caches, one should consider how SMT (simultaneous multi-threading) pressure affects cache usage. On POWER7 systems, there can be up to four threads per core. If all threads are actively accessing the cache, you can think of each thread as having about 1/4 of the cache, since the cache is loading and storing data for every thread. On the other hand, having additional SMT threads online can help performance when a processor is waiting on a cache miss (meaning the data must be fetched from main memory), since the other threads can continue executing instructions during this time if they have work ready.
Data prefetching is an important topic when thinking about memory read latency. Power systems use hardware-based memory prefetching to improve performance by reducing the impact of cache miss latency. The hardware data prefetch engine can recognize a sequential data access pattern, as well as some non-sequential patterns and will prefetch data cache lines from the L2, L3, and main memory into the L1 data cache for quick access.
The Data Stream Control Register (DSCR) controls the aggressiveness of the prefetching for loads and stores. Aggressiveness in this context means how much the hardware prefetches at once. Each bit of the DSCR controls a different aspect of the aggressiveness, including whether to enable the load and/or store stream, whether to enable the "Stride-N", which will detect streams that have a stride greater than a single cache block, and whether to set degrees of prefetch depth and urgency.
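The DSCR values used in this article can be read as a bit field. The sketch below decodes a value under the assumption that the low three bits hold the prefetch depth, the bit with value 8 enables the store stream, and the bit with value 16 enables stride-N detection. This matches the settings named in this article (1 = none, 4 = medium, 7 = deepest, 15 = deepest plus store stream, 16 = stride-N); the names for depths 2 and 3 follow the Power ISA field description, and other bits or processor generations are not covered. The function name decode_dscr is hypothetical.

```shell
# Decode the prefetch-related bits of a DSCR value (illustrative, not exhaustive).
decode_dscr() {
    val=$1
    case $(( val & 7 )) in          # low three bits: prefetch depth
        0) depth="default" ;;
        1) depth="none" ;;
        2) depth="shallowest" ;;
        3) depth="shallow" ;;
        4) depth="medium" ;;
        5) depth="deep" ;;
        6) depth="deeper" ;;
        7) depth="deepest" ;;
    esac
    out="depth=$depth"
    if [ $(( val & 8 )) -ne 0 ]; then    # store stream enable
        out="$out store-stream"
    fi
    if [ $(( val & 16 )) -ne 0 ]; then   # stride-N stream enable
        out="$out stride-N"
    fi
    echo "$out"
}

decode_dscr 15    # prints: depth=deepest store-stream
decode_dscr 16    # prints: depth=default stride-N
```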
First, let's examine some "out of the box" lat_mem_rd results on a POWER7+ system and see how the latency ties directly to the underlying memory hierarchy. Then, we will walk through some performance experiments to see the effects of different tuning parameters and hardware effects of the system.
In order to test all levels of memory, you should pick an array size large enough so that it will not fit in cache. We are using a 2 GB array size for the runs in this article. We will start with a single thread run and move to multi-thread results later. For this first run, we are using a stride of 128, which matches the POWER7+ cache line size. To minimize any slight impacts from the scheduler and/or context switches for the tests in this article, SMT was set to "off" using the command ppc64_cpu --smt=off and we used taskset to pin the test to a single CPU. For example:
taskset -c 4 ./lat_mem_rd 2000 128
As you can see from the following chart, the three POWER7+ cache levels are represented by the plateaus created from the lat_mem_rd latency results and closely match each cache size. The gradual ramp up once the array is too large to fit in the local L3 is due to effects of the L3 lateral cast-out. We will look at this in more detail later.
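To read individual plateau values from the raw output rather than from a chart, a small filter can help. The sketch below assumes the output consists of lines with two columns (array size in MB, then latency in ns) in increasing size order, and that any other lines, such as a header, should be skipped. The function name latency_at is hypothetical.

```shell
# Print the latency reported for the largest array size (in MB) that does
# not exceed the limit given as $1. Reads benchmark output on stdin.
latency_at() {
    awk -v limit="$1" '
        NF == 2 && $1 ~ /^[0-9]+(\.[0-9]+)?$/ && $1 + 0 <= limit + 0 { lat = $2 }
        END { if (lat != "") print lat }'
}

# Example (requires the benchmark binary; the results are written to stderr):
#   taskset -c 4 ./lat_mem_rd 2000 128 2>&1 | latency_at 0.03125
```

Picking limits just below 32 KB, 256 KB, and 10 MB would sample one point from each cache plateau.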
Now, let's examine the effects of different lat_mem_rd stride values. As you can see in the following graph, having a stride smaller than the cache line size (128B on POWER7+) improves performance since there could be multiple hits per cache line. In addition, having a stride larger than the cache line size shows a performance impact since it is no longer seeing prefetching effects, which we will discuss in the next section.
As mentioned previously, data prefetching can have a big impact on memory latency performance and POWER7+ systems provide the DSCR which allows the user options to tune the aggressiveness of hardware prefetching. You can set the DSCR value using the ppc64_cpu --dscr command. For example:
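A sketch of setting the value and reading it back, assuming the ppc64_cpu utility from the powerpc-utils package and root privileges. RUN is a hypothetical dry-run switch added for this example: with RUN=echo the commands are printed rather than executed.

```shell
# Set the system-wide DSCR to the value given as $1, then display it.
set_dscr() {
    ${RUN:-} ppc64_cpu --dscr="$1"    # set the DSCR value
    ${RUN:-} ppc64_cpu --dscr         # show the current DSCR value
}

# Example: set_dscr 7    (deepest prefetch)
```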
First, let's examine various DSCR prefetch settings with stride 128. Notice that the default DSCR value on POWER7+ systems (0) is actually equivalent to the value 4, which means a medium prefetch.
As you can see in the previous chart, turning off prefetching (by setting DSCR to 1, which means "none") causes a performance degradation, which shows that stride 128 benefits from hardware data prefetching.
Now, let's narrow down to the most interesting DSCR settings so that the chart is easier to read. This time we will look at the default (0, which is equivalent to "4" or "medium"), the deep, deeper, and deepest settings, and the stride-N setting.
The previous chart shows that 7 (deepest) and 15 (deepest + store stream enable) provide the best performance. The difference between the two is likely just run variance, because there should not be many store or write operations to prefetch since lat_mem_rd measures clean reads. Enabling the "store stream" means that cache lines will get prefetched for store operations since the cache line will need to go into modified state in order to be written to. For example, think of a case where only half of the cache line needs to be modified. In that case, the rest of the cache line still needs to be pulled in for the store operation, and pulling it in ahead of time (by prefetching) may help performance.
The other interesting data point is DSCR=16, or "stride-N stream enable". We will talk more about this setting next; however, notice here that the performance is similar to the default DSCR setting for stride 128.
So while the stride value is equal to the cache line size, hardware data prefetching can help performance. What about when the stride is larger than the cache line size? Let's examine the results for a few key DSCR settings with stride 256.
As you can see in this chart, there is almost no difference between DSCR 0 (default, medium), DSCR 1 (no prefetching), and DSCR 7 (deepest prefetch). However, DSCR 16 does provide a performance improvement similar to the prefetch improvement seen with stride 128. Why is this? The "stride-N stream enable" (16) setting enables detection of streams that have a stride greater than the cache line size. This means, once this setting is enabled, access patterns using strides larger than 128 can also benefit from hardware data prefetching.
Thus far all the results in this article have naturally been using memory that is "local" to the running CPU. However, Power systems use a NUMA (Non Uniform Memory Access) architecture. Let's examine what would happen if an application running on a CPU was forced to use memory from a distant NUMA node or even memory that is interleaved throughout all the NUMA nodes, and compare the performance to that of a run using the CPU's local NUMA node.
You can view the CPU and memory NUMA node layout of your system with the numactl --hardware command. To achieve the desired CPU and memory affinity, we used the numactl affinity command. Following are some example commands.
Local NUMA node:
numactl --physcpubind=4 --membind=0 ./lat_mem_rd 2000 128
Interleaved NUMA nodes:
numactl --physcpubind=4 --interleave=all ./lat_mem_rd 2000 128
Distant NUMA node:
numactl --physcpubind=4 --membind=3 ./lat_mem_rd 2000 128
As you can see in this chart, accessing memory that is not local to the running CPU can have performance impacts. Furthermore, when the memory is interleaved, the performance can vary since the access pattern may hit the non-local nodes randomly.
Finally, let's examine the performance effects when running multiple tests in parallel, each pinned to a separate CPU. For this test, we left SMT set to "off", and pinned one test to each CPU, with up to five CPUs running in parallel. Following are some example commands.
taskset -c 4 ./lat_mem_rd 2000 128
# cat test
taskset -c 0 ./lat_mem_rd 2000 128 > mem.0 2>&1 &
taskset -c 4 ./lat_mem_rd 2000 128 > mem.4 2>&1 &
taskset -c 8 ./lat_mem_rd 2000 128 > mem.8 2>&1 &
taskset -c 12 ./lat_mem_rd 2000 128 > mem.12 2>&1 &
taskset -c 16 ./lat_mem_rd 2000 128 > mem.16 2>&1
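The same set of runs can also be written as a loop. This is a sketch rather than the script used for the results here; it assumes the same CPU numbers and adds a final wait so that the script returns only after every background run has finished.

```shell
for cpu in 0 4 8 12 16; do
    # One pinned run per CPU, each writing to its own output file.
    taskset -c "$cpu" ./lat_mem_rd 2000 128 > "mem.$cpu" 2>&1 &
done
wait    # block until all background runs have exited
```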
There are two interesting points from this chart. First, you can see the effects of the L3 "lateral cast-out" for the single thread run. The gradual slope after the 10 MB data point for the red line compared to the other lines shows the advantage of allowing the single thread to use the other cores' L3 caches while those cores are idle. Second, the impact of multiple data requests in parallel is shown by the vertical difference between the single thread line and the multi-thread runs.
In conclusion, this article explored various system characteristics and tuning options that can affect memory read latency performance. These included stride values and how they relate to the cache line size, DSCR tunings, NUMA effects, and the effects when there are multiple threads running in parallel.