Assist Threads Part 1
stan_kvasov
Current microprocessor architectures pose another challenge: memory latencies are rising relative to processor clock speeds. As a result, a core pays a large performance penalty on a cache miss, when the processor has to fetch data from main memory. Fetching data from main memory can cause a stall, in which the core waits for the required data to be retrieved.
In the XLC v11 and XLF v13 versions of its compilers for AIX, IBM introduced a new feature that addresses both of these performance issues: assist threads. The IBM compiler can generate software prefetch threads that run on an idle core or an SMT context and prefetch data into the shared cache of the processor. This has two advantages for sequential programs. First, the assist thread runs on an idle hardware context and hence does not slow down the main thread. Second, the data prefetched by the assist thread is reused by the main application thread. Because the data is prefetched into a shared cache, the latency of many memory accesses drops, which effectively speeds up the application. Assist threads have been used to get up to a 2x performance improvement for applications that have irregular memory access patterns.
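To make the idea concrete, here is a minimal hand-written sketch of the same technique in C. It is not what the IBM compiler generates. It only illustrates a helper thread that runs the address computation of a loop and prefetches the data the main thread is about to read. The names gather_job, prefetch_helper and gather_sum are hypothetical, and __builtin_prefetch is a GCC/Clang builtin assumed here for illustration.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Work shared between the main thread and the hand-written helper thread.
   All names in this sketch are hypothetical. */
typedef struct {
    const double *data;   /* large array read in an irregular order */
    const size_t *index;  /* index[i] selects the element read in iteration i */
    size_t n;             /* number of loop iterations */
} gather_job;

/* Helper thread: walks the same addresses as the main loop but computes
   nothing. It only issues prefetch hints for the data. */
static void *prefetch_helper(void *arg)
{
    const gather_job *job = arg;
    for (size_t i = 0; i < job->n; i++) {
        /* GCC/Clang builtin (an assumption of this sketch); it is a hint
           and never changes program results. */
        __builtin_prefetch(&job->data[job->index[i]], 0, 1);
    }
    return NULL;
}

/* Main computation. The load data[index[i]] is the one most likely to
   miss in the cache, because its address is unpredictable. */
double gather_sum(const double *data, const size_t *index, size_t n)
{
    assert(n == 0 || (data != NULL && index != NULL));

    gather_job job = { data, index, n };
    pthread_t helper;
    /* If the helper cannot be started, the loop below still runs correctly. */
    int started = (pthread_create(&helper, NULL, prefetch_helper, &job) == 0);

    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += data[index[i]];

    if (started)
        pthread_join(helper, NULL);
    return sum;
}
```

Calling gather_sum(data, index, n) returns the same result with or without the helper; only the timing can differ. This sketch makes no attempt to keep the helper a bounded distance ahead of the main loop or to place it on a hardware context that shares a cache with the main thread, and any real benefit depends on both.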
To generate an assist thread, the compiler has to know which memory accesses miss in the cache and stall the core. These memory accesses are called delinquent loads. Delinquent loads can be identified either by profiling an application with tools like pmcount, or automatically by the compiler’s prof
Join us in the next installment to go through a working example of using assist threads.