Implementing a Scalable Parallel Reduction in Unified Parallel C (part 3)
NancyWang 27000303HG Visits (8035)
continue from the second parallel reduction blog.
To get better scalability (increased program performance as the number of threads increases), it is critical to remove the lock in the upc_forall loop. This can be done by accumulating the partial sum computed by each thread into a thread-local variable. A thread-local variable is allocated in the private memory space of each thread, thus there are THREADS “instances” of the variable. Each instance of the thread-local variable can be used to accumulate the sum of the array elements having affinity to each thread:
In the code fragment shown above the thread-local variable “partialsum” is used to store the sum of the array elements having affinity to the executing thread (MYTHREAD). For example THREAD 0 will add array elements A, A[THREADS], A[2*THREADS], etc… in its instance of “partialsum”. In order to compute the final result it is necessary to add the “partialsum” contributions from each thread. To avoid a race condition (write-after-write hazard on variable “sum”), we use the UPC lock functions to serialize the accesses on “sum”.
The performance result illustrated by Figure 2 demonstrates that the program is now “scalable”. That is the time taken to compute the reduction diminishes as the number of threads used to execute the program increases. The reason for this improvement is simple: the lock is now acquired THREADS times in total instead of being acquired by each thread in every loop iteration.
In this article, we illustrate the concept of a reduction operation and explained how to implement a parallel reduction in Unified Parallel C. We have shown how the UPC lock primitives can be used to guarantee program correctness. We have then compared the performance of two distinct correct reduction implementations, one using a lock inside a upc_forall loop; the other using thread-local variables to accumulate partial results on each thread. The performance measurements obtained clearly indicates that locks should be used judiciously (if at all) inside loops.
To get the complete version of the document, please go to http