Implementing a Scalable Parallel Reduction in Unified Parallel C
NancyWang 27000303HG Visits (6991)
A reduction is the process of combining elements of a vector (or array) to yield a single aggregate element. It is commonly used in scientific computations. For instance the inner product of two n-dimensional vectors x, y is given by:
This computation requires n multiplications and n-1 additions. The n multiplications are independent from each other, therefore could be executed concurrently. Once the additive terms have been computed they can be summed together to yield the final result.
Given the importance of reduction operations in scientific algorithms, many parallel languages provide support for user defined reductions. For example, OpenMP provides a user reduction clause to be used in the OMP parallel construct. In the following example the reduction clause indicates that "sum" is a reduction variable:
In the code snippet above each thread performs a portion of the additions that make up the final sum. At the end of the parallel loop, the partial sums are combined into the final result.In this article we will explore different implementations of a global sum reduction in Unified Parallel C, in an attempt to find an efficient and scalable implementation.
Unified Parallel C (UPC) is an extension to the C programming language that allows users to express parallelism in their code. The UPC language subscribes to the Partition Global Address Space programming model. Partitioned Global Address Space (PGAS) languages such as UPC are increasingly seen as a convenient way to enhance programmer productivity for High-Performance Computing (HPC) applications on large-scale machines.
A UPC program is executed by a fixed number of threads (THREADS) which are created at program startup. In general a program statement might be executed concurrently by all threads. To facilitate the distribution and coordination of work among threads UPC provides a rich set of language features. A tutorial on Unified Parallel C is available at http
Let's consider how a naive sum reduction could be written in Unified Parallel C:
At lines 5 - 6 we have declared a shared array "A" and a shared variable "sum". At line 10 - 12 we initialize all elements of A to 1. At line 15 - 17 we have attempted to perform the reduction by accumulating the sum of A's elements values in shared variable "sum". Is this program correct?
A possible program output is: