With tag: reduction

## OpenMP 4.0 about to be released and IWOMP 2013
So much has been happening in OpenMP since SC12 that I hope to capture it all in this post, written while flying to the ADC C++ meeting, where I will talk about C++14, ISOCPP.org, and Transactional Memory. First, the research arm of OpenMP is IWOMP, its annual research conference. You probably know by now that IWOMP 2013 will be held in Canberra, Australia, outside its usual June summer time frame. This means there is still time (until May 10) to submit new proposals. So if you have OpenMP research that deserves exposure, please submit a...
Tags: reduction atomics 2003 iwomp tasks tools 4.0 defined error fortran model user simd 2013 accelerators openmp

## Implementing a Scalable Parallel Reduction in Unified Parallel C
A reduction is the process of combining the elements of a vector (or array) to yield a single aggregate element. It is commonly used in scientific computations. For instance, the inner product of two n-dimensional vectors x and y is given by:

$$x \cdot y = \sum_{i=1}^{n} x_i y_i$$

This computation requires n multiplications and n-1 additions. The n multiplications are independent of each other and can therefore be executed concurrently. Once the additive terms have been computed, they can be summed together to yield the final result. Given the importance of reduction operations in...
Tags: cppcafe upc_programming parallel upc_forall parallel_performance reduction parallel_computing upc

## Implementing a Scalable Parallel Reduction in Unified Parallel C (part 3)
Continued from the second parallel reduction post. To get better scalability (increasing program performance as the number of threads grows), it is critical to remove the lock from the upc_forall loop. This can be done by accumulating the partial sum computed by each thread into a thread-local variable. A thread-local variable is allocated in the private memory space of each thread, so there are THREADS “instances” of the variable. Each instance of the thread-local variable can be used to accumulate the sum of the array elements having...
Tags: parallel reduction upc upc_forall cppcafe parallel_computing parallel_performance parallel_programming

## Implementing a Scalable Parallel Reduction in Unified Parallel C (part 2)
Continued from the previous parallel reduction post. The result is obviously wrong, but what is the problem? The keen reader might point out that the program as written contains a race condition: multiple threads can write into the shared variable "sum" concurrently, possibly overwriting a partial value stored earlier. To eliminate the race condition, we could protect writes into "sum" using a critical section. In UPC, this is accomplished with a "lock" variable as follows: `upc_forall ( i=0;i<N;i++;&A[i] ) { upc_lock (`...
Tags: reduction parallel_computing cppcafe upc parallel_programming parallel upc_forall parallel_performance