with Tags:
parallel_performance
X

## Assist Threads Part 1
Today’s multi-core processors support many threads of execution and can provide substantial performance when running multithreaded applications. Unfortunately, multithreaded programming is difficult, and as a result, a lot of today’s software is still single-threaded. Single threaded applications only exploit a single core and leave a large portion of the processor unutilized. The current microprocessor architectures pose another challenge – the memory latencies are rising relative to the clock speed of processors. This causes a core to pay a... [More]
Tags: parallel_performance parallel cppcafe concurrency parallel_computing |

## Implementing a Scalable Parallel Reduction in Unified Parallel C
A reduction is the process of combining elements of a vector (or array) to yield a single aggregate element. It is commonly used in scientific computations. For instance the inner product of two n-dimensional vectors x, y is given by: This computation requires n multiplications and n-1 additions. The n multiplications are independent from each other, therefore could be executed concurrently. Once the additive terms have been computed they can be summed together to yield the final result. Given the importance of reduction operations in... [More]
Tags: cppcafe upc_programming parallel upc_forall parallel_performance reduction parallel_computing upc |

## Implementing a Scalable Parallel Reduction in Unified Parallel C (part 3)
continue from the second parallel reduction blog . To get better scalability (increased program performance as the number of threads increases), it is critical to remove the lock in the upc_forall loop. This can be done by accumulating the partial sum computed by each thread into a thread-local variable. A thread-local variable is allocated in the private memory space of each thread, thus there are THREADS “instances” of the variable. Each instance of the thread-local variable can be used to accumulate the sum of the array elements having... [More]
Tags: parallel reduction upc upc_forall cppcafe parallel_computing parallel_performance parallel_programming |

## Implementing a Scalable Parallel Reduction in Unified Parallel C (part 2)
continue from the previous parallel reduction blog The result is obvious wrong, but what is the problem? The keen reader might point out that the program as written contains a race condition. Multiple threads can write into shared variable "sum" concurrently, possibly overwriting a partial value previously stored. In order to eliminate the race condition we could protect writes into variable "sum" using a critical section. In UPC this is accomplished by using a "lock" variable as follow: upc_forall ( i=0;i<N;i++;&A[i] ) { upc_lock (... [More]
Tags: reduction parallel_computing cppcafe upc parallel_programming parallel upc_forall parallel_performance |