Implementing a Scalable Parallel Reduction in Unified Parallel C (part 2)
NancyWang
Continued from the previous parallel reduction blog post.
The result is obviously wrong, but what is the problem? The keen reader might point out that the program as written contains a race condition. Multiple threads can write to the shared variable "sum" concurrently, possibly overwriting a partial value previously stored by another thread.
To eliminate the race condition we could protect writes to the variable "sum" with a critical section. In UPC this is accomplished by using a "lock" variable as follows:
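A minimal sketch of such a lock-protected loop is given here. The array name `A`, the size macro `N_PER_THREAD`, and the initialization values are assumptions made for illustration and are not taken from the original program; only the shared variable "sum", the upc_forall loop, and the use of a lock come from the text.

```upc
#include <upc.h>
#include <stdio.h>

#define N_PER_THREAD 1000                 /* hypothetical: elements per thread */
#define TOTAL (N_PER_THREAD * THREADS)    /* hypothetical: total array length  */

shared int A[N_PER_THREAD * THREADS];     /* cyclic layout: A[i] lives on thread i % THREADS */
shared long sum;                          /* shared scalar, zero-initialized, on thread 0 */
upc_lock_t *lock;                         /* each thread's private pointer to one shared lock */

int main(void) {
    int i;

    /* Collective call: every thread receives a pointer to the same lock. */
    lock = upc_all_lock_alloc();

    /* Each thread initializes the elements it has affinity to. */
    upc_forall (i = 0; i < TOTAL; i++; &A[i])
        A[i] = i + 1;

    upc_barrier;

    /* Critical section: only one thread at a time may update sum. */
    upc_forall (i = 0; i < TOTAL; i++; &A[i]) {
        upc_lock(lock);
        sum += A[i];
        upc_unlock(lock);
    }

    upc_barrier;                          /* all partial updates are complete */

    if (MYTHREAD == 0) {
        /* With the values above, sum should equal TOTAL * (TOTAL + 1) / 2. */
        printf("sum = %ld\n", sum);
        upc_lock_free(lock);
    }
    return 0;
}
```

The array length is written as a multiple of THREADS so that the declaration is also valid when the number of threads is chosen at run time.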
The modified version of the program outputs the correct result. However, what are the implications of this "solution"? The lock effectively serializes the upc_forall loop iterations, because only one thread at a time can hold it, and so prevents any performance gain from parallel execution. To confirm this theory we measured how long it takes the upc_forall loop above to compute the sum of the array elements. Our experiments were conducted on a POWER5 system running AIX 5.3 using up to 32 threads (Figure 1).
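One way such a measurement could be taken is sketched here. This is not necessarily how the timings in Figure 1 were collected: the helper `wall_seconds`, the use of `gettimeofday`, and the array name and size are all assumptions for illustration. The barriers before and after the loop ensure that the timed region covers the work of every thread.

```upc
#include <upc.h>
#include <stdio.h>
#include <sys/time.h>

#define N_PER_THREAD 1000                 /* hypothetical: elements per thread */
#define TOTAL (N_PER_THREAD * THREADS)    /* hypothetical: total array length  */

shared int A[N_PER_THREAD * THREADS];
shared long sum;
upc_lock_t *lock;

/* Hypothetical helper: wall-clock time in seconds. */
static double wall_seconds(void) {
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return (double)tv.tv_sec + (double)tv.tv_usec * 1e-6;
}

int main(void) {
    int i;
    double t0, t1;

    lock = upc_all_lock_alloc();

    upc_forall (i = 0; i < TOTAL; i++; &A[i])
        A[i] = i + 1;

    upc_barrier;                          /* all threads enter the timed region together */
    t0 = wall_seconds();

    upc_forall (i = 0; i < TOTAL; i++; &A[i]) {
        upc_lock(lock);
        sum += A[i];
        upc_unlock(lock);
    }

    upc_barrier;                          /* wait until every thread has finished */
    t1 = wall_seconds();

    if (MYTHREAD == 0) {
        printf("threads = %d, sum = %ld, time = %f s\n", THREADS, sum, t1 - t0);
        upc_lock_free(lock);
    }
    return 0;
}
```

Note that in this sketch the array grows with the number of threads. To compare runs over the same amount of work, the total length would have to be held constant, for example by compiling for a fixed thread count and declaring an array of fixed size.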
From the results illustrated in Figure 1 we can infer that the time it takes to execute the upc_forall loop does not improve considerably as the number of threads used to execute the program increases. This is what we expected, because the lock in the loop prevents concurrent execution of loop iterations.
How can we get better scalability? Stay tuned!