This week was the 2010 International Workshop on OpenMP (IWOMP) in Tsukuba Science City, where we meet annually to showcase the latest research on parallel computing and OpenMP. An OpenMP language committee meeting immediately follows, which allows us to discuss future enhancements for OpenMP 3.1.
At IWOMP, the keynote was given by Mike Heroux from Sandia National Laboratories, with two invited talks from Michael Wolfe of PGI and Hans Boehm of HP. Mike Heroux's talk was about mixing MPI with any number of other parallel programming paradigms (OpenMP, TBB, and others) to enable a more powerful brand of parallelism. Michael Wolfe's talk was about leveraging accelerator design for the future. Hans Boehm's talk was about ways to make the OpenMP memory model less inconsistent. All three were excellent talks given by superb speakers; I always enjoy listening to them.
In one part, Hans suggested a sequentially consistent (SC) atomics syntax for OpenMP so that it can support proper thread communication. I am not convinced this is what OpenMP is used for, so its current form of relaxed atomics may be acceptable. As long as SC atomics is not the default, I do not object to OpenMP atomics supporting additional memory models, as C++0x does, but at some point I worry about how much we are duplicating the base language's facilities. Hans left the choice up to us to decide what is appropriate and useful for OpenMP's future. I thank him for his many insights.
IWOMP offered much more. I checked out the great OpenMP tutorial, usually run on Monday, by Ruud van der Pas. This was followed by the reception, where we met many people from the conference. A conference in Asia tends to bring in participation from a different part of the world, and this time I met attendees from Korea, China, Japan, and Vietnam, as well as the usual friends from Europe and the Americas.
The paper presentations covered a wide range of research interests, including:
- Runtime and Optimization
- Scheduling and Performance
- Extensions to OpenMP
- Hybrid Programming Models
My two talks were well received, or at least there were no embarrassing questions.
An error model is widely seen as something OpenMP needs in order to expand into commercial computing. Transactional Memory, based on our alphaWorks compiler, is a potential future addition to OpenMP.
There was a poster session where additional work was presented, and there was a great deal of interest in our poster and its possibilities.
For many of us on the OpenMP language committee, there was significant additional work after IWOMP, as we have language meetings afterward until Saturday. In the language committee meetings, we are trying to close items on OpenMP 3.1 as well as continue work on future items. Bronis de Supinski gave an overview of that near the end of the IWOMP meeting in a panel discussion.
Right now, the major items that we are working on are:
1. User-defined reductions
2. Task final clause
3. Improved OpenMP atomics
4. Some affinity support
5. Memory model updates
This list could change, but in later posts I will describe in a little more detail what these items are.
For 4.0, we are working on
1. Error Model
IWOMP closed with a vigorous panel discussion on exaflop computing. Projections claim that 2019 is the year of exaflop computing, based on previous trends in when the various flops milestones were reached. The petaflop line was definitively crossed by IBM's Roadrunner and Cray's Jaguar in 2008, well ahead of projections. I wonder if exaflop will be similar, but consider that it will have to be 1000 times faster than Roadrunner.
Nevertheless, technology often surprises us with its speed, and it would not surprise me if we were to reach exaflop by 2015.
Finally, I want to use this chance to thank our host and his team of tireless members for providing the wonderful facility in Tsukuba and guiding us along to the city's many venues, often going beyond the call of duty. They printed posters for us, answered our questions as "gaijin", and often served as translators in unfamiliar settings. The wireless worked flawlessly, and the Epochal Tsukuba International Congress Center is beautiful. Thank you, Sato-san.
Now it is time to announce IWOMP 2011, which will be in Chicago on June 13-15. See you there!
Parallel and Multi-Core Computing with C/C++
Posted by Michael_Wong. Tags: model, 2011, japan, 2010, tsukuba, transactional memory, error, asia, exaflop, iwomp
Posted by NancyWang. Tags: parallel_performance, reduction, upc_programming, upc, parallel_computing, parallel, upc_forall
A reduction is the process of combining the elements of a vector (or array) to yield a single aggregate element. It is commonly used in scientific computations. For instance, the inner product of two n-dimensional vectors x and y is given by:

x . y = x1*y1 + x2*y2 + ... + xn*yn
This computation requires n multiplications and n-1 additions. The n multiplications are independent from each other, therefore could be executed concurrently. Once the additive terms have been computed they can be summed together to yield the final result.
Given the importance of reduction operations in scientific algorithms, many parallel languages provide support for reductions. For example, OpenMP provides a reduction clause to be used with the OMP parallel construct. In the following example, the reduction clause indicates that "sum" is a reduction variable:
In the code snippet above, each thread performs a portion of the additions that make up the final sum. At the end of the parallel loop, the partial sums are combined into the final result. In this article we will explore different implementations of a global sum reduction in Unified Parallel C, in an attempt to find an efficient and scalable implementation.
Unified Parallel C (UPC) is an extension to the C programming language that allows users to express parallelism in their code. The UPC language subscribes to the Partitioned Global Address Space (PGAS) programming model. PGAS languages such as UPC are increasingly seen as a convenient way to enhance programmer productivity for High-Performance Computing (HPC) applications on large-scale machines.
A UPC program is executed by a fixed number of threads (THREADS) which are created at program startup. In general a program statement might be executed concurrently by all threads. To facilitate the distribution and coordination of work among threads UPC provides a rich set of language features. A tutorial on Unified Parallel C is available at http://domino.research.ibm.com/comm/research_projects.nsf/pages/xlupc.confs.html/$FILE/PACT08Tutorial.pdf.
Let's consider how a naive sum reduction could be written in Unified Parallel C:
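A sketch of such a naive UPC sum reduction, with line numbers for the discussion that follows; details such as the array size and the output format are my own, chosen to match the sample output shown later:

```
 1  #include <upc_relaxed.h>
 2  #include <stdio.h>
 3
 4  #define N 100000000
 5  shared int A[N];
 6  shared long sum = 0;
 7
 8  int main(void)
 9  {
10      upc_forall (int i = 0; i < N; i++; &A[i]) {
11          A[i] = 1;
12      }
13      upc_barrier;
14
15      upc_forall (int i = 0; i < N; i++; &A[i]) {
16          sum += A[i];
17      }
18      upc_barrier;
19
20      if (MYTHREAD == 0)
21          printf("Thread:%d,result=%ld,expect=%d\n", MYTHREAD, (long)sum, N);
22      return 0;
23  }
```

In each upc_forall loop, a thread executes only the iterations whose array element has affinity to it.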
At lines 5 - 6 we declare a shared array "A" and a shared variable "sum". At lines 10 - 12 we initialize all elements of A to 1. At lines 15 - 17 we attempt to perform the reduction by accumulating the sum of A's element values in the shared variable "sum". Is this program correct?
A possible program output is:
Thread:0,result=27176662,expect=100000000

Want to know the answer? Stay tuned!
Posted by NancyWang. Tags: upc_forall, parallel, upc, reduction, parallel_performance, parallel_computing, parallel_programming
Continuing from the second parallel reduction blog post.
To get better scalability (increased program performance as the number of threads increases), it is critical to remove the lock in the upc_forall loop. This can be done by accumulating the partial sum computed by each thread into a thread-local variable. A thread-local variable is allocated in the private memory space of each thread, thus there are THREADS “instances” of the variable. Each instance of the thread-local variable can be used to accumulate the sum of the array elements having affinity to each thread:
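A sketch of this improved version, reusing the shared array A, the shared variable sum, and a UPC lock assumed from the earlier listings:

```
shared int A[N];
shared long sum = 0;
upc_lock_t *lock;           /* allocated with upc_all_lock_alloc() */

long partialsum = 0;        /* thread-local: one instance per thread */

upc_forall (int i = 0; i < N; i++; &A[i]) {
    partialsum += A[i];     /* no synchronization needed here */
}

upc_lock(lock);
sum += partialsum;          /* only THREADS serialized updates in total */
upc_unlock(lock);
upc_barrier;
```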
In the code fragment shown above, the thread-local variable “partialsum” is used to store the sum of the array elements having affinity to the executing thread (MYTHREAD). For example, thread 0 will add array elements A[0], A[THREADS], A[2*THREADS], etc. into its instance of “partialsum”. In order to compute the final result, it is necessary to add the “partialsum” contributions from each thread. To avoid a race condition (a write-after-write hazard on variable “sum”), we use the UPC lock functions to serialize the accesses to “sum”.
The performance results illustrated in Figure 2 demonstrate that the program is now “scalable”. That is, the time taken to compute the reduction diminishes as the number of threads used to execute the program increases. The reason for this improvement is simple: the lock is now acquired THREADS times in total instead of being acquired by each thread in every loop iteration.
In this article, we illustrated the concept of a reduction operation and explained how to implement a parallel reduction in Unified Parallel C. We showed how the UPC lock primitives can be used to guarantee program correctness. We then compared the performance of two distinct correct reduction implementations: one using a lock inside a upc_forall loop, the other using thread-local variables to accumulate partial results on each thread. The performance measurements obtained clearly indicate that locks should be used judiciously (if at all) inside loops.
To get the complete version of the document, please go to http://www-949.ibm.com/software/rational/cafe/docs/DOC-3465.
Posted by NancyWang. Tags: parallel_programming, upc_forall, reduction, parallel_performance, parallel_computing, parallel, upc
Continuing from the previous parallel reduction blog post.
The result is obviously wrong, but what is the problem? The keen reader might point out that the program as written contains a race condition: multiple threads can write into the shared variable "sum" concurrently, possibly overwriting a partial value previously stored.
In order to eliminate the race condition, we could protect writes into variable "sum" with a critical section. In UPC this is accomplished by using a "lock" variable as follows:
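A sketch of the lock-protected loop, assuming the same shared array and shared sum as in the naive version:

```
shared int A[N];
shared long sum = 0;
upc_lock_t *lock;            /* pointer to a shared lock */

/* all threads collectively allocate a single lock */
lock = upc_all_lock_alloc();

upc_forall (int i = 0; i < N; i++; &A[i]) {
    upc_lock(lock);          /* enter the critical section */
    sum += A[i];             /* only one thread updates sum at a time */
    upc_unlock(lock);        /* leave the critical section */
}
upc_barrier;
```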
The modified version of the program outputs the correct result. But what are the implications of this "solution"? The use of the lock effectively serializes the upc_forall loop iterations, preventing any performance gain from parallel execution. To confirm this theory, we measured how long it takes for the upc_forall loop above to compute the sum of the array elements. Our experiments were conducted on a POWER5 system running AIX 5.3, using up to 32 threads (Figure 1).
From the results illustrated in Figure 1 we can infer that the time it takes to execute the upc_forall loop does not improve considerably when the number of threads used to execute the program increases. This is what we expected because the use of the lock in the loop prevents concurrent execution of loop iterations.
How do we get better scalability? Stay tuned!
With the recent publication of the SPEC CPU2006 scores of the POWER7-based p780 server, the IBM Power Systems have regained leadership on both the SPECint 2006 and SPECfp 2006 components of this industry benchmark suite.
In particular, the peak FP score of 71.5 is 20% higher than the previous best result.
One of the key features that have enabled this achievement is the use of automatic parallelization technology on several benchmarks. This not only highlights the advanced compilation and optimization technology in these compilers, but is also an indicator of the performance available on these systems when using parallel programming, for example through OpenMP.
Detailed results are available directly from spec.org:
SPEC® SPECint® and SPECfp® are registered trademarks of the Standard Performance Evaluation Corporation. Benchmark results stated above reflect results as of May 18, 2010. For the latest SPEC® CPU2006 benchmark results, visit www.spec.org.
Posted by Michael_Wong. Tags: c++, boostcon, tm, 2010, amino, concurrency, c++0x
Hi, all. I came back from BoostCon2010:
where I delivered three talks and participated in a panel discussion on Transactional Memory, along with such luminaries as Maurice Herlihy (the father of TM), Mark Moir (Sun), and Tatiana Shpeisman (Intel), with whom I have worked for two years on the Draft Specification for C++ Transactional Memory.
I gave an update on C++0x, outlining the new schedule for a Final Committee Draft that was voted on in Pittsburgh:
The talk on C++0x Concurrency was well attended by about 60 people, many of them clearly experts; they asked insightful questions and kept me on my toes. I was thankful that I was able to answer many of the questions.
As I was talking, I went over Clause 29 of N3092 which is our FCD.
Sebastian Redl and others from the audience noticed a number of editorial problems in the Atomic Operations clause, which we will submit as a National Body Comment.
Three other questions were asked at the talk, which I passed on to Lawrence Crowl; here are his answers:
1. Would it be desirable to change ATOMIC_FLAG_INIT to STD_ATOMIC_FLAG_INIT?

It already seems pretty long, and the other macros all start with ATOMIC.
2. Can the result of is_lock_free() return a different value over time (I know it can change per instance)? We thought that perhaps on a distributed architecture, is_lock_free() might be different over time.

We later changed the definition to require lock-freedom to be consistent across objects:
The function atomic_is_lock_free (29.6) indicates whether the object is lock-free. In any given program execution, the result of the lock-free query shall be consistent for all pointers of the same type.
The problem was that you need to test before you allocate.
3. Can compilers diagnose misuse of atomic types in a lock-free context when a type isn't lock-free for a given platform?

I think the compiler can only know when a type might not be lock-free, which would probably be sufficient. All the locking operations will likely be in a dynamic library that gets overridden with lock-free operations on platforms that support them.
Wednesday brought with it a special Transactional Memory (TM) day organized by Justin Gottschlich, whom I met last year and who is doing some interesting work on releasing a TM library for Boost.
The keynote from Maurice Herlihy gave a whirlwind tour of Transactional Memory today. He highlighted the key papers of the last decade which have led to the recent growth in TM.
He described the usual issues with synchronization overhead, privatization, lock elision, and the Gartner hype cycle, while taking a closer look at the reasons behind some of the most vocal proponents and opponents of TM. The day continued with a delivery by each of Sun, IBM, and Intel on their TM offerings. I chose to spend my part talking about the different ways the TM runtime can be configured:
This was followed by a panel discussion where I joined Maurice Herlihy, Mark Moir from Sun/Oracle, and Tatiana Shpeisman from Intel to describe our views on the future of TM. The panel got very lively on one topic: what is the one thing TM will need in the next two years?
Almost everyone agreed that it would have to be some kind of application conversion to TM, or application use of TM. It is not clear whether converting an existing application to TM would actually yield any significant benefit, given that TM still has problems with several common usage idioms compared to locks. Some common cases that TM doesn't handle well are I/O, locks, dynamic libraries, and even time delays (imagine a time delay in an atomic transaction). These are common usage idioms inside locks today. Here is a list from Paul McKenney outlining these issues:
Until TM can come up with a way to deal with these common usage idioms, any conversion to TM will meet with the difficulties that the conversion to TxLinux met, where only some uses of locks could be converted to TM while others had to be left as locks.
This means that TM, as some researchers suggest, may be relegated to a narrowly defined domain where code does not use any such questionable operations. This is not an unappealing answer either, and it certainly simplifies the unrealistic expectations placed on TM.
As part of this TM day, we also unveiled the first public talk on the Draft Transactional Memory Specification, given by Tatiana Shpeisman of Intel.
As usual, BoostCon is full of talks that open your eyes to the possibilities of C++ and Boost. Last year I suggested that there should be a special parallel programming day for Boost, and Justin made that a reality. Good work, Justin, and thanks for inviting Maurice Herlihy, who is a fantastic guest and speaker.
As I look back, I can see the potential of BoostCon becoming the premier C++ conference. It is now not just about Boost, as we are looking at all things C++, including what is coming in terms of 0x and TM. It is not hard to see a future where people will get even more from BoostCon besides Boost. I know a segment of the Boost community is resisting that, and I don't blame them. Right now the size, as some would say, is just right to have a community feel. But others would argue for expansion and growth, and the benefits that would be derived from such growth.
As usual, I always want to go to as many talks as I can, because BoostCon is not just about learning but also about reconnecting with people who are active in the C++ community. I saw Marshall Clow, whose company was generous enough to provide some sponsorship (I heard), Nevin Liber, Stephan T. Lavavej, my good friend Christophe Henry, with whom I hung out afterward on a visit to a few National Parks, the wonderfully funny Jeff Garland, the incredibly informative Doug Gregor, and of course the wonderful and knowledgeable Joel de Guzman and Hartmut Kaiser. New people I met included Dean Michael Berris, whose blog I have read from afar, and Robert Ramey, because we have been working on his Serialization Library. And of course, we must not forget to thank Dave Abrahams for being instrumental in starting a conference such as BoostCon.
I went to Eric Niebler's Proto talk and continued to marvel at the simplicity that allows C++ to host a Domain-Specific Embedded Language (DSEL) used to build other DSELs. He is a powerful speaker, and he has helped us with some of our Proto tests. Another good speaker was Michael Caisse, who talked about Spirit, Karma, and Qi, all built upon Proto. I was intrigued by Joel Falcou's (a.k.a. the Crazy Frenchman's :) Numeric Template library, which uses IBM's PowerPC hardware, among others, to build a Fortran-like array library using templates. This fills a niche that many high-performance users are looking for as they move from Fortran to C++: a drop-in replacement for the Fortran array syntax (vectors still don't quite do it). Currently it uses G++ on PPC (as well as a number of other hardware/compiler combinations), but I would be eager to try it out with our xlC++ compiler when the library is released on Boost. They were not the only people using IBM's PPC that I encountered, and frankly the number of IBM customers I get to meet here makes this trip extremely gratifying. As usual, I can't say who they are, but you know who you are, and thank you for coming up and saying hello afterward.
Delivering three talks and being part of an expert panel was a bit much, and it made me less able to enjoy BoostCon thoroughly by attending other talks this year as I did before. However, the reaction I got was terrific, and it continues to urge me to give more to the community.
There is a variety of tools that work with the IBM XL compilers. Some help productivity in the development phase (the IBM debugger, RDp), some help exploit the architecture's characteristics (compiler reports), and some help utilize the hardware.
The IBM Parallel Environment (PE) program product is a distributed-memory message-passing system supported on AIX and Linux. It is a separate IBM product. For detailed information, refer to http://www-03.ibm.com/systems/software/parallel/index.html.
PE is designed for developing and executing parallel Fortran, C, or C++ programs. PE supports the two basic parallel programming models – SPMD and MPMD. In the SPMD (Single Program Multiple Data) model, the same program is running as each parallel task. The tasks, however, work on different sets of data. In the MPMD (Multiple Program Multiple Data) model, each task may be running a different program.
Let’s talk about how the XL compilers and PE work together to exploit the parallel computing environment. PE provides a set of invocation commands for compiling programs that are executed in the parallel environment. Each invocation command invokes the XL compiler with specific options and links in special libraries (the Partition Manager and message passing interface libraries) for executing in the environment. The names of these invocation commands start with “mp”: for example, mpcc_r for C programs, mpxlf90_r for Fortran programs, and mpCC_r for C++ programs. For executing the program in the parallel environment, the poe command is also provided to invoke the Parallel Operating Environment (POE) for loading and executing programs on remote processor nodes.
We will walk through a few steps with a simple program to illustrate how the XL compilers and PE work together. In this example, we have a C program (main.c) that calls a Fortran procedure (arr_cal.f90) for computation and then prints the result in main.
On AIX, the following commands are used to compile and link the program.
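A sketch of what these commands look like, assuming an executable named test1 built from main.c and arr_cal.f90; the exact library options (the Fortran runtime library name below is an assumption) vary by installation:

```
# compile the Fortran procedure with the PE Fortran invocation command
mpxlf90_r -c arr_cal.f90

# compile the C main program, then link both objects together;
# mpcc_r links in the Partition Manager and MPI libraries
mpcc_r -c main.c
mpcc_r -o test1 main.o arr_cal.o -lxlf90_r
```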
The executable can be run on a cluster of machines by using the poe command. Before using the poe command, a host file needs to be created to specify the hosts on which the program is executed. In addition, the same directory (with the same absolute path as the current directory on the local host) has to be created on all the remote hosts.
The command to execute the program on the listed hosts is shown as follows:
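A sketch of the run step, using the program and host-file names from this example; the exact spelling of the host-file option varies between PE releases, so treat these commands as illustrative:

```
# copy the executable to the same path on every remote host
mcp test1

# run 4 tasks on the hosts listed in host.list,
# labeling each output line with its task id
poe test1 -procs 4 -hostfile host.list -labelio yes
```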
The -procs option specifies how many tasks are created. In this case, four tasks are created to execute the same program (test1) on the different hosts specified in the host file (host.list). The mcp command copies the executable to the remote hosts. The -labelio option specifies that the output from the parallel tasks is labeled by task id.
This example is a simple program to illustrate how the XL compilers work with PE. If you need to develop a program that exploits a distributed environment, PE is an essential tool. PE also provides a parallel debugger (pdb) for debugging parallel programs.
In this blog, we briefly described the PE product and how it is used with the XL compilers to exploit a parallel environment. Of course, programs can be much more complicated and useful than the simple one discussed here. Some applications decompose the problem into smaller pieces and distribute them to different hosts to work on; after the work finishes, the application collects the data from the hosts for the final result. We also demonstrated how the poe command is used to execute the program on remote hosts.
It's a special time of the year in more than one way.
In addition to getting together with your family and friends, it is also time to consider attending/leading sessions or writing papers, in several parallel programming conferences.
IWOMP 2010 is the premier conference for OpenMP and shared memory parallelism on C, C++ and Fortran.
It usually consists of user tutorials, paper presentations on what is coming next for OpenMP and current ongoing research in shared memory parallelism, as well as tools presentations.
BoostCon 2010 is a C++ conference on Boost, the open-source library collection written by Boost experts. This year, we are planning a special session on Transactional Memory, including a very special guest.
I have been on the Program Committee for both of these conferences and enjoy their role in driving new directions in these fields.
Posted by Michael_Wong. Tags: concurrency, stm, c/c++, blocks, building, transactional_memory, amino
Transactional Memory (TM) is a high-level abstraction for supporting safe mutable shared state, such that the user does not have to worry about the low-level details of locking and sharing of global resources. It is basically a class of optimistic speculation techniques in which groups of memory operations are bundled into a single atomic operation, so that it can resolve the problems with locks and possibly support composability.
The basic idea is to run your group of atomic operations straight through, assuming they will succeed, and only roll back when a conflict actually occurs.
At the moment, many of the ideas of TM exist as ways to test out the concept, and may be integrated into some future hybrid system. Even practitioners of TM know there is a certain amount of hype that we have to deal with in any new technology before it drops into a trough and rebounds back to a realistic plateau.
A number of vendors have planned both hardware and software implementations of Transactional Memory.
The software transactional memory compilers from different vendors all use different syntax, and this creates a basic problem with interoperability, and common porting of code. I will deal with this in the next post.
IBM is also working in this area, and in fact released an alphaWorks compiler supporting Software Transactional Memory last year.
The IBM XL C/C++ for Transactional Memory for AIX is also accessible from the Resource Library of the C/C++ Cafe.
The public domain STM runtime is compatible with the AlphaWorks XLC STM release.
It was released through the Amino Concurrency Building Blocks project
The source code is here
I will have more to say about the Amino Building Blocks in a future post.
Hello all. Over the last year, a group of Transactional Memory experts from Sun, Intel, and IBM have been getting together every Friday to discuss how to create a uniform syntax for Transactional Memory.
We are happy to release the first version of the Draft Specification of Transactional Language Constructs for C++. This specification is the result of joint work by a group of people from Intel, IBM, and Sun, and is based on our experience working with transactional language constructs. We would like to encourage people to implement this specification, and we welcome feedback on the document. Please direct any such feedback to the discussion group TM & Languages.
You can find the specification in the Resource Library under Articles, Presentations, and Ebooks, in the Parallel section.
If you have any comments, I invite you to post them here or in the Discussion forum.
TM Specification Drafting Group
Posted by Michael_Wong. Tags: conference, boostcon, transactional_memory, c++, parallel_programming, cmake, boost
I apologize for the lack of updates recently; an addition to the family has kept me hopping.
I have still been keeping up my parallel programming work, most recently with a talk on C++0x Multithreading at BoostCon 09:
My two talks (the other was an overview of C++0x and compiler support, which can be seen here: http://www-949.ibm.com/software/rational/cafe/blogs/cpp-standard/2009/05/26/the-view-or-trip-report-from-the-mar-2009-c-standard-meeting) were each packed, with about 60 people in an auditorium for about 90 minutes. Here is a trip report from a fellow speaker, Justin Gottschlich, who attended my talks:
The slides and video should be online soon.
A second trip report review is by Emil Dotchevski:
Justin gave an excellent talk on a proposed Boost Transactional Memory library, on which he is collaborating with the father of TM, Maurice Herlihy.
I have been involved in Transactional Memory for a few years now and know its hype, promise, and pitfalls. Still, I was stirred by Justin's excellent oratory skills in pitching this technology. Doing TM as a pure library is not easy, although it does get the technology into everyone's hands as fast as possible. Language changes take time to get right; I should know, and will discuss this in a future post.
Please read their excellent trip reports for the details. I will turn my discussion to something else that is also interesting to me personally, but may not have much to do with parallel programming.
BoostCon 09, as with previous BoostCons, was an exciting experience. Without intentionally tooting my own horn, BoostCon 09 was rich with speakers experienced in the field. They choose their speakers carefully from the pantheon of C++. Last year it was Bjarne Stroustrup; this year it was Andrei Alexandrescu, whose topic was Iterators Must Go.
Andrei gave many reasons why iterators, once a good idea, are unsuitable as we move forward. For me, the most interesting argument is that iterators are not well suited to multithreaded programming, because many of the ideas behind stack pop and push can only work in a single thread unless we change the interface.
BoostCon, in my opinion, is rapidly becoming the leading C++ conference, in direct competition with SD West and ACCU. All are packed with workshops and knowledgeable speakers.
This year, there was a distinct parallel programming theme, which included:
All this makes me want to suggest a special track for parallel programming for Boost in future years.
I attended some of the 0x tutorials and found that there were still things I didn't know. This is not surprising given the depth of C++0x.
The other interesting part was the CMake tutorial. The Boost build has used bjam since the beginning. This build tool, while interesting in its own right, has many peculiarities that make building Boost on IBM systems a non-trivial task. From the stories we heard around BoostCon, the same seems to apply to other environments.
Recently, there has been a move toward using CMake as a truly better build tool. From what I saw in the workshop, they are right, and I am eager to move our Boost build to a CMake system, especially if Boost is moving in that direction, to rid us of the problems that bjam has caused.
How does CMake differ from traditional Unix make?
Unlike make, it does not actually perform the software build; instead it generates standard build scripts (makefiles on Unix, project files for Windows, workspaces for Eclipse/CDT), which makes it easier to adapt to various systems.
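A minimal CMakeLists.txt sketch of the idea, with hypothetical project and file names:

```
# CMakeLists.txt: CMake reads this description and generates the native
# build files (makefiles, Visual Studio projects, Eclipse workspaces);
# the generated files then drive the actual compilation.
cmake_minimum_required(VERSION 2.6)
project(hello CXX)
add_executable(hello main.cpp)
```

Running cmake in a separate build directory, then the native tool (make, nmake, and so on), performs the build itself.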
CMake was started as part of the build for Kitware's Insight Toolkit and has since migrated into many products. The real explosion occurred with the adoption of CMake by KDE. Since then, even more software has been converting to CMake.
CMake is so far able to build Boost, but it is not yet able to run Boost's built-in tests. This is a problem that I am sure will be rectified.
The beauty is that cmake is available already as a binary on AIX systems.
I worked with Troy Straszheim and Brad King to try to get CMake working on our Boost build. I got halfway but found a problem that I hope to resolve.
We support Boost: the IBM xlC++ compilers on AIX and Linux have been tested against it, V8 with Boost 1.32, then V9 with 1.34, and V10.1 with 1.34.1.
You can see our Boost test results here:
The following is a private communication from an IBM engineer, Matthew Markland, who asked a great question. I do not claim great expertise, but I feel that there is enough of an opinion piece here that some folks may like to see this discussion or continue it. I have edited the response somewhat, but it is largely intact and reprinted with Matthew's permission. Note that I have no insight into PGI or any other product beyond what I read in public articles, and as such I make no product claims. Any opinions regarding other companies remain necessarily my own and are not IBM's position.
Please join in the discussion, or even bring this up yourselves, as other experts will chime in. Matthew wrote:
I hope that the new year finds everything well for you and yours.
I'm enjoying the C/C++ Cafe posts you guys put out immensely.
I just wanted to get your opinion on some things that have been
going through my mind with respect to the multicore/hybrid
programming models that are being put out by various entities. It
seems that many people believe that the best model is an extension
to the language model, be it a pure language extension like what
CUDA and OpenCL have, or with a new model of pragmas like PGI is
OpenCL/CUDA is mostly a library based model and a language extension(modulo the 4 memory annotations). But yes I see where you are going with this ...
adding. I'm wondering, especially in the case of the PGI extensions,I am assuming this is the pragma directives available in their technology preview:
whether they make sense given the existing OpenMP specSo there has been parallel languages that are directive based, language extensions, and library based. Usually they start off with library based because they are easy to port, and works on many vendors' compiler. Language-based solutions are harder to implement, and can not be easily corrected if wrong. Directive-based like OpenMP makes it easily adapted in an incremental manner, and keeps the base program running even on platforms that don't accept the directive. Today, we have examples of all three. MPI is a pure library based solution. Cilk is a pure language based solution and OpenMP is a directive-based solution (although it too has a library part).
Where do you
A mostly library based language like OpenCL is in a sense a step backwards. So PGI is trying a directive based approach to send the computational kernel to the accelerator/GPGPU. This is a bet from their part. I am familiar with their chief compiler engineer on the OpenMP Committee Michael Wolfe, and respects his opinion.
see this headed from a personal perspective.
Having some involvement in OpenCL, I can see where it falls somewhat short, but is nevertheless a tremendous accomplishment. It is designed for today's GPGPU architecture, assumes a weak memory model, implicitly have a dual layer of scheduling policy between the host (outer asynchronous layer) and the thread processors (inner synchronous processors with local memory). This is in addition to it being still relatively hard to program,( though easier then DirectX or OpenGL) and for people who have to port a 100,000 line of code is a large commitment on a technology that may not be around. OpenCL, is still a stream processing language and as such is limited in the scope of the parallel programs it can speed-up. What PGI is probably looking for is a more generalized programming model which works in broader situation. That is why they introduced the scheduling clause, and tied it to OpenMP. I would not be surprised if some kind of heterogenous programming support would be in OpenMP in future.
I don't have any significant personal insight but also is involved in adapting the OpenMP paradigm to fit in the next programming model without knowing where to go.
In the end (and this is based on Michael Wolfe's excellent analogy in an HPC paper), OpenCL is basically designed for hardware that is like a large wide-body air carrier: it can handle a massive number of passengers in one run, but requires special airport transportation to get the passengers to the plane, because the plane doesn't fit in the terminal. So the speed it has (in terms of passenger-miles) is offset by the wait time (DMA access) of loading the plane. It works when everything fits.
If you don't have that many passengers, or have a variable number of passengers, a super wide-body jet doesn't buy you any extra benefit and may penalize you. And there are lots of other kinds of air carriers out there, including the super-fast kind for when the payload just has to get there by 9 am the next day, and the medium-sized ones that can carry your particular amount of load.
As such, there will still be a place for OpenMP, MPI, TBB, futures, UPC, and TM. We are suffering under an alarming number of these so-called parallel languages/extensions/libraries lately, and I can only see more coming as we all search for the right model. At one point, we had the same situation with sequential languages, and over time we dwindled down to a few general-purpose languages with many domain-specific languages. The same will likely happen in the parallel language world.[Read More]
One of the most important things that happened in the last month of 2008 was the release of the OpenCL specification by Khronos:
An in-depth overview which breaks down the specification clause by clause shows some of its capabilities. A shorter summary provides an overview.
IBM is part of the group that wrote this specification.
What is OpenCL?
The original intent of OpenCL was to raise the abstraction of graphics programming. Game programmers will recall the battle over graphics programming using DirectX and OpenGL, which are specialized graphics languages. Now they can speak OpenCL and never have to learn these specialized graphics languages.
NVIDIA's vendor-specific language CUDA was meant to do this using C, as were, to some extent, AMD's Close-To-Metal and of course Microsoft's DirectX 11 Compute. But an open, royalty-free specification makes it far easier for everyone to invest their programming time in this. Apple, AMD, NVIDIA, and Intel are also members who participated in this specification. I wonder if Intel's Larrabee will support it.
For parallel programming, this specification enables both data-based and task-based parallelism. It enables programmers to exploit the power of Graphics Processing Units as general computing devices (GPGPU), which has recently been shown to give significant speedups in specific applications.
Unlike other parallel languages, OpenCL is aimed at supporting heterogeneous computing.[Read More]
Let's get to the good stuff. OpenMP 2.5 did not really specify which constructors should be called for variables in the various private/firstprivate/lastprivate/threadprivate clauses:
In some cases, it did not even specify that they should apply to non-PODs (Plain Old Data, i.e. C structs).
OpenMP 3.0 changed that. Besides specifying behavior for non-PODs, it also specified precise rules for the constructor sequence, in line with what the semantics would require.
For instance, it specifies that a firstprivate class type variable must have an accessible copy constructor, since each of the one or more list items private to a thread is required to be initialized with the value that the corresponding original item has when the construct is encountered.
For a class type variable in a lastprivate clause, it requires an accessible, unambiguous copy assignment operator. It also requires an accessible, unambiguous default constructor, unless the variable is also specified in a firstprivate clause.
This is the most interesting part, as it differentiates three kinds of initialization in C++.
1. Without initialization: Object1 o;
The semantics of this is that, for the master thread, global static objects and static class members are constructed before main() is entered, in an unspecified order.
For the slave threads, the exact point in time of object construction is unspecified, but it has to happen before the thread references the object for the first time.
These changes were a long time coming. What used to be vaguely implied by the 2.5 specification is now clearly specified, so all compilers can conform. It also gives users of C++ with OpenMP more consistent behavior.[Read More]
This part will talk about enabling threadprivatization of static class member variables.
In 2.5, as a result of ambiguous language, support for this was inconsistent. In general, the specification claimed that a threadprivate variable must have namespace, file, or block scope.
In 3.0, this code is now allowed:
This may seem a trivial change, but for C++ it enables a powerful idiom of singletons and allocators, which all rely on static class member variables.
The next posting will continue with the semantics of private, firstprivate/lastprivate, and threadprivate+copyin/copyprivate for C++.[Read More]
This was one of the questions asked at SC08. I will try to answer it here, starting now and adding more as I move through the various topics.
OpenMP 3.0 had better support for C++ in the following areas:
For-Worksharing with Iterator loops:
We specifically enabled C++ RandomAccess iterators and C pointers to be parallelized with explicit directives.
I will follow up with more examples.[Read More]
I should include an obligatory photo of the OpenMP booth with our CEO Larry Meadows. Many members worked tirelessly, especially Larry, at making sure the booth was put up, taken down, and staffed properly.
We also met many people who specifically dropped by to see what was going on, in this first year where we had a booth.
Of course, we are very proud that the new list for TOP500 computers has been released here at SC08 on Tuesday at an evening BOF.
This is the list of the fastest supercomputers in the world. IBM's Roadrunner has been on this list for some time.
The rumor before the conference was whether Oak Ridge's Cray Jaguar would catch a Roadrunner. Now we know that a Jaguar is not faster than a Roadrunner.
Also of importance in today's energy conscious world, is that IBM's RoadRunner is also #3 on the TOP500 GREEN list of supercomputers:
The OpenMP Birds-of-a-Feather session at SC08 was very well attended; the room was full to overflowing, with approximately 60-80 people. While OpenMP had BOFs at SC in prior years, this is actually the first year that OpenMP has had a booth on the exhibitor floor as well.
The BOF had many elements, including what is new in OpenMP 3.0. The agenda was:
1. Welcome and summary of ARB news
- 5 mins
- Larry Meadows
2. The three greatest things about OpenMP 3.0 and the three most
important things left out of OpenMP 3.0
- 15 mins
- Tim Mattson
- 10 to 15 mins
- Alex Duran
4. Extending the OpenMP profiling API for OpenMP 3.0
- 10 to 15 mins
- Oleg Mazurov
- 5 mins total
- IWOMP'09 announcement - Matthias Mueller
- OpenMP book examples - Ruud van der Pas
6. Panel "How to kill OpenMP by 2011"
- 35 mins
- Bronis de Supinski
7. Wrap up
- 5 mins
- Larry Meadows
For a description, you can find lots of detail here, including a downloadable summary card of the 3.0 Specification.
I met a number of people during my afternoon session manning the booth on Wednesday 2-6 pm, including one of the original founders of OpenMP, an instructor who uses OpenMP in graduate courses, as well as a consultant, among many others. I invite them all to drop me a note here so we can continue our discussion.
One thing that surprised many folks was how many compilers already have 3.0 implementations, despite the specification having been ratified only in May 2008. The companies with 3.0 implementations include IBM, Sun, PGI, and soon Intel. Note I am just reading this from the net and have no insight into other companies' or even my own company's release schedules. GNU should have something by 4.4. Same disclaimer.
One thing that happened at the OpenMP BOF was a panel discussion on How to Kill OpenMP by 2012. This is kind of a fun session, especially at the end of a long day, to remind us not to take ourselves too seriously. The point is to showcase all the wrong ways of spreading a de-facto standard.
I have had some experience working with language designs through my various roles as standards rep and compiler writer. So I thought I would give my $0.02 here on how to kill OpenMP by 2012:
10. Don't implement the specification as stated.
9. Make it impossible to nail down ambiguities by having no way of addressing defects.
8. Ignore the user forum or suggestions.
7. Add everyone's favorite feature, no matter how marginally useful.
6. Make the process as non-transparent as possible, so no one knows when you are ratifying, or even what you are doing.
5. Debate endlessly, on anything, not necessarily having anything to do with the language.
4. Design a feature as completely as possible before releasing it.
3. Make no concession when taking a stance on objecting to someone's feature.
2. Form close-knit elitist groups and follow the NIH (Not Invented Here) syndrome.
1. Don't organize any meetings, and when there are meetings, don't follow any rules.
Seriously, I have not found this to be a problem in any of the committees that I am a part of. Otherwise, we would not have made any progress. But we can always do better.[Read More]
On Sunday, I attended the OpenMP tutorial hosted by some of my colleagues from the OpenMP committee, Tim Mattson and Larry Meadows. It was an excellent tutorial, full of features and pointing out the behavioral differences between OpenMP 2.5 and 3.0. Tim has been doing this tutorial for almost 20 years, and Larry, as the OpenMP CEO, has been involved for the same amount of time.
I must point out that the IBM xl C/C++ 10.1 compilers for AIX and Linux both have support for OpenMP 3.0, in addition to the compilers mentioned during the tutorial. In fact, this compiler was one of the first in the industry to support OpenMP 3.0 since the ratification of the specification in May 2008.
You can get a trial version from the link above.
There were a few C++ specific questions. During the discussion of the OpenMP memory model, one question involved how volatile in C/C++ is used in multithreaded programs.
The C++0x Standard will clarify that the volatile keyword continues to have nothing to do with multithreading. It merely indicates that something in the environment may change the value. An atomic variable, by contrast, indicates that another thread may change the value. It is possible to declare a variable as an atomic volatile to indicate that something in the environment AND another thread may change the value. The C standard, when it supports concurrency in C1x, will likely adopt the same meaning.
So in short, C++ volatile != Java volatile, and C++ volatile != C++ atomic. During the C++0x concurrency deliberations, we discussed this at length and felt that there is too much history and too many presumed meanings attached to volatile. In fact, it may even have different meanings across compilers and platforms. So it was felt wiser to leave it alone, allowing implementers to retain whatever meaning they are used to, and to adopt a new way of naming truly atomic types.
I hope this helps to answer one of the questions regarding the role of volatile.
There was another question about how OpenMP 3.0 changed which constructor (default, copy, assignment) is called for each private/threadprivate/firstprivate/lastprivate variable. Tim was right that we worked very hard to clarify this aspect, but the rules, while somewhat intuitive, are still fairly tedious to enumerate, given the many ways that C++ does initialization. I will clarify that in a subsequent post.[Read More]
Hi, all. This is Michael Wong. I am at the International Conference for High Performance Computing, Networking, Storage and Analysis, commonly called SC08, courtesy of IBM.
I am specifically here to participate in the OpenMP sessions, where we have a booth, a BOF, and other discussion panels.
Over the next few posts, I hope to bring you some of the exciting events that are going on at this conference.
Looking at the program, there are a number of sessions aimed at Parallel Computing. If there are sessions you would like me to report from, drop me a note through this post and I can see what I can do.
On the flight down, let me just say that the last direct flight from Toronto (where I come from) to Austin was like a reunion of technologists as nearly everyone on the plane was attending SC 08. I recognized people from almost every software/hardware company that is active in this area. It was an interesting high-tech flight.
Normally, there is no direct flight from Toronto to Austin, and it takes multiple stops to get here. I am sure they didn't put together a direct flight just for us for SC 08.
(Actually, I learned on the plane that the direct flight started a year ago by Air Canada!)[Read More]
TANSTAAFL="There ain't no such thing as a free lunch" with apologies to Robert Heinlein in "The Moon is a Harsh Mistress"
One noted author quotes this in reference to the end of the free performance improvement that Moore's Law used to deliver in single-chip clock speed every 18 months.
This heralds that future performance improvement will require more than just sitting on your hands waiting for the next clock speedup: you have to program latent parallelism into your code to take advantage of more cores as they come out.
This requires a new mindset, in everything.
Hi, my name is Michael Wong and I am IBM's OpenMP and C++ Standard representative. I and my colleagues will bring you, through this blog, some of the key industry trends related to Parallel and Multi-Core Computing. We have been involved through our roles as implementers, academic liaisons, and representatives on various standard committees, and more.
The fact is that times have never been better for Parallel & Multi-Core Computing, with almost every standard, specification, and manufacturer coming out with support for multicore in programming models, applications, and hardware. Some recent examples are Cell, OpenMP 3.0, the new Java memory model, UPC and Co-Array Fortran, C++0x concurrency, Cilk, transactional memory, auto-parallelization, AltiVec, and various proprietary or research projects.
There is no question that multi-core is a renaissance in systems and processor design. But it is also a software issue, with implications throughout the entire software development stack.
So come back often and see how this story evolves.[Read More]