Parallelize C/C++ code on z/OS with OpenMP

Optimize applications on multicore architectures

The new XLC C/C++ compiler, Version 2.1 for z/OS, offers support for the OpenMP 3.1 standard for parallel programs. This article gives a high-level overview of this feature and provides simple examples of how to use the available OpenMP constructs.

John R Barboza (jbarboza@ca.ibm.com), Compiler Testing and Validation, IBM

John is responsible for testing OpenMP 3.1 on the z/OS C/C++ Compiler V2R1. He has worked for IBM since 2012.



08 January 2014

Why use OpenMP

Parallelizing C/C++ code improves runtime speed and efficiency by distributing independent execution tasks across as many CPU cores as possible. For the parallel portions of a program, the speedup can approach the number of CPU cores available.

OpenMP allows a programmer to gain the benefits of parallel code without doing all the work required to set up a parallel environment: thread creation, work distribution, resource management, and more. With OpenMP, the code remains almost as readable as a single-threaded application while performing like a hand-parallelized one.


How OpenMP works

OpenMP allows you to mark certain sections of the code as "parallel". A parallel section is run by every thread in the OpenMP thread team. The non-parallel sections are run only by the thread that started the program, also known as the master thread. The number of threads in the team is specified either in code by the developer or by the environment at program startup. Figure 1 illustrates how this works for a parallelization factor of 4 threads.

Figure 1. How OpenMP works

Creating a parallel section is simple: you create an "omp parallel" block. All executable code within this block is run by every thread in the thread team. Listing 1 shows an application that prints "Hello World!" using an "omp parallel" block.

Listing 1. Basic HelloWorld
#include <iostream>
int main()
{
  #pragma omp parallel
  {
    std::cout << "Hello World!\n";
  }
  std::cout << "Non-parallel hello world!\n";
}

Listing 2 shows the output of Listing 1 when run with 4 parallel threads.

Listing 2. HelloWorld output
Hello World!
Hello World!
Hello World!
Hello World!
Non-parallel hello world!
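The number of threads used in Listing 1 depends on how the thread team is configured. As a minimal sketch (the thread count of 4 is only for illustration), the following program requests 4 threads with omp_set_num_threads() and has each thread report its own number with omp_get_thread_num(); alternatively, you can leave the source unchanged and set the OMP_NUM_THREADS environment variable before running the program.

#include <omp.h>
#include <cstdio>

int main()
{
  omp_set_num_threads(4);   // or: export OMP_NUM_THREADS=4 before running

  #pragma omp parallel
  {
    // every thread in the team runs this block
    std::printf("Hello from thread %d of %d\n",
                omp_get_thread_num(), omp_get_num_threads());
  }

  // back on the master thread (thread number 0)
  return 0;
}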

OpenMP constructs on z/OS C/C++

There are many aspects to OpenMP. OpenMP takes care of:

  • The creation of threads and distribution of tasks to threads
  • The creation of a thread-private stack that stores each thread's private variables
  • The synchronization of concurrent threads to prevent race conditions and other well-known problems that arise in parallel programming

The sections below explain how to utilize these aspects of OpenMP.

Create threads and distribute work

Use the construct in Listing 3 to create parallel sections that are run by each thread in the thread team.

Listing 3. Parallel directive
#pragma omp parallel
{
	// parallel code here is run
	// by every thread in the team
}

Parallel threads don't have to execute the same block of code. You can divide the code into "sections", each of which is run by any one thread in the team. The sections run in parallel with one another. Threads do not proceed beyond the "sections" block until every thread has finished its work within that block, as shown in Listing 4.

Listing 4. Section directive
#pragma omp parallel
{
	// executed by every thread in the team
	#pragma omp sections
	{
		#pragma omp section
		{
			// executed by any one thread in the team
		}
		#pragma omp section
		{
			// executed by any one thread in the team
		}
	}
	// all threads are in sync at this point	

	// executed by every thread in the team
}

If a parallel section contains only sections, you can combine the parallel and sections constructs to improve code readability, as shown in Listing 5.

Listing 5. Combined parallel section directive
#pragma omp parallel sections
{
	#pragma omp section
	{
		// executed by any one thread in the team
	}
	#pragma omp section
	{
		// executed by any one thread in the team
	}
}
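As a small, concrete illustration of the combined directive (the messages and thread-number calls are just for demonstration), each of the two sections below is executed exactly once, by whichever thread picks it up:

#include <omp.h>
#include <cstdio>

int main()
{
  #pragma omp parallel sections
  {
    #pragma omp section
    {
      // run once, by any one thread in the team
      std::printf("section A run by thread %d\n", omp_get_thread_num());
    }
    #pragma omp section
    {
      // run once, by any one thread in the team (possibly a different one)
      std::printf("section B run by thread %d\n", omp_get_thread_num());
    }
  }
  return 0;
}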

Sometimes you may want a certain block of code inside a parallel section to be run by only one thread. For example, you might want to print the status of a global variable once. To do this, use "single" instead of "sections". As with the "sections" block, there is an implicit barrier at the end of a "single" block; if you add the nowait clause, the single block does not hold up the other threads running parallel to it. Listing 6 provides an example of this.

Listing 6. Single directive
#pragma omp parallel
{
	// parallel code here is run
	// by every thread in the team
	#pragma omp single
	{
		// code run by any one thread in the team
	}
	// implicit barrier here, unless a "nowait" clause is added to the single directive
}

You can force a block of code to be run by the master thread, and only the master thread, using the master construct shown in Listing 7. It is similar to the single block, except that it is always the master thread that executes the block and there is no implied barrier at the end.

Listing 7. Master directive
#pragma omp parallel
{
	// parallel code here is run
	// by every thread in the team
	#pragma omp master
	{
		// code run by only the master thread
	}
}

You can parallelize a "for" loop using the for construct. This means that the iterations of the loop are distributed among the threads in the team, as shown in Listing 8.

Listing 8. For directive
#pragma omp parallel
{
	// parallel code here is run
	// by every thread in the team

	#pragma omp for
	for(...)		// regular for loop
	{
		...
	}
}

If your parallel section only contains a parallel "for" loop, you can combine the two constructs as shown in Listing 9.

Listing 9. Combined “parallel for” directive
#pragma omp parallel for
for(...)		// regular for loop
{
	...
}
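For example, here is a minimal sketch of a complete parallel loop (the array size of 8 is arbitrary). The loop variable i is automatically private to each thread, and the iterations are divided among the team:

#include <cstdio>

int main()
{
  const int N = 8;
  int squares[N];

  // iterations 0..N-1 are distributed among the threads in the team
  #pragma omp parallel for
  for (int i = 0; i < N; i++)
  {
    squares[i] = i * i;
  }

  // back on the master thread: print the results sequentially
  for (int i = 0; i < N; i++)
  {
    std::printf("squares[%d] = %d\n", i, squares[i]);
  }
  return 0;
}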

Remember that the loop iterations are not necessarily run in order of iteration number. For example, iteration 9 (run on one thread) can start before iteration 6 (run on another thread). However, there might be a block of code inside the loop that you want to run in sequential iteration order, just as in a non-parallel loop. You can do this by wrapping that block of code in an ordered block. You also need to specify the ordered clause on the enclosing "for" directive, as shown in Listing 10.

Listing 10. Ordered block inside a “for” block using an “ordered” clause
#pragma omp parallel
{
	// parallel code here is run
	// by every thread in the team

	#pragma omp for ordered
	for(...)		// regular for loop
	{
		// code run non-sequentially
		
		#pragma omp ordered
		{
			// code run sequentially
		}
		
		// code run non-sequentially
	}
}
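Here is a concrete sketch of the ordered clause in use (the loop bound and printed values are only illustrative). The squares are computed in whatever order the threads reach them, but the print statement inside the ordered block always runs in iteration order, so the output is sorted by i:

#include <cstdio>

int main()
{
  #pragma omp parallel for ordered
  for (int i = 0; i < 8; i++)
  {
    int square = i * i;          // computed in parallel, in any order

    #pragma omp ordered
    {
      // executed in sequential iteration order: 0, 1, 2, ...
      std::printf("%d squared is %d\n", i, square);
    }
  }
  return 0;
}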

Thread private variables

You may want to declare a variable outside of a parallel section but give each thread its own private copy of it inside the parallel section. This means that every thread in the parallel region has its own version of that variable. To do this, use the private clause, as shown in Listing 11.

Listing 11. Private clause
int var1, var2 ...;
#pragma omp parallel private(var1, var2, ...)
{
	// parallel code here that uses var1, var2
}
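As a minimal sketch of this behavior (the variable name tid and the initial value -1 are only for illustration): each thread gets its own, uninitialized copy of the variable inside the region, and the original variable outside the region is left untouched.

#include <omp.h>
#include <cstdio>

int main()
{
  int tid = -1;

  #pragma omp parallel private(tid)
  {
    // tid here is a separate, uninitialized copy for each thread
    tid = omp_get_thread_num();
    std::printf("hello from thread %d\n", tid);
  }

  // the original tid was never touched by the threads; it is still -1
  std::printf("after the parallel section, tid = %d\n", tid);
  return 0;
}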

In Listing 11, the thread-specific copies are uninitialized, just like any other automatic int variables. If you want to initialize these copies with the value that the original variable had before the parallel section, use the firstprivate clause shown in Listing 12.

Listing 12. Firstprivate clause
int var1, var2 ...;
int var3 = 8;
#pragma omp parallel private(var1, var2, …) firstprivate (var3)
{
	// var3 will have initial value of 8 in this section	
}

If you want to copy the value from the sequentially last loop iteration (or the lexically last section) back into the original variable, use the lastprivate clause. Note that lastprivate applies to worksharing constructs such as "for" and "sections", not to the parallel directive itself, so Listing 13 uses a combined "parallel for".

Listing 13. Lastprivate clause
int var3 = 8;
#pragma omp parallel for lastprivate(var3)
for (int i = 0; i < 100; i++)
{
	// each thread works on its own copy of var3
	var3 = i;
}
// var3 now holds the value from the last iteration: 99

You can declare a global variable to be thread-private in all parallel regions and maintain its value across all parallel regions. This concept is similar to a thread-private static global variable. To do this, use the threadprivate directive.

Note: There is a distinction between a private variable (thread and parallel region specific) and a thread-private variable (thread specific).

Listing 14. Threadprivate directive
int a=0;
#pragma omp threadprivate(a)

// identifier "a" in non-parallel sections will point to the threadprivate version belonging to the master thread (a.k.a thread number 0)

#pragma omp parallel
{
	a++;
	// a=1 for each thread
}

// a=1 for master thread

#pragma omp parallel
{
	a++;
	// a=2 for each thread
}

Before a parallel section starts to run, you might want to copy the value of a thread-private variable from the master thread to all other threads. This is achieved with the copyin clause, as shown in Listing 15.

Listing 15. Copyin clause
int a=4;
#pragma omp threadprivate(a)

#pragma omp parallel copyin(a)
{
	a++;
	// a=5 for each thread
}

When multiple threads read and write shared variables, you might want to make sure that each thread has "flushed" its updates back to the original memory location and will re-read the values rather than use stale copies held in registers. The flush directive makes the executing thread's view of the flushed variables consistent with memory, as shown in Listing 16.

Listing 16. Flush directive
int shared;
#pragma omp parallel
{
	// a certain thread might update the variable "shared"
	
	// flush so that this thread writes its update to memory
	// and re-reads "shared" instead of using a stale register copy
	#pragma omp flush(shared)

	// work with updated value of "shared"
}
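The classic use of flush is a producer/consumer handshake through a flag variable. The following sketch assumes at least two threads are available (hence the num_threads(2) clause): the first section publishes some data and then raises the flag, and the second section spins until it sees the flag before reading the data. Note that flush only makes memory consistent; it does not by itself synchronize the threads.

#include <cstdio>

int main()
{
  int data = 0;
  int flag = 0;

  #pragma omp parallel sections num_threads(2)
  {
    #pragma omp section
    {
      data = 42;
      #pragma omp flush(data)   // make data visible before the flag is raised
      flag = 1;
      #pragma omp flush(flag)
    }
    #pragma omp section
    {
      int seen = 0;
      while (!seen)
      {
        #pragma omp flush(flag) // re-read flag from memory each time around
        seen = flag;
      }
      #pragma omp flush(data)   // ensure the updated data is read
      std::printf("received data = %d\n", data);
    }
  }
  return 0;
}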

Use the reduction clause to combine the values of the per-thread copies of a variable when a parallel section completes and store the result back into the original variable. Specify the name of the variable and the operator used to combine the values. The reduction clause implicitly gives each thread its own copy of the variable, initialized to the identity value of the operator (0 for +, 1 for *), and the combined result also includes the original value of the variable. Listing 17 uses the "reduction" clause to add one variable across the threads and multiply another.

Listing 17. Reduction clause
int a = 0, b = 1;
#pragma omp parallel reduction(+:a) reduction(*:b)
{
	a = 4;
	b = 3;
	a++;
	// a = 5 and b = 3 in each thread's private copy
}

// a = 5 * (number of threads)
// b = 3 ^ (number of threads)
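A more typical use of reduction is accumulating a result across the iterations of a parallel loop. In this sketch (summing the integers 1 through 100 is just an example), each thread adds into its own private copy of sum, and the copies are combined with the original value 0 when the loop ends:

#include <cstdio>

int main()
{
  int sum = 0;

  // each thread accumulates into a private copy of "sum" initialized to 0;
  // the copies and the original value are added together at the end
  #pragma omp parallel for reduction(+:sum)
  for (int i = 1; i <= 100; i++)
  {
    sum += i;
  }

  std::printf("sum = %d\n", sum);   // prints 5050
  return 0;
}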

Synchronization

As a parallel programmer, you need to account for critical sections. These are sections of the code that should be run by one thread at a time. This is usually required when data shared by parallel threads is being modified. You can use the critical directive to create a critical section, as shown in Listing 18.

Listing 18. Critical directive
#pragma omp parallel
{
	// parallel code
	#pragma omp critical
	{
		// critical section run
		// by exactly one thread at a time
	}
}
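As a short sketch of why the critical directive matters (the counter variable is just for illustration): without the critical section, several threads could read and write counter at the same time and updates could be lost; with it, the increments are serialized and the final value equals the number of threads in the team.

#include <cstdio>

int main()
{
  int counter = 0;

  #pragma omp parallel
  {
    #pragma omp critical
    {
      // only one thread at a time executes this increment
      counter++;
    }
  }

  std::printf("counter = %d\n", counter);  // equals the number of threads
  return 0;
}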

Create a barrier if you need all parallel threads to reach a certain point before any one of them can proceed further. This means that no thread will go past the barrier until all other threads have reached it, as shown in Listing 19.

Note: There is an implicit barrier at the end of every parallel section. The master thread cannot proceed until all threads have finished parallel section execution.

Listing 19. Barrier directive
#pragma omp parallel
{
	// parallel code that makes threads go out of step (e.g., a critical section)
	#pragma omp critical
	{
		// critical section run by exactly one thread at a time
	}

	#pragma omp barrier
	// all threads are now at the same execution point

	// parallel code
}

Compiling for OpenMP

You have now written code that uses OpenMP parallelism. If you compile this code with your regular compile options, you will not see parallel execution; the program runs as though it were single threaded. To produce an executable that uses OpenMP parallelism, consider the following options at compile time (a sample invocation follows the list):

  • -qSMP=EXPLICIT enables symmetric multi-processing and OpenMP parallel execution. A side effect is that -O2 and -qHOT are enabled automatically. You can also just use -qSMP since EXPLICIT is a default sub-option.
  • -q64 is required because symmetric multi-processing, and therefore OpenMP, is supported only in 64-bit mode.
  • -O2 and -qHOT are implied by -qSMP. However, you can minimize optimization within parallel code sections by specifying -qSMP=NOOPT, which makes debugging parallel sections easier. If nothing is specified, the default is -qSMP=OPT.
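For example, under z/OS UNIX System Services a compile and run of the Hello World program from Listing 1 might look like the following (the file and program names are only for illustration, and the exact compiler invocation can vary by installation):

xlC -q64 -qSMP=EXPLICIT -o hello hello.cpp
export OMP_NUM_THREADS=4
./hello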

Limitations

OpenMP support in XLC v2.1 for z/OS has the following limitations:

  • Debugger support is not available for programs compiled with -qSMP=EXPLICIT.
  • Nested parallelism is not supported. This means you should not use an "omp parallel" block inside another "omp parallel" block.

Summary

The examples in this article illustrate the power and usability of the OpenMP API to create parallel programs on z/OS.
