#pragma omp distribute parallel for
Purpose
The omp distribute parallel for directive executes a loop by using multiple teams, where each team typically consists of several threads. The loop iterations are distributed across the teams in chunks in a round-robin fashion.
Syntax
#pragma omp distribute parallel for [clause [[,] clause] ...]
for-loops
Parameters
The omp distribute parallel for construct is a composite construct. clause can be any of the clauses that are accepted by the omp distribute or omp parallel for directive, except the linear and ordered clauses. A specified clause has the same meaning and restrictions as it does on the omp distribute or omp parallel for directive.
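For illustration only, the following sketch combines clauses drawn from both constituent directives: dist_schedule comes from omp distribute, while num_threads and schedule come from omp parallel for. The array size and clause arguments are arbitrary assumptions, and b is assumed to be initialized before the region.
enum { N = 256 };
int a[N], b[N];
#pragma omp target teams num_teams(4) map(to: b) map(tofrom: a)
#pragma omp distribute parallel for dist_schedule(static, 64) \
        num_threads(8) schedule(static)
for (int i = 0; i < N; ++i)
    a[i] = 2 * b[i];    /* 64-iteration chunks go to teams; threads split each chunk */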
Usage
The omp distribute parallel for directive takes effect only if you specify both the -qsmp and -qoffload compiler options.
Rules
If a specified clause other than the collapse clause is applicable to both the omp distribute and omp parallel for directives, it is applied twice; the collapse clause is applied only once, as the sketch below illustrates.
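In the following sketch (the array bounds are arbitrary assumptions), collapse(2) is applied once to the composite construct: the two loops form a single iteration space of M*N iterations, which is then distributed across teams and threads.
enum { M = 16, N = 32 };
int a[M][N];
#pragma omp target teams map(tofrom: a)
#pragma omp distribute parallel for collapse(2)
for (int m = 0; m < M; ++m)
    for (int n = 0; n < N; ++n)
        a[m][n] = m + n;    /* all M*N iterations share one collapsed space */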
The iterations of the loops that are associated with the omp distribute parallel for directive are distributed across the teams that bind to the loop construct. Each team is assigned a chunk of the loop iterations. The size of the chunks is determined according to the clauses that apply to the omp distribute directive. Each chunk forms a parallel loop, and the parallel loop is distributed across the threads that participate in the team region according to the clauses that apply to the omp parallel for directive.
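Conceptually, for a static distribution with chunk size CHUNK, the composite construct behaves roughly like the following hand-coded nesting. This is an illustrative sketch of the semantics rather than the actual compiler transformation; N, CHUNK, and the arrays a, b, and c are assumptions.
#pragma omp target teams map(to: b, c) map(tofrom: a)
#pragma omp distribute                      /* chunks are spread across the teams  */
for (int start = 0; start < N; start += CHUNK) {
    int end = (start + CHUNK < N) ? (start + CHUNK) : N;
    #pragma omp parallel for                /* each chunk is itself a parallel loop */
    for (int i = start; i < end; ++i)
        a[i] = b[i] + c[i];
}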
Examples
int main(void)
{
    const int N = 8;
    int A[N], B[N], C[N];
    int k = 4;
    int nteams = 4;                  /* must not exceed N so that the chunk size is positive */
    int block_threads = N / nteams;  /* chunk size for dist_schedule: 2 */

    /* Initialize the arrays on the host. */
    for (int i = 0; i < N; ++i) {
        A[i] = 0;
        B[i] = i;
        C[i] = 3 * i;
    }

    /* Offload the loop: iterations are distributed across nteams teams
       in chunks of block_threads, then across the threads in each team. */
    #pragma omp target map(tofrom: A) map(to: B, C)
    #pragma omp teams num_teams(nteams)
    #pragma omp distribute parallel for dist_schedule(static, block_threads)
    for (int i = 0; i < N; ++i) {
        A[i] = B[i] + k * C[i];
    }

    return 0;
}
First, the arrays A, B, and C and the scalar variables k, nteams, and block_threads are declared and initialized in the host environment.
Then, a target region is declared, and the arrays A, B, and C are explicitly mapped into the device environment. At the start of the target region, storage for A, B, and C is allocated on the device. The device copy of each array is then initialized with the content of the corresponding array on the host. The scalar variables k, nteams, and block_threads are implicitly mapped by the compiler as firstprivate because they are not explicitly mapped and the defaultmap(tofrom:scalar) clause is not present.
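By contrast, if the defaultmap(tofrom:scalar) clause were added to the target directive, the scalar variables would be mapped tofrom instead, so any device-side updates to them would be copied back to the host. A sketch of that variant:
/* k, nteams, and block_threads are now mapped tofrom rather than
   being implicitly treated as firstprivate. */
#pragma omp target map(tofrom: A) map(to: B, C) defaultmap(tofrom:scalar)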
The target region is executed by nteams teams of threads.
The loop iterations are first distributed across the teams in chunks of size equal to the value of the block_threads variable. Each chunk of iterations is further distributed across the threads in each team.
At the end of the target region, the copy of array A on the device is copied back into the host environment.
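A minimal host-side check of the copied-back values might look like the following sketch, assuming that <stdio.h> is included; with k = 4 and C[i] = 3*i, each element of A should equal 13*i.
for (int i = 0; i < N; ++i)
    printf("A[%d] = %d, expected %d\n", i, A[i], 13 * i);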