An important aspect of the data distributions described here is that independent
distributions are applied over each dimension of the data structure. The algorithms presented here
for the vector in one dimension can, therefore, be used for the rows and columns of a matrix, or
even for data structures with more dimensions.
Consider the distribution of a vector x of M data
objects (elements) over P processes. This
can be described by a mapping of the global index m (0 ≤ m < M)
of a data object to an index pair (p,i),
where p (0 ≤ p < P)
specifies the process to which the data object is mapped, and i specifies
its location in the local array.
Two common distributions are the
block and
cyclic.
The block distribution is often used when the computational load
is distributed homogeneously over a regular data structure, such as
a Cartesian grid. It assigns blocks of size
r of
the global vector to the processes. For block distribution, the mapping
m(
p,
i)
is defined as:
- m↦—>(floor(m/L), m mod L)
where
L = ceiling(
M/
P).
The cyclic distribution (also known as the wrapped or scattered decomposition)
is commonly used to improve load balance when the computational load
is distributed inhomogeneously over a regular data structure. The
cyclic distribution assigns consecutive entries of the global vector
to successive processes. For cyclic distribution, the mapping
m(
p,
i)
is defined as:
- m↦—>(m mod P,
floor(m/P))
Examples of block and cyclic distribution are shown in
Table 1 and
Table 2,
where
M = 23 data objects are distributed
over
P = 3 processes, using
r = 8
block size. As shown in the examples, there can be uneven distribution,
where the last block is smaller than the others. A global block number
B is
shown for block distribution. For cyclic distribution, there is no
concept of block numbers.
Table 1. Block Distribution
Block Distribution |
m |
0 1 2 3 4 5 6 7 |
8 9 10 11 12 13 14 15 |
16 17 18 19 20 21 22 |
p |
0 0 0 0 0 0 0 0 |
1 1 1 1 1 1 1 1 |
2 2 2 2 2 2 2 |
i |
0 1 2 3 4 5 6 7 |
0 1 2 3 4 5 6 7 |
0 1 2 3 4 5 6 |
B |
0 0 0 0 0 0 0 0 |
1 1 1 1 1 1 1 1 |
2 2 2 2 2 2 2 |
Table 2. Cyclic Distribution
Cyclic Distribution |
m |
0 1 2 |
3 4 5 |
6 7 8 |
9 10 11 |
12 13 14 |
15 16 17 |
18 19 20 |
21 22 |
p |
0 1 2 |
0 1 2 |
0 1 2 |
0 1 2 |
0 1 2 |
0 1 2 |
0 1 2 |
0 1 |
i |
0 0 0 |
1 1 1 |
2 2 2 |
3 3 3 |
4 4 4 |
5 5 5 |
6 6 6 |
7 7 |
The block-cyclic distribution is a generalization of the block
and cyclic distributions, in which blocks of
r consecutive
data objects are distributed cyclically over the
p processes.
This can be described by a mapping of the global index
m (0
≤ m < M)
of a data object to an index triplet (
p,b,i),
where
p (0
≤ p < P)
specifies the process to which the data object is mapped,
b is
the block number in process
p, and
i is
the location in the block. For block-cyclic distribution, the mapping
m(
p,
b,
i)
is defined as:
- m↦—>(floor((m mod T)/r),
floor(m/T), m mod r)
where
T = rP.
(It should be noted that this reverts to the cyclic distribution when
r = 1
and a block distribution when
r = L.)
The inverse mapping to a global index (
p,
b,
i)
m is
defined by:
where
B = p+
bP is
the global block number. An example of block-cyclic distribution is
shown in
Table 3, where
M = 23
data objects are distributed over
P = 3
processes, using
r = 2 block size. As shown in the example, there can be uneven distribution, where the last block
is smaller than the others. The inverse mapping is shown in the second
part of the example. (This shows what is stored in the local array
on each of the three processes.)
Table 3. Block-Cyclic Distribution
Block-Cyclic Distribution |
m |
0 1 2 3 4 5 |
6 7 8 9 10 11 |
12 13 14 15 16 17 |
18 19 20 21 22 |
p |
0 0 1 1 2 2 |
0 0 1 1 2 2 |
0 0 1 1 2 2 |
0 0 1 1 2 |
b |
0 0 0 0 0 0 |
1 1 1 1 1 1 |
2 2 2 2 2 2 |
3 3 3 3 3 |
i |
0 1 0 1 0 1 |
0 1 0 1 0 1 |
0 1 0 1 0 1 |
0 1 0 1 0 |
B |
0 0 1 1 2 2 |
3 3 4 4 5 5 |
6 6 7 7 8 8 |
9 9 10 10 11 |
Table 4. Inverse Mapping
of Block-Cyclic Distribution
Inverse Mapping of Block-Cyclic
Distribution |
m |
0 1 6 7 12 13 18 19 |
2 3 8 9 14 15 20 21 |
4 5 10 11 16 17 22 |
p |
0 0 0 0 0 0 0 0 |
1 1 1 1 1 1 1 1 |
2 2 2 2 2 2 2 |
b |
0 0 1 1 2 2 3 3 |
0 0 1 1 2 2 3 3 |
0 0 1 1 2 2 3 |
i |
0 1 0 1 0 1 0 1 |
0 1 0 1 0 1 0 1 |
0 1 0 1 0 1 0 |
B |
0 0 3 3 6 6 9 9 |
1 1 4 4 7 7 10 10 |
2 2 5 5 8 8 11 |
In decomposing an m × n matrix, A,
independent block-cyclic distributions are applied in the row and
column directions. Thus, suppose the matrix rows are distributed
with block size r over P processes
by the λr,P block-cyclic
mapping, and the matrix columns are distributed with block size s over Q processes
by the ψs,Q block-cyclic
mapping. Then the matrix element indexed globally by (m, n)
is mapped as follows:
The distribution of the matrix can be regarded as the tensor product
of the row and column distributions, which can be expressed as:
- (m, n)((p, q),(b, d),(i, j))
The block-cyclic matrix distribution expressed previously distributes
blocks of size r × s to
a grid of P × Q processes.
An example of block-cyclic distribution of an
m × n = 16
× 30 matrix
with block size
r × s = 3
× 4 and a
P × Q = 2
× 3 process
grid is shown in
Table 5 and
Table 6. The numbers in the leftmost column and
on the top of the matrix represent the global row and column numbers
B and
D,
respectively.
Table 5 shows the assignment
of global blocks (
B,D) to processes (P,Q).
Table 6 shows which global blocks each process
contains.
In this example, the global matrix dimensions are not divisible
by the respective block sizes. All the row blocks are of size 3, except
the last row block, which only contains 1 row. All column blocks
are of size 4, except the last column block, which contains 2 columns.
For example, global block (5,0) is 1
× 4, global
block (1,7) is 3
× 2,
and global block (0,0) is 3
× 4. The global
block (5,7) is 1
× 2.
The asterisk (*) in
Table 5 denotes which
global blocks contain left over data; that is, the blocks that are
not 3
× 4.
Table 5. Block Distribution
Over a 2 by 3 Process Grid
B,D |
0 |
1 |
2 |
3 |
4 |
5 |
6 |
7 |
0 |
P00 |
P01 |
P02 |
P00 |
P01 |
P02 |
P00 |
P01* |
1 |
P10 |
P11 |
P12 |
P10 |
P11 |
P12 |
P10 |
P11* |
2 |
P00 |
P01 |
P02 |
P00 |
P01 |
P02 |
P00 |
P01* |
3 |
P10 |
P11 |
P12 |
P10 |
P11 |
P12 |
P10 |
P11* |
4 |
P00 |
P01 |
P02 |
P00 |
P01 |
P02 |
P00 |
P01* |
5 |
P10* |
P11* |
P12* |
P10* |
P11* |
P12* |
P10* |
P11* |
Table 6. Data Distribution from a Process
Point-of-View
B,D |
0 |
3 |
6 |
1 |
4 |
7 |
2 |
5 |
0 |
|
|
|
|
|
* |
|
|
2 |
|
P00 |
|
|
P01 |
* |
|
P02 |
4 |
|
|
|
|
|
* |
|
|
1 |
|
|
|
|
|
* |
|
|
3 |
|
P10 |
|
|
P11 |
* |
|
P12 |
5 |
* |
* |
* |
* |
* |
* |
* |
* |
Table 7. Distributed Matrix elements from a Process Point-of-View
B,D |
0 |
3 |
6 |
1 |
4 |
7 |
2 |
5 |
0 |
a0:2,0:3 |
a0:2,12:15 |
a0:2,24:27 |
a0:2,4:7 |
a0:2,16:19 |
a0:2,28:29* |
a0:2,8:11 |
a0:2,20:23 |
2 |
a6:8,0:3 |
a6:8,12:15 |
a6:8,24:27 |
a6:8,4:7 |
a6:8,16:19 |
a6:8,28:29* |
a6:8,8:11 |
a6:8,20:23 |
4 |
a12:14,0:3 |
a12:14,12:15 |
a12:14,24:27 |
a12:14,4:7 |
a12:14,16:19 |
a12:14,28:29* |
a12:14,8:11 |
a12:14,20:23 |
1 |
a3:5,0:3 |
a3:5,12:15 |
a3:5,24:27 |
a3:5,4:7 |
a3:5,16:19 |
a3:5,28:29* |
a3:5,8:11 |
a3:5,20:23 |
3 |
a9:11,0:3 |
a9:11,12:15 |
a9:11,24:27 |
a9:11,4:7 |
a9:11,16:19 |
a9:11,28:29* |
a9:11,8:11 |
a9:11,20:23 |
5 |
a15,0:3* |
a15,12:15* |
a15,24:27* |
a15,4:7* |
a15,16:19* |
a15,28:29* |
a15,8:11* |
a15,20:23* |