Data distribution techniques

An important aspect of the data distributions described here is that independent distributions are applied over each dimension of the data structure. The algorithms presented here for the vector in one dimension can, therefore, be used for the rows and columns of a matrix, or even for data structures with more dimensions.

Consider the distribution of a vector x of M data objects (elements) over P processes. This can be described by a mapping of the global index m (0m < M) of a data object to an index pair (p,i), where p (0p < P) specifies the process to which the data object is mapped, and i specifies its location in the local array.

Two common distributions are the block and cyclic. The block distribution is often used when the computational load is distributed homogeneously over a regular data structure, such as a Cartesian grid. It assigns blocks of size r of the global vector to the processes. For block distribution, the mapping m(p, i) is defined as:
  • m—>(floor(m/L), m mod L)
where L = ceiling(M/P). The cyclic distribution (also known as the wrapped or scattered decomposition) is commonly used to improve load balance when the computational load is distributed inhomogeneously over a regular data structure. The cyclic distribution assigns consecutive entries of the global vector to successive processes. For cyclic distribution, the mapping m(p, i) is defined as:
  • m—>(m mod P, floor(m/P))
Examples of block and cyclic distribution are shown in Table 1 and Table 2, where M = 23 data objects are distributed over P = 3 processes, using r = 8 block size. As shown in the examples, there can be uneven distribution, where the last block is smaller than the others. A global block number B is shown for block distribution. For cyclic distribution, there is no concept of block numbers.
Table 1. Block Distribution
Block Distribution
m 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22
p 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2
i 0 1 2 3 4 5 6 7 0 1 2 3 4 5 6 7 0 1 2 3 4 5 6
B 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2
Table 2. Cyclic Distribution
Cyclic Distribution
m 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22
p 0 1 2 0 1 2 0 1 2 0 1 2 0 1 2 0 1 2 0 1 2 0 1
i 0 0 0 1 1 1 2 2 2 3 3 3 4 4 4 5 5 5 6 6 6 7 7
The block-cyclic distribution is a generalization of the block and cyclic distributions, in which blocks of r consecutive data objects are distributed cyclically over the p processes. This can be described by a mapping of the global index m (0m < M) of a data object to an index triplet (p,b,i), where p (0p < P) specifies the process to which the data object is mapped, b is the block number in process p, and i is the location in the block. For block-cyclic distribution, the mapping m(p, b, i) is defined as:
  • m—>(floor((m mod T)/r), floor(m/T), m mod r)
where T = rP. (It should be noted that this reverts to the cyclic distribution when r = 1 and a block distribution when r = L.) The inverse mapping to a global index (p, b, i)m is defined by:
  • (p, b, i)Br+i = pr+bT+i
where B = p+bP is the global block number. An example of block-cyclic distribution is shown in Table 3, where M = 23 data objects are distributed over P = 3 processes, using r = 2 block size. As shown in the example, there can be uneven distribution, where the last block is smaller than the others. The inverse mapping is shown in the second part of the example. (This shows what is stored in the local array on each of the three processes.)
Table 3. Block-Cyclic Distribution
Block-Cyclic Distribution
m 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22
p 0 0 1 1 2 2 0 0 1 1 2 2 0 0 1 1 2 2 0 0 1 1 2
b 0 0 0 0 0 0 1 1 1 1 1 1 2 2 2 2 2 2 3 3 3 3 3
i 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0
B 0 0 1 1 2 2 3 3 4 4 5 5 6 6 7 7 8 8 9 9 10 10 11
Table 4. Inverse Mapping of Block-Cyclic Distribution
Inverse Mapping of Block-Cyclic Distribution
m 0 1 6 7 12 13 18 19 2 3 8 9 14 15 20 21 4 5 10 11 16 17 22
p 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2
b 0 0 1 1 2 2 3 3 0 0 1 1 2 2 3 3 0 0 1 1 2 2 3
i 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0
B 0 0 3 3 6 6 9 9 1 1 4 4 7 7 10 10 2 2 5 5 8 8 11

In decomposing an m × n matrix, A, independent block-cyclic distributions are applied in the row and column directions. Thus, suppose the matrix rows are distributed with block size r over P processes by the λr,P block-cyclic mapping, and the matrix columns are distributed with block size s over Q processes by the ψs,Q block-cyclic mapping. Then the matrix element indexed globally by (m, n) is mapped as follows:

In decomposing an m x n matrix, A, independent block-cyclic distributions are applied in the row and column directions. Thus, suppose the matrix rows are distributed with block size r over P processes by the lambda(sub)r,P block-cyclic mapping, and the matrix columns are distributed with block size s over Q processes by the psi(sub)s,Q block-cyclic mapping. Then the matrix element indexed globally by (m, n) is mapped as shown in the graphic.:
The distribution of the matrix can be regarded as the tensor product of the row and column distributions, which can be expressed as:
  • (m, n)((p, q),(b, d),(i, j))

The block-cyclic matrix distribution expressed previously distributes blocks of size r × s to a grid of P × Q processes.

An example of block-cyclic distribution of an m × n = 16 × 30 matrix with block size r × s = 3 × 4 and a P × Q = 2 × 3 process grid is shown in Table 5 and Table 6. The numbers in the leftmost column and on the top of the matrix represent the global row and column numbers B and D, respectively. Table 5 shows the assignment of global blocks (B,D) to processes (P,Q). Table 6 shows which global blocks each process contains.
In this example, the global matrix dimensions are not divisible by the respective block sizes. All the row blocks are of size 3, except the last row block, which only contains 1 row. All column blocks are of size 4, except the last column block, which contains 2 columns. For example, global block (5,0) is 1 × 4, global block (1,7) is 3 × 2, and global block (0,0) is 3 × 4. The global block (5,7) is 1 × 2. The asterisk (*) in Table 5 denotes which global blocks contain left over data; that is, the blocks that are not 3 × 4.
Table 5. Block Distribution Over a 2 by 3 Process Grid
B,D 0 1 2 3 4 5 6 7
0 P00 P01 P02 P00 P01 P02 P00 P01*
1 P10 P11 P12 P10 P11 P12 P10 P11*
2 P00 P01 P02 P00 P01 P02 P00 P01*
3 P10 P11 P12 P10 P11 P12 P10 P11*
4 P00 P01 P02 P00 P01 P02 P00 P01*
5 P10* P11* P12* P10* P11* P12* P10* P11*
Table 6. Data Distribution from a Process Point-of-View
B,D 0 3 6 1 4 7 2 5
0           *    
2   P00     P01 *   P02
4           *    
1           *    
3   P10     P11 *   P12
5 * * * * * * * *
Table 7. Distributed Matrix elements from a Process Point-of-View
B,D 0 3 6 1 4 7 2 5
0 a0:2,0:3 a0:2,12:15 a0:2,24:27 a0:2,4:7 a0:2,16:19 a0:2,28:29* a0:2,8:11 a0:2,20:23
2 a6:8,0:3 a6:8,12:15 a6:8,24:27 a6:8,4:7 a6:8,16:19 a6:8,28:29* a6:8,8:11 a6:8,20:23
4 a12:14,0:3 a12:14,12:15 a12:14,24:27 a12:14,4:7 a12:14,16:19 a12:14,28:29* a12:14,8:11 a12:14,20:23
1 a3:5,0:3 a3:5,12:15 a3:5,24:27 a3:5,4:7 a3:5,16:19 a3:5,28:29* a3:5,8:11 a3:5,20:23
3 a9:11,0:3 a9:11,12:15 a9:11,24:27 a9:11,4:7 a9:11,16:19 a9:11,28:29* a9:11,8:11 a9:11,20:23
5 a15,0:3* a15,12:15* a15,24:27* a15,4:7* a15,16:19* a15,28:29* a15,8:11* a15,20:23*