Block decompositions of matrices as you've described in your question are a perfectly valid way of distributing a matrix, but it is not the only way to do it.
In particular, a block data distribution (breaking the matrix up into procrows x proccols sub-matrices) is a little inflexible. If the matrix size is not divisible by the number of processes in a row or column - and generally you have no control over the size of the matrix, and only some flexibility with procrows/proccols - you can end up with serious load-balancing issues. Also, it's sometimes very handy to be able to "over-decompose" a problem: to break it up into more pieces than you have tasks. In particular, for MPI, since each task is a process, it's sometimes useful to be able to have multiple sub-matrices for each process to operate on, so that you can tackle this additional level of parallelism with threading (which is built into most single-process linear algebra libraries).
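To make the imbalance concrete, here's a minimal C sketch of how a plain block distribution hands out rows when the matrix size doesn't divide evenly; the values N = 100 and procrows = 7 are made up purely for illustration:

```c
#include <stdio.h>

/* Illustration only: rows owned by each process row under a simple
 * block distribution that gives every process ceil(N/P) contiguous
 * rows, with the last process taking whatever is left over.
 * N (matrix rows) and P (procrows) are made-up example values. */
int main(void) {
    const int N = 100;   /* global number of rows (example value)  */
    const int P = 7;     /* number of process rows (example value) */
    const int chunk = (N + P - 1) / P;   /* ceil(N/P) rows each    */

    for (int p = 0; p < P; p++) {
        int first = p * chunk;
        int last  = first + chunk;       /* one past the end */
        if (last > N) last = N;
        int nrows = (last > first) ? (last - first) : 0;
        printf("process row %d owns %d rows\n", p, nrows);
    }
    /* With N = 100 and P = 7, the first six process rows own 15 rows
     * each while the last owns only 10 -- the imbalance described above. */
    return 0;
}
```

Here the last process row has a third less work than the others, so it just waits at every synchronization point.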
The way to get the most flexibility for load-balancing, and to have the highest degree of inter-process parallelism available, is a purely cyclic distribution. In a 1d cyclic distribution, say dividing 15 items up among 4 processors, processor 1 would get item 1, processor 2 item 2, processor 3 item 3, processor 4 item 4, and then processor 1 would get item 5, and so on; you round-robin the items across the processors.
On the other hand, in a 1d block decomposition, processor 1 would get items 1-4, processor 2 would get items 5-8, and so on.
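As a sketch (not any particular library's API), the two 1d mappings can be written down directly; 15 items and 4 processors are the example values from the text, numbered from 1 to match it:

```c
#include <stdio.h>

/* Which processor owns item i when 15 items are dealt out to 4
 * processors, under the cyclic and block mappings described above. */
int main(void) {
    const int nitems = 15;
    const int nprocs = 4;
    const int chunk  = (nitems + nprocs - 1) / nprocs;  /* ceil(15/4) = 4 */

    printf("item  cyclic-owner  block-owner\n");
    for (int i = 1; i <= nitems; i++) {
        int cyclic_owner = (i - 1) % nprocs + 1;   /* round-robin            */
        int block_owner  = (i - 1) / chunk + 1;    /* contiguous chunks of 4 */
        printf("%4d  %12d  %11d\n", i, cyclic_owner, block_owner);
    }
    return 0;
}
```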
A figure from the useful LLNL parallel computing tutorial follows, with each color labelling which processor got a region of the data:
So the cyclic decomposition is maximally good for parallelism and load-balancing, but it's terrible for data access; every neighbouring piece of data you'd want to be able to access to do linear algebra operations is off-processor. On the other hand, the block decomposition is maximally good for data access; you have as large a contiguous chunk of data as possible, so you can do matrix operations on nice big sub-matrices; but it's inflexible for parallelism and can cost in terms of load-balancing.
Block-cyclic is an interpolation between the two; you over-decompose the matrix into blocks, and cyclically distribute those blocks across processes. This lets you tune the tradeoff between data access contiguity and flexibility. If the block-cyclic block sizes are 1, you have a cyclic distribution; if they are N/procrows or N/proccols, you have a block distribution; but you can also have anything in between.
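Here's a small sketch of the 1d block-cyclic index map this describes - global index to (owning process, local index) - with example values N = 15, NB = 2 and 4 processes; the function name block_cyclic is just for illustration:

```c
#include <stdio.h>

/* A sketch of the 1d block-cyclic index map: global index -> (owner, local index).
 * With NB = 1 this degenerates to the pure cyclic map above; with
 * NB = ceil(N/nprocs) it degenerates to the pure block map. */
static void block_cyclic(int gidx, int nb, int nprocs, int *owner, int *lidx) {
    int block = gidx / nb;                 /* which block this index falls in  */
    *owner = block % nprocs;               /* blocks are dealt out round-robin */
    int local_block = block / nprocs;      /* how many of my blocks precede it */
    *lidx = local_block * nb + gidx % nb;  /* position within my local storage */
}

int main(void) {
    const int N = 15, NB = 2, NPROCS = 4;  /* example values */
    for (int g = 0; g < N; g++) {
        int owner, lidx;
        block_cyclic(g, NB, NPROCS, &owner, &lidx);
        printf("global %2d -> process %d, local %d\n", g, owner, lidx);
    }
    return 0;
}
```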
Note that in 2D you can in principle choose different decompositions along the rows and along the columns, and sometimes that's useful if your matrix is only going to be used in one sort of computation; but the more usual case is that the decomposition is the same along all dimensions, so that when people say a "block decomposition" or "block-cyclic decomposition" they generally mean it applies along all dimensions.
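In 2d, the map is just the 1d map applied independently to rows and columns. The following sketch (example values only: an 8x8 matrix, MB = NB = 2, a 2x2 process grid) prints which process-grid coordinate owns each matrix entry:

```c
#include <stdio.h>

/* The 2d block-cyclic map: row blocks of size MB over procrows process
 * rows, column blocks of size NB over proccols process columns.
 * MB and NB (and the grid shape) are free parameters and need not be equal. */
static int owner_1d(int gidx, int nb, int nprocs) {
    return (gidx / nb) % nprocs;
}

int main(void) {
    const int M = 8, N = 8;            /* global matrix size (example) */
    const int MB = 2, NB = 2;          /* row / column block sizes     */
    const int PROCROWS = 2, PROCCOLS = 2;

    /* Print the process-grid coordinates that own each matrix entry. */
    for (int i = 0; i < M; i++) {
        for (int j = 0; j < N; j++) {
            int pr = owner_1d(i, MB, PROCROWS);
            int pc = owner_1d(j, NB, PROCCOLS);
            printf("(%d,%d) ", pr, pc);
        }
        printf("\n");
    }
    return 0;
}
```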
A good description of this is on the Scalapack pages at netlib.
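For the bookkeeping of how many rows or columns end up on a given process, ScaLAPACK has the NUMROC helper (see the comments below). The following is my own C re-derivation of that count under a 1d block-cyclic distribution, written as an illustration rather than taken from the library:

```c
#include <stdio.h>

/* How many rows (or columns) land on a given process under a 1d
 * block-cyclic distribution: n items, blocks of size nb, dealt out over
 * nprocs processes starting from process isrcproc. */
static int local_count(int n, int nb, int iproc, int isrcproc, int nprocs) {
    int mydist  = (nprocs + iproc - isrcproc) % nprocs; /* distance from the source process */
    int nblocks = n / nb;                               /* number of complete blocks        */
    int count   = (nblocks / nprocs) * nb;              /* full "rounds" every process gets */
    int extra   = nblocks % nprocs;                     /* leftover complete blocks         */
    if (mydist < extra)
        count += nb;          /* this process gets one more full block     */
    else if (mydist == extra)
        count += n % nb;      /* this process gets the final partial block */
    return count;
}

int main(void) {
    /* Example: 15 rows, block size 2, 4 processes, source process 0. */
    for (int p = 0; p < 4; p++)
        printf("process %d owns %d rows\n", p, local_count(15, 2, p, 0, 4));
    return 0;
}
```

With these example values the counts come out as 4, 4, 4, 3, consistent with the block-cyclic map sketched earlier.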
Comments:

- Nb = Mb = 2. Thanks for your comment, but it doesn't make sense to me, at least at this point. :/ – Respirable
- procrows = Nb? Also, are you going to post an answer? – Respirable
- The NUMROC documentation contains some description: computing.llnl.gov/tutorials/allineaDDT/programs/trisol/… www-01.ibm.com/support/knowledgecenter/SSNR5K_4.2.0/… I am not sure what exactly a block is. You must study it yourself, I do not use this library at all. – Korwun