Understanding Block and Block-Cyclic Matrix Distributions
In working with parallel decompositions of matrices, I'm familiar with the block distribution, where we have (say) 4 processes, each with its own subregion of the matrix:

Block Matrix Decomposition

So for instance, here the number of processes in a row (procrows) is 2, the number of processes in a column (proccols) is also 2, and each local sub-matrix A_local has size N/2 x M/2 if the original matrix size is N x M.

I am reading this example which uses a "block-cyclic" distribution, and in this part:

/* Begin Cblas context */
/* We assume that we have 4 processes and place them in a 2-by-2 grid */
int ctxt, myid, myrow, mycol, numproc;
int procrows = 2, proccols = 2;
Cblacs_pinfo(&myid, &numproc);
Cblacs_get(0, 0, &ctxt);
Cblacs_gridinit(&ctxt, "Row-major", procrows, proccols);

procrows and proccols are hardcoded, fine, but for a matrix that's read in, there's a header:

Nb and Mb will be the number of rows and columns of the blocks [of the matrix]

I don't understand this; aren't Nb and Mb completely determined by N, M, procrows, and proccols?


EDIT

From running the example I can see that the sub-matrix on process 0 has all the elements of the upper-left corner of the matrix, just like in my picture above, which seems to contradict Jonathan's answer. However, it works fine with ScaLAPACK's Cholesky.

Respirable answered 26/6, 2015 at 15:26 Comment(11)
@VladimirF I executed the code and the sub-matrices are 4x4, while the program is called with command-line arguments that set Nb = Mb = 2. Thanks for your comment, but it doesn't make sense to me, at least at this point. :/Respirable
Well @VladimirF I can't put all the information from there, so I am not sure what to add. So then, why make them the same? I mean, I want to do this dynamically, in order to work with fewer processes for a smaller matrix. Should I make procrows = Nb? Also, are you going to post an answer?Respirable
@VladimirF I did add more information in the question, is it better now?Respirable
The NUMROC documentation contains some description. computing.llnl.gov/tutorials/allineaDDT/programs/trisol/… www-01.ibm.com/support/knowledgecenter/SSNR5K_4.2.0/… I am not sure what exactly a block is. You must study it yourself; I do not use this library at all.Korwun
Please choose a better title.Mistrial
@Jeff really hard to come up with a good one, is this better or should I undo the edit? Vladimir thanks again for trying to help. I have seen that and I do not know either. If I knew the place to study, I would.Respirable
Nb and Mb mean nothing to the average SO user. The subject of your question is CBLACS. That should be in the title. Remember to write titles that attract the appropriate expert. I saw MPI and thought I could help, but since I am not a ScaLAPACK guru, I can't help here. I would have skipped the question if I'd known it was related to CBLACS.Mistrial
I see @Jeff. Replaced MPI with CBLACS.Respirable
This isn't a question about CBLACS, BLACS, MPI, or SCALAPACK; it's a question about block-cyclic distributions of matrices. It is quite common - in fact, the usual case - for the number of processes in a row to be different than (smaller than) the number of blocks in a row, although it's legal for them to be equal.Novena
Sorry @JonathanDursi, did not know that!Respirable
I would really like to know why the downvote!Respirable

Block decompositions of matrices as you've described in your question are a perfectly valid way of distributing a matrix, but they are not the only way to do it.

In particular, a block data distribution (breaking the matrix up into procrows x proccols sub-matrices) is a little inflexible. If the matrix size is not divisible by the number of processes in a row or column - and generally you have no control over the size of the matrix, and only some flexibility with procrows/proccols - you can end up with serious load-balancing issues. Also, it's sometimes very handy to be able to "over decompose" a problem: to break it up into more pieces than you have tasks. In particular, for MPI, since each task is a process, it's sometimes useful to have multiple sub-matrices per process to operate on, so that you can tackle this additional level of parallelism with threading (which is built into most single-process linear algebra libraries).

The way to get the most flexibility for load-balancing, and the highest degree of inter-process parallelism, is a purely cyclic distribution. In a 1d cyclic distribution, say dividing 15 items up between 4 processors, processor 1 would get item 1, processor 2 would get item 2, processor 3 would get item 3, processor 4 would get item 4, and then processor 1 would get item 5, and so on; you round-robin items across the processors.

On the other hand, in a 1d block decomposition, processor 1 would get items 1-4, processor 2 would get items 5-8, and so on.

A figure from the useful LLNL parallel computing tutorial follows, with each color labelling which processor got a region of the data:

[Figure: block, cyclic, and block-cyclic data distributions in 1d and 2d, colored by owning processor]

So the cyclic decomposition is maximally good for parallelism and load-balancing, but it's terrible for data access; every neighbouring piece of data you'd want to be able to access to do linear algebra operations is off-processor. On the other hand, the block decomposition is maximally good for data access; you have as large a contiguous chunk of data as possible, so you can do matrix operations on nice big sub-matrices; but it's inflexible for parallelism and can cost in terms of load-balancing.

Block-cyclic is an interpolation between the two; you over decompose the matrix into blocks, and cyclically distribute those blocks across processes. This lets you tune the tradeoff between data access contiguity and flexibility. If the block-cyclic block sizes are 1, you have a cyclic distribution; if they are N/procrows or N/proccols, you have a block distribution; but you can also have anything in between.

Note that in 2D you can in principle choose different decompositions along the rows and columns, and sometimes that's useful if your matrix is only going to be used in one sort of computation; but the more usual case is that the decomposition is the same along all dimensions, so that when people say a "block decomposition" or "block-cyclic decomposition" they generally mean that along all dimensions.

A good description of this is on the Scalapack pages at netlib.

Novena answered 29/6, 2015 at 13:17 Comment(5)
Jonathan thanks for answering! I am struggling over here! The final link was good. However, I am still confused. The link concludes "The two-dimensional block cyclic distribution scheme is the data layout that is used in the ScaLAPACK library for dense matrix computations.". However, from running the example I linked to, the distribution seems to match the 2D panel in the upper-right corner of your figure, and it works with ScaLAPACK's Cholesky (I checked). More importantly, this is still unanswered IMHO: "aren't Nb and Mb completely determined by N, M, procrows, and proccols?"Respirable
@Respirable - No, Nb can be any integer between 1 and N/proccols and Mb can be any integer between 1 and M/procrows. Again, one is overdecomposing the matrix into smaller blocks, and then doing a cyclic distribution of these smaller blocks across processors; thus block-cyclic.Novena
I see, but even though the cmd arguments are Nb = 2 and Mb = 2, the local sub-matrices are of dimension 4x4 in the example. Aren't these the dimensions of the local matrix? Probably not...Respirable
@Respirable - My mistake; the example code is using Nb and Mb as the number of blocks rather than the sizes; so let me change that earlier comment to read: "No, Nb can be any integer between proccols and N, and Mb can be any integer between procrows and M. Again, one is overdecomposing the matrix into smaller blocks, then doing a cyclic distribution of these smaller blocks across processors; thus block-cyclic". Eg, edit the example data file to change Nb/Mb and see what happens.Novena
Jonathan the last figure in your final link was the answer. Thank you very much. To show my appreciation I also upvoted your top answer. You answered a question that had been unanswered for a long time and had also received a downvote.Respirable

© 2022 - 2024 — McMap. All rights reserved.