Do Fortran 95 constructs such as WHERE, FORALL and SPREAD generally result in faster parallel code?

Asked 8/11, 2010 at 7:54 Answered 17/3, 2012 at 13:35

I have read through the Fortran 95 book by Metcalf, Reid and Cohen, and Numerical Recipes in Fortran 90. They recommend using WHERE, FORALL and SPREAD amongst other things to avoid unnecessary serialisation of your program.

However, I stumbled upon this answer which claims that FORALL is good in theory, but pointless in practice - you might as well write loops as they parallelise just as well and you can explicitly parallelise them using OpenMP (or automatic features of some compilers such as Intel).

Can anyone verify from experience whether they have generally found these constructs to offer any advantages over explicit loops and if statements in terms of parallel performance?

And are there any other parallel features of the language which are good in principle but not worth it in practice?

I appreciate that the answers to these questions are somewhat implementation-dependent, so I'm most interested in gfortran, Intel CPUs and SMP parallelism.

Diligent answered 8/11, 2010 at 7:54 Comment(0)

As I said in my answer to the other question, there is a general belief that FORALL has not been as useful as was hoped when it was introduced to the language. As already explained in other answers, it has restrictive requirements and a limited role, and compilers have become quite good at optimizing regular loops. Compilers keep getting better, and capabilities vary from compiler to compiler. Another clue is that Fortran 2008 is trying again: besides adding explicit parallelization to the language (co-arrays, already mentioned), there is also "do concurrent", a new loop form that imposes restrictions which should better allow the compiler to perform automatic parallelization, yet should be sufficiently general to be useful; see ftp://ftp.nag.co.uk/sc22wg5/N1701-N1750/N1729.pdf.

In terms of obtaining speed, mostly I select good algorithms and program for readability & maintainability. Only if the program is too slow do I locate the bottlenecks and recode or implement multi-threading (OpenMP). It will be a rare case where FORALL or WHERE versus an explicit do loop will have a meaningful speed difference -- I'd look more to how clearly they state the intent of the program.
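To make the comparison concrete, here is a minimal sketch of the same element-wise update written with FORALL and with DO CONCURRENT. The program and variable names are made up for illustration, and whether either form is actually parallelised or vectorised depends entirely on the compiler and its flags.

```fortran
program do_concurrent_sketch
  implicit none
  integer, parameter :: n = 8
  integer :: i
  real :: a(n), b(n), c(n)

  b = [(real(i), i = 1, n)]

  ! FORALL (Fortran 95): an indexed array assignment, not a loop.
  forall (i = 1:n) a(i) = 2.0 * b(i)

  ! DO CONCURRENT (Fortran 2008): a loop whose iterations the programmer
  ! asserts are independent, so the compiler may run them in any order.
  do concurrent (i = 1:n)
    c(i) = 2.0 * b(i)
  end do

  ! Both forms perform the same arithmetic, so the results should match.
  if (any(a /= c)) error stop "FORALL and DO CONCURRENT results differ"
  print *, c
end program do_concurrent_sketch
```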

Bouzoun answered 10/11, 2010 at 5:43 Comment(0)

I've looked shallowly into this and, sad to report, generally find that writing my loops explicitly results in faster programs than the parallel constructs you write about. Even simple whole-array assignments such as A = 0 are generally outperformed by do-loops.

I don't have any data to hand, and if I did it would be out of date. I really ought to pull all this into a test suite and try again; compilers do improve (sometimes they get worse too).

I do still use the parallel constructs, especially whole-array operations, when they are the most natural way to express what I'm trying to achieve. I haven't ever tested these constructs inside OpenMP workshare constructs. I really ought to.
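For reference, this is roughly what whole-array operations inside an OpenMP workshare construct look like. It is only a sketch with made-up names and sizes, not a benchmark, and whether an implementation really divides the work among threads is implementation-dependent. Compiled without OpenMP enabled (for example without `-fopenmp` in gfortran), the directives are treated as comments and the code runs serially.

```fortran
program workshare_sketch
  implicit none
  integer, parameter :: n = 100000
  integer :: i
  real, allocatable :: a(:), b(:)

  allocate(a(n), b(n))
  do i = 1, n
    b(i) = real(i)
  end do

  ! Inside WORKSHARE the implementation may split each array statement
  ! into units of work that are shared among the threads of the team.
  !$omp parallel workshare
  a = sqrt(b)
  where (a > 100.0) a = 100.0
  !$omp end parallel workshare

  ! Every element has been clipped to at most 100.0 by the WHERE statement.
  print *, maxval(a)
end program workshare_sketch
```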

Duwalt answered 8/11, 2010 at 9:16 Comment(1)

I didn't ask about whole array operations because in many cases they make code clearer so even without performance gain I'd use them anyway. Spread creates an extra dimension along an array and copies the array along it: liv.ac.uk/HPC/HTMLF90Course/HTMLF90CourseNotesnode259.html. Regarding performance tests, I am less interested in optimising a particular case, and more interested in finding the best general approach to start with before I start optimising. – Diligent 8/11, 2010 at 9:51
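A small sketch of what SPREAD does, with illustrative names: a rank-1 array is copied along a newly inserted first dimension, so each row of the result is a copy of the original vector.

```fortran
program spread_sketch
  implicit none
  integer :: v(3), m(2, 3), k

  v = [1, 2, 3]

  ! Insert a new first dimension and copy v along it twice:
  ! the result has shape (2, 3) and m(k, :) equals v for k = 1, 2.
  m = spread(v, dim=1, ncopies=2)

  do k = 1, 2
    print *, m(k, :)
  end do
end program spread_sketch
```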

FORALL is a generalised masked assignment statement (as is WHERE). It is not a looping construct.
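The difference from a loop shows up when the right-hand side reads elements that the statement also writes. In this sketch (names are illustrative), FORALL evaluates every right-hand side before storing anything, while the DO loop propagates each new value into the next iteration.

```fortran
program forall_semantics_sketch
  implicit none
  integer, parameter :: n = 5
  integer :: i, a(n), b(n)

  a = [1, 2, 3, 4, 5]
  b = a

  ! FORALL: all right-hand sides are evaluated first, then all elements
  ! are assigned, so a becomes 1 1 2 3 4 (a shift by one position).
  forall (i = 2:n) a(i) = a(i-1)

  ! DO loop: each iteration sees the previous iteration's store,
  ! so b becomes 1 1 1 1 1.
  do i = 2, n
    b(i) = b(i-1)
  end do

  print *, a
  print *, b
end program forall_semantics_sketch
```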

Compilers can parallelise FORALL/WHERE using SIMD instructions (SSE2, SSE3, etc.), which is very useful for getting a bit of low-level parallelisation. Of course, some poorer compilers don't bother and just serialise the code as a loop.

OpenMP and MPI are more useful at a coarser level of granularity.

Storfer answered 17/3, 2012 at 13:35 Comment(0)

In theory, using such assignments lets the compiler know what you want to do and should allow it to optimize it better. In practice, see the answer from Mark... I also think it's useful if the code looks cleaner that way. I have used things such as FORALL myself a couple of times, but didn't notice any performance changes over regular DO loops.

As for optimization, what kind of parallelism do you intend to use? I very much dislike OpenMP, but I guess if you intend to use that, you should test these constructs first.

Rattlepate answered 9/11, 2010 at 11:29 Comment(2)

I have used OpenMP in the past and was able to get linear speedup for some of my problems, at least on a small number of CPUs. This seems to necessitate using DO rather than FORALL, thus rendering this construct a bit useless. If you don't like OpenMP, what other method would you use to parallelise loops? – Diligent 9/11, 2010 at 12:26
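For context, the explicit-loop approach referred to here looks something like the sketch below. The names and array size are made up, and any speedup depends on the problem and the machine. With gfortran this would be built with `-fopenmp`; without that flag the directives are ignored and the loop runs serially.

```fortran
program omp_do_sketch
  implicit none
  integer, parameter :: n = 1000000
  integer :: i
  real, allocatable :: x(:), y(:)

  allocate(x(n), y(n))
  x = 1.0

  ! The loop iterations are divided among the threads of the team;
  ! the loop index i is private to each thread by default.
  !$omp parallel do
  do i = 1, n
    y(i) = 2.0 * x(i) + 1.0
  end do
  !$omp end parallel do

  print *, sum(y)
end program omp_do_sketch
```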

Well, I prefer MPI: it's more scalable, and with OpenMP I ran into trouble on more complex parallel tasks keeping track of what's local to the parallel region and what's not. The upside of MPI for me is that it is much easier to think about and implement parallel routines. So, for my brain, OpenMP is only usable for the simplest of routines. – Rattlepate 9/11, 2010 at 12:45

This should be a comment, not an answer, but it won't fit into that little box, so I'm putting it here. Don't hold it against me :-) Anyway, to continue somewhat from @steabert's comment on his answer: OpenMP and MPI are two different things; one rarely gets to choose between the two, since it's dictated more by your architecture than by personal choice. As far as learning the concepts of parallelism goes, I would recommend OpenMP any day; it is simpler, and one can easily make the transition to MPI later on.

But that's not what I wanted to say. This is: a few days ago, Intel announced that it has started supporting Co-Arrays, an F2008 feature previously only supported by g95. I don't intend to put down g95, but the fact remains that Intel's compiler is more widely used for production code, so this is definitely an interesting line of development. They also changed some things in their Visual Fortran Compiler (the name, for a start :-)

More info after the link: http://software.intel.com/en-us/articles/intel-compilers/

Gnu answered 10/11, 2010 at 1:32 Comment(1)

I disagree with "one rarely gets to choose between the two since it's more dictated by your architecture than personal choice", since I believe that MPI is more architecture-independent than OpenMP. For the latter, you are stuck with shared-memory architectures. – Rattlepate 14/3, 2011 at 8:47
