I have a research-related question.
I have just finished implementing a structured skeleton framework based on MPI (specifically Open MPI 1.6.3). The framework is meant to be used on a single machine. Now I am comparing it with earlier skeleton implementations (such as Skandium, FastFlow, ..)
One thing I have noticed is that the performance of my implementation is not as good as that of the other implementations. I think this is because my implementation is based on MPI (and therefore on two-sided communication, which requires matching send and receive operations), while the implementations I am comparing against are based on shared memory. (... but I still have no good explanation for this, and it is part of my question.)
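To illustrate what I mean by two-sided communication, here is a minimal sketch (my own toy example, not code from my framework): every transfer needs a matching MPI_Send/MPI_Recv pair, and the payload is copied between two private address spaces.

    /* Two-sided exchange: the send must be matched by a receive,
     * and the data is copied between the processes' address spaces. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        int rank, value = 0;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        if (rank == 0) {
            value = 42;
            MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);  /* copy out */
        } else if (rank == 1) {
            MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);                          /* copy in */
            printf("rank 1 received %d\n", value);
        }
        MPI_Finalize();
        return 0;
    }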
There is a big difference in completion time between the two categories.
Today I was also introduced to the shared-memory configuration of Open MPI here => openmpi-sm
And here come my questions.
1st: What does it mean to configure MPI for shared memory? I mean, while MPI processes live in their own virtual address spaces, what does a flag like the one in the following command really do? (I thought that in MPI every communication happens by explicitly passing a message, and that no memory is shared between processes.)
shell$ mpirun --mca btl self,sm,tcp -np 16 ./a.out
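To make the question concrete, these are the two runs I would compare (assuming ./a.out is the same benchmark binary); the only difference is whether the sm byte-transfer layer is listed:

shell$ mpirun --mca btl self,tcp -np 16 ./a.out
shell$ mpirun --mca btl self,sm,tcp -np 16 ./a.out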
2nd: Why is the performance of MPI so much worse compared to the other skeleton implementations developed for shared memory? After all, I am also running it on a single multi-core machine. (I suppose it is because the other implementations use thread-based parallel programming, but I have no convincing explanation for that.)
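To make the contrast concrete, here is a minimal sketch (my own illustration, not code from any of the frameworks I mentioned) of the kind of direct sharing a thread-based skeleton can rely on: all workers touch the same buffer, with no send/receive matching and no copies between address spaces.

    /* Threads share one address space: workers write to a common
     * buffer in place; no messages, no matching, no extra copies. */
    #include <pthread.h>
    #include <stdio.h>

    #define N 8

    static int shared[N];          /* one buffer, visible to every thread */

    static void *worker(void *arg) {
        int id = *(int *)arg;
        shared[id] = id * id;      /* write directly, no message involved */
        return NULL;
    }

    int main(void) {
        pthread_t tid[N];
        int ids[N];
        for (int i = 0; i < N; i++) {
            ids[i] = i;
            pthread_create(&tid[i], NULL, worker, &ids[i]);
        }
        for (int i = 0; i < N; i++)
            pthread_join(tid[i], NULL);
        for (int i = 0; i < N; i++)
            printf("%d ", shared[i]);
        printf("\n");
        return 0;
    }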
Any suggestion or further discussion is very welcome.
Please let me know if I need to clarify my question further.
Thank you for your time!