Using valgrind to spot errors in MPI code

I have a code which works perfectly in serial, but with mpirun -n 2 ./out it gives the following error:

*** Error in `./out': malloc(): smallbin double linked list corrupted: 0x00000000024aa090 ***

I tried to run valgrind as follows:

valgrind --leak-check=yes mpirun -n 2 ./out

I got the following output:

==3494== Memcheck, a memory error detector
==3494== Copyright (C) 2002-2015, and GNU GPL'd, by Julian Seward et al.
==3494== Using Valgrind-3.11.0 and LibVEX; rerun with -h for copyright info
==3494== Command: mpirun -n 2 ./out
==3494== 
Grid_0/NACA0012.msh
Grid_0/NACA0012.msh
>>> Number of cells: 7734
>>> Number of cells: 7734
0.000000  0         1.470622e-02
*** Error in `./out': malloc(): smallbin double linked list corrupted: 0x00000000024aa090 ***

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   PID 3497 RUNNING AT orhan
=   EXIT CODE: 134
=   CLEANING UP REMAINING PROCESSES
=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
YOUR APPLICATION TERMINATED WITH THE EXIT STRING: Aborted (signal 6)
This typically refers to a problem with your application.
Please see the FAQ page for debugging suggestions
==3494== 
==3494== HEAP SUMMARY:
==3494==     in use at exit: 131,120 bytes in 2 blocks
==3494==   total heap usage: 1,064 allocs, 1,062 frees, 231,859 bytes allocated
==3494== 
==3494== LEAK SUMMARY:
==3494==    definitely lost: 0 bytes in 0 blocks
==3494==    indirectly lost: 0 bytes in 0 blocks
==3494==      possibly lost: 0 bytes in 0 blocks
==3494==    still reachable: 131,120 bytes in 2 blocks
==3494==         suppressed: 0 bytes in 0 blocks
==3494== Reachable blocks (those to which a pointer was found) are not shown.
==3494== To see them, rerun with: --leak-check=full --show-leak-kinds=all
==3494== 
==3494== For counts of detected and suppressed errors, rerun with: -v
==3494== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)

I am not experienced with valgrind, but as far as I understand, valgrind saw no problem. Are there better valgrind options to spot the source of the specific error mentioned above?

Sectarian answered 18/1, 2016 at 9:48 Comment(0)

Note that with the invocation above,

valgrind --leak-check=yes mpirun -n 2 ./out

you are running valgrind on the program mpirun, which presumably has been extensively tested and works correctly, and not the program ./out, which you know to have a problem.

To run valgrind on your test program you will want to do:

mpirun -n 2 valgrind --leak-check=yes ./out

This uses mpirun to launch 2 processes, each of which runs valgrind --leak-check=yes ./out.
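
As an aside on what this can catch: an abort like "malloc(): smallbin double linked list corrupted" is typically caused by an out-of-bounds write on a heap block. Here is a minimal, hypothetical sketch of such a bug, a receive buffer sized for the serial case only (the file name and buffer sizes are invented for illustration; this is not the questioner's actual code):

/* bug_demo.c (hypothetical): runs cleanly in serial but corrupts
   the heap under mpirun -n 2.  Compile with: mpicc bug_demo.c -o out */
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* BUG: the receive buffer holds one int, but MPI_Alltoall
       delivers one int *per rank*, i.e. `size` ints.  With -n 2
       the write runs past the end of the block and tramples
       malloc's chunk bookkeeping. */
    int *send = malloc(size * sizeof(int));
    int *recv = malloc(1 * sizeof(int));
    for (int i = 0; i < size; ++i)
        send[i] = rank;

    MPI_Alltoall(send, 1, MPI_INT, recv, 1, MPI_INT, MPI_COMM_WORLD);

    free(send);
    free(recv);   /* glibc may detect the corruption here and abort */

    MPI_Finalize();
    return 0;
}

Run as mpirun -n 2 valgrind ./out, valgrind should flag an invalid write during the MPI_Alltoall call, while a serial run passes cleanly, which matches the symptom in the question.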

Verla answered 18/1, 2016 at 15:33 Comment(2)
As a side note, running valgrind on an MPI program will often generate lots of false positives to fish through; it's always worth trying to run it through valgrind as a serial program as well. The serial program doesn't actually crash, but the erroneous memory accesses that cause the error may still be occurring, just with less drastic consequences. – Verla
Also worth noting is this page from the valgrind manual on running with MPI: valgrind.org/docs/manual/mc-manual.html#mc-manual.mpiwrap – Mcgruder

You can never go wrong with a Jonathan Dursi answer, but let me just add that with more than one process it can be a pain to read valgrind output.

Instead of outputting to the console, dump it to a log file. Of course, if you dump both processes' output to the same log file, that's not going to be helpful. Instead, log to multiple files: valgrind expands '%p' to the process ID, so you get two (or more) log files:

mpiexec -np 2 valgrind --leak-check=full \
    --show-reachable=yes --log-file=nc.vg.%p ./noncontig_coll2 -fname blah
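
After the run you can then search the per-process logs for the first real error, for example (memcheck reports out-of-bounds accesses as "Invalid read"/"Invalid write"):

grep -n "Invalid" nc.vg.*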
Songsongbird answered 20/1, 2016 at 20:32 Comment(0)

You can also choose to run valgrind on the outside and have it follow all the child processes using --trace-children=yes. I do this because it is less typing.
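
For example, reusing the command from the question (a sketch; --trace-children=yes is the flag that makes valgrind follow into the processes mpirun spawns):

valgrind --leak-check=yes --trace-children=yes mpirun -n 2 ./out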

Eximious answered 14/6, 2023 at 23:43 Comment(1)
That approach only works with mpirun and on a single node. – Gerbold
