Missing function from GProf output

D

2

4

I'm trying to profile some C code but one of the most intuitively costly functions isn't showing up in the GProf output.

int main() {
    initialise...
    haloSwap();

    for(...) {
        functions...

        propagate();

        functions...
    }
}

void propagate() {
    for (x)
        for (y)
            for (z)
                grid[xNew][yNew][zNew] = grid[x][y][z];

    haloSwap();
}

void haloSwap() {
    // Horizontal swap
    create buffers...
    MPI_Sendrecv(buffers);
    recreate grid from buffers...

    // Vertical swap
    create buffers...
    MPI_Sendrecv(buffers);
    recreate grid from buffers...
}

Hopefully that pseudo-code goes some way to explaining the set up. haloSwap() involves a lot of communication between threads and I feel it's an expensive part of the algorithm. It's called during initialisation and then repeatedly during the loop of the algorithm.

GProf shows only 1 call to haloSwap (during init), even though I know it's called 1000+ times from inside propagate().

propagate() is showing as the most expensive part of the code, but I'd like to know whether it's the xyz loop(s) or the MPI comminucation.

Does anyone know why the calls to haloSwap from propagate are seemingly ignored in both the number of calls and the time spent in the function?

haloSwap is defined within another .c file, which may be a factor?

If I move the call of haloSwap to the main loop after calling propagate (instead of inside it), GProf still only shows 1 call to it.

Dispensary answered 19/8, 2016 at 14:43 Comment(5)

You are probably compiling with optimisations and propagate() ends up being inlined. Tell your compiler to not inline functions. For example, with GCC the option is -fno-inline. – Max 19/8, 2016 at 14:57

add to @Hristo_Iliev comment. mpicc (from MPICH )wrapper uses the -O2 optimization which includes the -finline-small-functions flag . If there are functions that you want to be excluded from any optimizations use the gcc attributes on functions for example void haloSwap() __attribute__(optimize("-O0")) – Reaper 19/8, 2016 at 16:48

If you're using MPI, then it's doing I/O, and gprof is well-known to be blind to that, because sampling is suspended during I/O. You could be spending 99% of the time in I/O, but gprof would not show it. The same is true of anything calling itself a "CPU profiler". – Moiramoirai 25/8, 2016 at 18:54

@MikeDunlavey I suppose a poor man's way to get an idea of that is to check real elapsed time versus gprof reported CPU time? – Dispensary 26/8, 2016 at 12:24

@KallumBurgin: That would show the issue. Some people use this technique which works on real time, not just CPU time. It has less precision on timing, but more precision for locating problems. – Moiramoirai 26/8, 2016 at 16:2

D

1

@Hristo_Iliev and @Angelos were correct, I was compiling with optimisations and my problem was solved by compiling with -O0. Stupid mistake, I'm sure I've made it before.

Dispensary answered 25/8, 2016 at 11:36 Comment(1)

You should not solve the problem by compiling with -O0. Profiling purpose is to give an idea on the performance of the optimized code. Use the suggestion from @Hristo_Iliev and add -fno-inline while keeping O2 or O3 optimization. There is no added value measuring code compiled with O0, it is by definition badly performing. – Longoria 20/6, 2023 at 13:43

M

5

Profiling with optimizations enabled is absolutely not a stupid mistake. It is instead the way one should profile -- otherwise you'll be wasting time and effort hand-optimizing routines the compiler could have fixed for you.

If you need a different profile report than what gprof can provide, the google performance tools (aka gperftools) might be useful to you. The reports will record time spent in an inline function. Here's an example report where you can see how the google tools report inline functions:

% ~/soft/gperftools/bin/pprof --text \
    bigio_mpi bigio-profiling/prof-ca-fill-inline1-0.out| head -10
Using local file bigio_mpi.
Using local file bigio-profiling/prof-ca-fill-inline1-0.out.
Total: 29680 samples
10837  36.5%  36.5%    24056  81.1% ADIOI_P2PContigReadAggregation
3534  11.9%  48.4%     3534  11.9% _init@3a858
2954  10.0%  58.4%     2954  10.0% unshuffle_sse2
2448   8.2%  66.6%     2448   8.2% __memcpy_ssse3
1742   5.9%  72.5%     2661   9.0% type_create_contiguous_x (inline)
1373   4.6%  77.1%     1373   4.6% shuffle_sse2
1176   4.0%  81.1%     6205  20.9% ADIOI_Type_create_hindexed_x
1104   3.7%  84.8%    19082  64.3% ADIOI_Get_eof_offset
831   2.8%  87.6%    28184  95.0% ADIOI_Flatten

Macaque answered 25/8, 2016 at 18:21 Comment(1)

I needed information on a specific function that was being inlined, which I had disregarded. – Dispensary 26/8, 2016 at 12:21

D

1

@Hristo_Iliev and @Angelos were correct, I was compiling with optimisations and my problem was solved by compiling with -O0. Stupid mistake, I'm sure I've made it before.

Dispensary answered 25/8, 2016 at 11:36 Comment(1)

You should not solve the problem by compiling with -O0. Profiling purpose is to give an idea on the performance of the optimized code. Use the suggestion from @Hristo_Iliev and add -fno-inline while keeping O2 or O3 optimization. There is no added value measuring code compiled with O0, it is by definition badly performing. – Longoria 20/6, 2023 at 13:43

Recommended topics

Hot tags