Profiling help required

Asked 15/6, 2010 at 16:51 Answered 8/3, 2013 at 21:26

Solved c++c visual-studio-2008 intel-vtune

I have a profiling issue - imagine I have the following code...

void main()
{
    well_written_function();
    badly_written_function();
}
void well_written_function()
{
    for (a small number)
    {
        highly_optimised_subroutine();
    }
}
void badly_written_function()
{
    for (a wastefully and unnecessarily large number)
    {
        highly_optimised_subroutine();
    }
}
void highly_optimised_subroutine()
{
    // lots of code
}

If I run this under vtune (or other profilers) it is very hard to spot that anything is wrong. All the hotspots will appear in the section marked "// lots of code" which is already optimised. The badly_written_function() will not be highlighted in any way even though it is the cause of all the trouble.

Is there some feature of vtune that will help me find the problem?

Is there some sort of mode whereby I can find the time taken by badly_written_function() and all of its sub-functions?

Yellowhammer answered 15/6, 2010 at 16:51 Comment(6)

That's usually known as a "callgraph profile", and I'm fairly sure Visual Studio will do that. Although my memory is a bit hazy since it's been several years since I've done any Windows development. – Towel 15/6, 2010 at 17:0

I'm pretty sure vtune allows a breakdown where you can see and sort by the total time consumed including the functions that it calls. However, for that to be of use, you need a reasonable intuituion of how much time should be spent in which function. – Unpretentious 15/6, 2010 at 17:6

@torak: can remember the name of the feature? - I do have a rough grasp of how long certain functions should take so I am sure that such a breakdown would be very useful indeed. – Yellowhammer 15/6, 2010 at 17:11

The world would be a better place if Valgrind worked in Windows. – Trainload 15/6, 2010 at 17:57

@Mick: Sorry, can't recall off the top of my head. I'll try to remember to fire up VTune and check when I get home. – Unpretentious 15/6, 2010 at 18:23

@Mike Seymour: Yes, it looks like "call graph profile" is the concept I was looking for. I have just found it in the vtune help files and am playing with it right now. You should put your suggestion as an answer rather than a comment and I will mark it as correct so you can get the reputation points. – Yellowhammer 16/6, 2010 at 9:27

That's usually known as a "callgraph profile", and I'm fairly sure Visual Studio will do that.

Towel answered 16/6, 2010 at 10:51 Comment(0)

Rolling your own very simple profiler is not that hard. Insert into main():

int main()
{
    profileCpuUsage(1);                 // start timer #1
    well_written_function();
    profileCpuUsage(2);                 // stop timer #1, and start timer #2
    badly_written_function();
    profileCpuUsage(-1);                // print stats for timers #1 and #2
    return 0;
}

where:

#define NUMBER(a) ((int)(sizeof(a) / sizeof(a)[0]))

void profileCpuUsage(int slice)
{
    static struct {
        int iterations;
        double elapsedTime;
    } slices[30];                             // 0 is a don't care slice

    if (slice < 0) {                          // -1 = print
        if (slices[0].iterations)
            for (slice = 1; slice < NUMBER(slices); slice++)
                printf("Slice %2d  Iterations %7d  Seconds %7.3f\n", slice,
                    slices[slice].iterations, slices[slice].elapsedTime);
    }
    else {
        static int i;                         // = previous slice
        static double t;                      // = previous t1
        const double t1 = realElapsedTime();  // see below for definition
        assert (slice < NUMBER(slices));
        slices[i].iterations  += 1;
        slices[i].elapsedTime += t1 - t;      // i = 0 first time through
        i = slice;
        t = t1;
    }
}

Now admittedly in your simple example using this profileCpuUsage() doesn't add much benefit. And it has disadvantage of requiring you to manually instrument your code by calling profileCpuUsage() at suitable locations.

But advantages include:

You can time any code fragment, not just procedures.
It is quick to add and delete, as you do a binary search to find and/or remove code hotspots.
It focuses only on the code you are interested in.
Portable!
KISS

One tricky non-portable thing is to define the function realElapsedTime() so that it provides enough granularity to get valid times. This generally works for me (using the Windows API under CYGWIN):

#include <windows.h>
double realElapsedTime(void)   // <-- granularity about 50 microsec on test machines
{
    static LARGE_INTEGER freq, start;
    LARGE_INTEGER count;
    if (!QueryPerformanceCounter(&count))
        assert(0 && "QueryPerformanceCounter");
    if (!freq.QuadPart) {      // one time initialization
        if (!QueryPerformanceFrequency(&freq))
            assert(0 && "QueryPerformanceFrequency");
        start = count;
    }
    return (double)(count.QuadPart - start.QuadPart) / freq.QuadPart;
}

For straight Unix there is the common:

double realElapsedTime(void)                      // returns 0 first time called
{
    static struct timeval t0;
    struct timeval tv;
    gettimeofday(&tv, 0);
    if (!t0.tv_sec)
        t0 = tv;
    return tv.tv_sec - t0.tv_sec + (tv.tv_usec - t0.tv_usec) / 1000000.;
}

realElapsedTime() gives wall-clock time, not process time, which is usually what I want.

There are also other less-portable methods to achieve finer granularity using RDTSC; see for example http://en.wikipedia.org/wiki/Time_Stamp_Counter, and its links, but I haven't tried these.

Edit: ravenspoint's very nice answer seems to be not too dissimilar from mine. And his answer uses nice descriptive strings, rather than just ugly numbers, which I was often frustrated with. But this can be fixed with only about a dozen extra lines (but this almost doubles the line count!).

Note that we want to avoid any usage of malloc(), and I'm even a bit dubious about strcmp(). So the number of slices is never increased. And hash collisions are simply flagged it rather being resolved: the human profiler can fix this by manually bumping up the number of slices from 30, or by changing the description. Untested

static unsigned gethash(const char *str)    // "djb2", for example 
{
    unsigned c, hash = 5381;
    while ((c = *str++))
        hash = ((hash << 5) + hash) + c;    // hash * 33 + c 
    return hash;
}

void profileCpuUsage(const char *description)
{
    static struct {
        int iterations;
        double elapsedTime;
        char description[20];               // added!
    } slices[30];

    if (!description) {
        // print stats, but using description, mostly unchanged...
    }
    else {
        const int slice = gethash(description) % NUMBER(slices);
        if (!slices[slice].description[0]) { // if new slice
            assert(strlen(description) < sizeof slices[slice].description);
            strcpy(slices[slice].description, description);
        }
        else if (!!strcmp(slices[slice].description, description)) {
            strcpy(slices[slice].description, "!!hash conflict!!");
        }
        // remainder unchanged...
    }
}

And another point is that typically you'll want to disable this profiling for release versions; this also applies to ravenspoint's answer. This can be done by the trick of using an evil macro to define it away:

#define profileCpuUsage(foo)                // = nothing

If this is done, you will of course need to add parentheses to the definition to disable the disabling macro:

void (profileCpuUsage)(const char *description)...

Irtysh answered 16/6, 2010 at 7:46 Comment(0)

That's usually known as a "callgraph profile", and I'm fairly sure Visual Studio will do that.

Towel answered 16/6, 2010 at 10:51 Comment(0)

May I suggest my own open source profiler raven::set::cRunWatch? It is designed for exactly this problem and works on Windows using Visual Studio 2008 Standard Edition, so you do not need to pay for the version that has the profiler included.

I have taken your code, rearranged it slightly so it compiles without forward declarations and added the necessary calls to cRunWatch

// RunWatchDemo.cpp : Defines the entry point for the console application.
//

#include "stdafx.h"

void highly_optimised_subroutine()
{
    raven::set::cRunWatch runwatch("highly_optimised_subroutine");
    Sleep( 2 );
}


void badly_written_function()
{
    raven::set::cRunWatch runwatch("badly_written_function");
    for (int k = 1; k < 1000; k++ )
    {
        highly_optimised_subroutine();
    }
}

void well_written_function()
{
    raven::set::cRunWatch runwatch("well_written_function");
   for (int k = 1; k < 10; k++ )
    {
        highly_optimised_subroutine();
    }
}


int _tmain(int argc, _TCHAR* argv[])
{
raven::set::cRunWatch::Start();

    well_written_function();
    badly_written_function();

raven::set::cRunWatch::Report();

    return 0;
}

When run this produces the output

raven::set::cRunWatch code timing profile
                    Scope   Calls       Mean (secs)     Total
highly_optimised_subroutine     1008    0.002921        2.944146
   badly_written_function        1      2.926662        2.926662
    well_written_function        1      0.026239        0.026239

This shows badly_written_function to be the very close second time user, and therefore the culprit.

You can obtain cRunWatch from here You will recognize the sample code in the User Guide :-)

Saphra answered 16/6, 2010 at 14:20 Comment(2)

It does require boost. Almost every program I write uses boost in one way or another. Boost is free and incredibly useful. – Saphra 16/6, 2010 at 21:11

@Joseph Quinsey Then I expect you are also missing the high-resolution performance counter, which is the beating heart of cRunWatch. – Saphra 17/6, 2010 at 1:40

Generally, this is something where you want to observe function's total time as opposed to self time, to make sure that you are looking at the time which includes the time of the called functions.

In VTune, I would recommend using Top-down tab for that. Or, even better and if you are using a latest update, try out the new experimental Caller-Callee view. You can get details on this here - http://software.intel.com/en-us/forums/topic/376210. It gets a flat list of functions with their total times so that you can view what are the top time-consuming subtrees in your program.

Guise answered 8/3, 2013 at 21:26 Comment(0)

Recommended topics

Hot tags