Different results between Debug and Release

My code returns different results in Debug and Release builds. I checked that both modes use /fp:precise, so that should not be the problem. The main issue is that the complete image analysis (it's an image-understanding project) is entirely deterministic; there is absolutely nothing random in it.

Stranger still, my Release build always returns the same result (23.014 for the image), while Debug returns a seemingly random value between 22 and 23, which simply should not happen. I have already checked whether it might be thread-related, but the only multi-threaded part of the algorithm returns precisely the same result in both Debug and Release.

What else may be happening here?

Update 1: This is the code I have now found responsible for the behaviour:

float PatternMatcher::GetSADFloatRel(float* sample, float* compared, int sampleX, int compX, int offX)
{
    if (sampleX != compX)
    {
        return 50000.0f;
    }
    float result = 0;

    float* pTemp1 = sample;
    float* pTemp2 = compared + offX;

    float w1 = 0.0f; // sum of sample[j]^2
    float w2 = 0.0f; // sum of sample[j] * compared[j]
    float w3 = 0.0f; // sum of compared[j]^2

    for (int j = 0; j < sampleX; j++)
    {
        w1 += pTemp1[j] * pTemp1[j];
        w2 += pTemp1[j] * pTemp2[j];
        w3 += pTemp2[j] * pTemp2[j];
    }
    float a = w2 / w3;
    result = w3 * a * a - 2 * w2 * a + w1; // algebraically w1 - w2*w2/w3
    return result / sampleX;
}

Update 2: This is not reproducible with 32-bit code. In 32-bit builds Debug and Release always produce the same value, but it differs from the 64-bit Release result, and 64-bit Debug still returns completely random values.

Update 3: I have confirmed that it is caused by OpenMP. When I disable it, everything works fine. (Both Debug and Release use the same code, and both have OpenMP enabled.)

Following is the code giving me trouble:

#pragma omp parallel for shared(last, bestHit, cVal, rad, veneOffset)
for(int r = 0; r < 53; ++r)
{
    for(int k = 0; k < 3; ++k)
    {
        for(int c = 0; c < 30; ++c)
        {
            for(int o = -1; o <= 1; ++o)
            {
                /*
                r: 2.0f - 15.0f, in 53 steps, representing the radius of blood vessel
                c: 0-29, in steps of 1, representing the absorption value (collagene)
                iO: 0-2, depending on current radius. Signifies a subpixel offset (-1/3, 0, 1/3)
                o: since we are not sure we hit the middle, move -1 to 1 pixels along the samples
                */

                int offset = r * 3 * 61 * 30 + k * 30 * 61 + c * 61 + o + (61 - (4*w+1))/2;

                if(offset < 0 || offset == fSamples.size())
                {
                    continue;
                }
                last = GetSADFloatRel(adapted, &fSamples.at(offset), 4*w+1, 4*w+1, 0);
                if(bestHit > last)
                {
                    bestHit = last;
                    rad = (r+8)*0.25f;
                    cVal = c * 2;
                    veneOffset = (-0.5f + (1.0f / 3.0f) * k + (1.0f / 3.0f) / 2.0f);
                    if(fabs(veneOffset) < 0.001)
                        veneOffset = 0.0f;
                }
                last = GetSADFloatRel(input, &fSamples.at(offset), w * 4 + 1, w * 4 + 1, 0);
                if(bestHit > last)
                {
                    bestHit = last;
                    rad = (r+8)*0.25f;
                    cVal = c * 2;
                    veneOffset = (-0.5f + (1.0f / 3.0f) * k + (1.0f / 3.0f) / 2.0f);
                    if(fabs(veneOffset) < 0.001)
                        veneOffset = 0.0f;
                }
            }
        }
    }
}

Note: in Release mode with OpenMP enabled I get the same result as with OpenMP disabled. In Debug mode, enabling OpenMP gives a different result, while disabling it gives the same result as Release.

Englis answered 14/8, 2012 at 14:14 Comment(9)
We might be able to help more if we see some code. In general, my guess is that you're using loose syntax somewhere that the normal compiler understands properly, but the debugger doesn't.Dragone
Use Valgrind to check whether you have memory corruption, which may cause non-deterministic behavior.Turnbull
Interesting. The usual Heisenbug situation is that debugging gets more reliable results.Chateau
Smells like undefined behaviour...Abdulabdulla
Release and debug are just different sets of project options - you can change options one by one until you find the ones that make your Release output match your Debug output. But we don't have enough info to tell you what's going on. Print out intermediate output, divide and conquer... 8 - )Cristiecristin
@AlexanderChertov: Adding intermediate output will likely change the results as it forces an ordering of operations onto the compiler. Insert a printf statement and the problem may go away; take it back out and the problem returns.Conchoid
@Nathan, I tend to disagree. If he inserts printfs in the middle of his calculation loops then yes, something might get reordered. But if he calls 10 numerical routines and checks the input/output then this approach may help him find out which of the routines give different results under debug and under release. If you're stuck you can try narrowing the problem down...Cristiecristin
@AlexanderChertov Hmmm...see Felix von Leitner's extensive presentation on the actual assembly produced by various c compilers (PDF link!). Modern compilers can and will heavily manipulate your code.Chateau
You have many unsynchronised accesses to shared variables inside the parallel region, last and bestHit being the most obvious ones. This is asking for problems when the code runs.Vanquish

To elaborate on my comment, this is the code that is most probably the root of your problem:

#pragma omp parallel for shared(last, bestHit, cVal, rad, veneOffset)
{
    ...
    last = GetSADFloatRel(adapted, &fSamples.at(offset), 4*w+1, 4*w+1, 0);
    if(bestHit > last)
    {

last is always assigned before it is read again, so it is a good candidate for a lastprivate variable if you really need its value from the last iteration outside the parallel region. Otherwise just make it private.
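
For instance, a minimal self-contained sketch (a toy search, not the poster's code) showing how private(last) removes the race on the temporary, while a critical section still guards the shared minimum; compile with /openmp (MSVC) or -fopenmp (GCC/Clang):

#include <cmath>
#include <cstdio>

int main()
{
    float bestHit = 1e30f;
    float last; // without private(last), all threads would share this temporary

    #pragma omp parallel for private(last)
    for (int i = 0; i < 1000; ++i)
    {
        last = std::fabs(500.0f - (float)i); // each thread has its own copy
        #pragma omp critical
        if (bestHit > last)
            bestHit = last;
    }

    printf("%f\n", bestHit); // deterministically 0.000000
    return 0;
}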

Access to bestHit, cVal, rad, and veneOffset should be synchronised by a critical region:

#pragma omp critical
if (bestHit > last)
{
    bestHit = last;
    rad = (r+8)*0.25f;
    cVal = c * 2;
    veneOffset =(-0.5f + (1.0f / 3.0f) * k + (1.0f / 3.0f) / 2.0f);
    if(fabs(veneOffset) < 0.001)
        veneOffset = 0.0f;
}

Note that by default all variables, except the counters of parallel for loops and those defined inside the parallel region, are shared, i.e. the shared clause in your case does nothing unless you also apply the default(none) clause.
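
As a toy illustration (not the poster's code), default(none) forces every variable's sharing to be spelled out, so a forgotten clause becomes a compile error instead of a silent race:

#include <cstdio>

int main()
{
    int n = 100;
    long sum = 0;

    // With default(none), omitting shared(n) or the reduction clause for
    // sum would fail to compile; nothing is shared by accident.
    #pragma omp parallel for default(none) shared(n) reduction(+ : sum)
    for (int i = 0; i < n; ++i)
        sum += i;

    printf("%ld\n", sum); // always 4950, regardless of thread count
    return 0;
}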

Another thing to be aware of: in 32-bit mode Visual Studio uses x87 FPU math, while in 64-bit mode it uses SSE math by default. The x87 FPU performs intermediate calculations at 80-bit floating-point precision (even for calculations involving only float), while the SSE unit supports only the standard IEEE single and double precisions. Introducing OpenMP, or any other parallelisation technique, into 32-bit x87 FPU code means that at certain points intermediate values must be converted back to the single precision of float, and if that happens often enough, a slight or significant difference (depending on the numerical stability of the algorithm) can appear between the serial and parallel results.
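
As a toy demonstration of that precision effect (not the poster's code): a long float accumulation can print different values depending on whether the compiler keeps the accumulator in an 80-bit x87 register or rounds every operation to 32-bit SSE precision. On MSVC, building the 32-bit target with /arch:SSE2 makes it use SSE math like the 64-bit build.

#include <cstdio>

int main()
{
    // Under x87 the running sum may be held at 80-bit precision between
    // iterations; under SSE each addition is rounded to 32-bit float.
    // The x86 and x64 builds can therefore print different values.
    float sum = 0.0f;
    for (int i = 1; i <= 1000000; ++i)
        sum += 1.0f / (float)i;
    printf("%.7f\n", sum);
    return 0;
}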

Based on your code, I would suggest the following modification, which should give you good parallel performance because there is no synchronisation at each iteration:

#pragma omp parallel private(last)
{
    int rBest = 0, kBest = 0, cBest = 0;
    float myBestHit = bestHit;

    #pragma omp for
    for(int r = 0; r < 53; ++r)
    {
        for(int k = 0; k < 3; ++k)
        {
            for(int c = 0; c < 30; ++c)
            {
                for(int o = -1; o <= 1; ++o)
                {
                    /*
                    r: 2.0f - 15.0f, in 53 steps, representing the radius of blood vessel
                    c: 0-29, in steps of 1, representing the absorption value (collagene)
                    iO: 0-2, depending on current radius. Signifies a subpixel offset (-1/3, 0, 1/3)
                    o: since we are not sure we hit the middle, move -1 to 1 pixels along the samples
                    */

                    int offset = r * 3 * 61 * 30 + k * 30 * 61 + c * 61 + o + (61 - (4*w+1))/2;

                    if(offset < 0 || offset >= (int)fSamples.size()) // >= also guards offsets past the end, which would make .at() throw
                    {
                        continue;
                    }
                    last = GetSADFloatRel(adapted, &fSamples.at(offset), 4*w+1, 4*w+1, 0);
                    if(myBestHit > last)
                    {
                        myBestHit = last;
                        rBest = r;
                        cBest = c;
                        kBest = k;
                    }
                    last = GetSADFloatRel(input, &fSamples.at(offset), w * 4 + 1, w * 4 + 1, 0);
                    if(myBestHit > last)
                    {
                        myBestHit = last;
                        rBest = r;
                        cBest = c;
                        kBest = k;
                    }
                }
            }
        }
    }
    #pragma omp critical
    if (bestHit > myBestHit)
    {
        bestHit = myBestHit;
        rad = (rBest+8)*0.25f;
        cVal = cBest * 2;
        veneOffset = (-0.5f + (1.0f / 3.0f) * kBest + (1.0f / 3.0f) / 2.0f);
        if(fabs(veneOffset) < 0.001)
            veneOffset = 0.0f;
    }
}

It only stores the values of the parameters that give the best hit in each thread, and at the end of the parallel region it computes rad, cVal and veneOffset from those best values. Now there is only one critical region, and it sits at the very end of the code. You can get rid of that too, but you would have to introduce additional arrays, as sketched below.
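
A sketch of that array-based variant (hypothetical Hit struct; one slot per thread, so no critical section is needed at all):

#include <cmath>
#include <omp.h>
#include <vector>

struct Hit { float bestHit; int r, c, k; };

// Inside the search routine, with the same bestHit, rad, cVal, veneOffset
// and nested loops as above:
Hit init = { 1e30f, 0, 0, 0 };
std::vector<Hit> hits(omp_get_max_threads(), init);

#pragma omp parallel
{
    Hit& mine = hits[omp_get_thread_num()]; // each thread writes only its own slot

    #pragma omp for
    for (int r = 0; r < 53; ++r)
    {
        // ... same nested k/c/o loops; on a better hit update
        // mine.bestHit, mine.r, mine.c and mine.k ...
    }
}

// Sequential reduction outside the parallel region; no synchronisation needed.
for (size_t t = 0; t < hits.size(); ++t)
{
    if (hits[t].bestHit < bestHit)
    {
        bestHit = hits[t].bestHit;
        rad = (hits[t].r + 8) * 0.25f;
        cVal = hits[t].c * 2;
        veneOffset = -0.5f + (1.0f / 3.0f) * hits[t].k + (1.0f / 3.0f) / 2.0f;
        if (fabs(veneOffset) < 0.001)
            veneOffset = 0.0f;
    }
}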

Vanquish answered 15/8, 2012 at 13:00 Comment(6)
Thanks, declaring last as private did it, now I get the same results between release and debug mode!Englis
@AntonRoth, did you also add the critical sections? Without them you get no guarantee that data races won't occur.Vanquish
Yes, I did, but in 20 tries it never made any difference to the result. Actually the performance WITH the #pragma omp critical is a lot worse than having it single-threaded in the first place.Englis
Yes, critical sections add synchronisation overhead. What you can do is store only the values of last, r, c and k that give the best hit in each thread in a shared array (do it at the end of the parallel region; the array should have one element per thread; make bestHit private), then outside the parallel region examine the array and compute rad, cVal and veneOffset based on the values from the thread that has the best bestHit value.Vanquish
@AntonRoth, I've added sample code showing how to get around synchronising the access to the shared variables at each iteration. Note that "for 20 tries, it never made any difference regarding the result" is different from "it would NEVER give a different result".Vanquish
Ah, nice. I've always used manual threading, and am not really too familiar with the OpenMP threading. Thanks a lot!Englis

At least two possibilities:

  1. Turning on optimization may result in the compiler reordering operations. This can introduce small differences in floating-point calculations when compared to the order executed in debug mode, where operation reordering does not occur. This may account for numerical differences between debug and release, but does not account for numerical differences from one run to the next in debug mode.
  2. You have a memory-related bug in your code, such as reading/writing past the bounds of an array, using an uninitialized variable, using an unallocated pointer, etc. Try running it through a memory checker, such as the excellent Valgrind, to identify such problems. Memory related errors may account for non-deterministic behavior.

If you are on Windows, then Valgrind isn't available (pity), but you can look here for a list of alternatives.
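
On Linux, a typical Memcheck run looks like this (Memcheck is Valgrind's default tool; --track-origins helps trace where an uninitialised value came from; ./your_app is a placeholder for the binary):

valgrind --track-origins=yes ./your_app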

Conchoid answered 14/8, 2012 at 14:18 Comment(6)
I have now turned off optimization completely in Release mode, and I get the same random results there. Why would full optimization produce a deterministic result, while Debug gives me a random return value?Englis
The first thing I check when encountering non-deterministic behavior (and I'm not using random numbers) is memory errors. They are a giant pain to track down without the right tools (I used to spend days finding them before I had proper memory debugging tools).Conchoid
@AntonRoth It's usually the reverse, but it's possible that the optimizer eliminates certain calculations because it "knows" the results, whereas without optimization it doesn't. And if those calculations use an uninitialized value somewhere...Smog
@AntonRoth Another possibility is that some code is ill-behaved and has unintended side-effects. Reordering the operations probably doesn't eliminate the side-effects, but it may move them to a point in the calculation where they aren't detrimental to the result.Conchoid
I now ran the application through Microsoft's Application Verifier, and it reported 0 errors, 0 warnings. Interesting thing: running it as 32-bit again gives a different value (23.009), but this time deterministic for both Debug and Release.Englis
@AntonRoth There is certainly nothing in the code you posted that should lead to non-determinism in debug mode when called repeatedly with the same inputs. There must be something outside of this loop that's causing non-determinism. I would verify that the input is the same on all calls. Going from debug to optimized, the compiler may unroll your loop, which will cause the w* add operations to be reordered and result in some amount of floating point value differences, but not non-determinism.Conchoid

One thing to double-check is that all variables are initialized. Unoptimized (Debug) builds will often initialize memory for you, hiding uninitialized-variable bugs.
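
A toy example of the effect (assuming MSVC: its debug runtime fills fresh stack and heap memory with marker patterns such as 0xCC/0xCD, and /RTC1 can even trap the read, while Release leaves whatever bytes were there):

#include <cstdio>

int main()
{
    float f;           // deliberately uninitialised: undefined behaviour
    printf("%f\n", f); // Debug: often a recognisable fill-pattern value,
                       // or a runtime-check failure; Release: whatever was
                       // in that register/stack slot, varying run to run
    return 0;
}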

Calends answered 14/8, 2012 at 14:20 Comment(0)

I would have said variable initialization in Debug versus none in Release, but your results do not back this up (you get the reliable result in Release).

Does your code rely on any specific offsets or sizes? A Debug build places guard bytes around some allocations.

Could it be floating-point related?

The Debug floating-point stack handling is different from Release, which is built for more efficiency.

Look here: http://thetweaker.wordpress.com/2009/08/28/debugrelease-numerical-differences/

Mele answered 14/8, 2012 at 14:26 Comment(0)

Just about any undefined behavior can account for this: uninitialized variables, rogue pointers, multiple modifications of the same object without an intervening sequence point, and so on. The fact that the results are at times unreproducible argues somewhat for an uninitialized variable, but it could also stem from pointer problems or bounds errors.

Be aware that optimization can change results, especially on Intel hardware. Optimization can change which intermediate values spill to memory, and if you have not used parentheses carefully, even the order of evaluation in an expression. (And as we all know, in machine floating point, (a + b) + c != a + (b + c).) Still, the results should be deterministic: you will get different results depending on the degree of optimization, but for any given set of optimization flags you should get the same results.
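
A quick demonstration of that non-associativity:

#include <cstdio>

int main()
{
    float a = 1e20f, b = -1e20f, c = 1.0f;
    printf("%f\n", (a + b) + c); // 1.000000: the huge terms cancel first
    printf("%f\n", a + (b + c)); // 0.000000: c is absorbed by b before cancelling
    return 0;
}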

Smog answered 14/8, 2012 at 14:33 Comment(0)
