Can I change this macro to an inline function without a performance hit?
(EDIT: Let's title this, "Lessons in how measurements can go wrong." I still haven't figured out exactly what's causing the discrepancy though.)

I found a very fast integer square root function here by Mark Crowne. At least with GCC on my machine, it's clearly the fastest integer square root function I've tested (including the functions in Hacker's Delight, this page, and floor(sqrt()) from the standard library).

After cleaning up the formatting a bit, renaming a variable, and using fixed-width types, it looks like this:

static uint32_t mcrowne_isqrt(uint32_t val)
{
    uint32_t temp, root = 0;

    if (val >= 0x40000000)
    {
        root = 0x8000;
        val -= 0x40000000;
    }

    #define INNER_ISQRT(s)                              \
    do                                                  \
    {                                                   \
        temp = (root << (s)) + (1 << ((s) * 2 - 2));    \
        if (val >= temp)                                \
        {                                               \
            root += 1 << ((s)-1);                       \
            val -= temp;                                \
        }                                               \
    } while(0)

    INNER_ISQRT(15);
    INNER_ISQRT(14);
    INNER_ISQRT(13);
    INNER_ISQRT(12);
    INNER_ISQRT(11);
    INNER_ISQRT(10);
    INNER_ISQRT( 9);
    INNER_ISQRT( 8);
    INNER_ISQRT( 7);
    INNER_ISQRT( 6);
    INNER_ISQRT( 5);
    INNER_ISQRT( 4);
    INNER_ISQRT( 3);
    INNER_ISQRT( 2);

    #undef INNER_ISQRT

    temp = root + root + 1;
    if (val >= temp)
        root++;
    return root;
}
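For anyone reading along: the fourteen unrolled macro invocations, plus the 0x40000000 special case at the top and the `root + root + 1` correction at the bottom, are just the s = 16 down to s = 1 steps of the standard digit-by-digit method. So the whole function is equivalent to this plain loop (my sketch, slower than the unrolled original but handy for sanity-checking results):

```cpp
#include <cstdint>

//  Loop form of the same digit-by-digit algorithm: the s = 16 step is the
//  0x40000000 special case above, and the s = 1 step is the final
//  "temp = root + root + 1" correction.
static uint32_t loop_isqrt(uint32_t val)
{
    uint32_t root = 0;
    for (int s = 16; s >= 1; --s)
    {
        const uint32_t temp = (root << s) + (1u << (2 * s - 2));
        if (val >= temp)
        {
            root += 1u << (s - 1);
            val -= temp;
        }
    }
    return root;
}
```

The compile-time shift constants and full unrolling are exactly what the macro buys over this loop, but the two should agree on every 32-bit input.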

The INNER_ISQRT macro isn't too evil, since it's local and immediately undefined after it's no longer needed. Nevertheless, I'd still like to convert it to an inline function as a matter of principle. I've read assertions in several places (including the GCC documentation) that inline functions are "just as fast" as macros, but I've had trouble converting it without a speed hit.

My current iteration looks like this (note the always_inline attribute, which I threw in for good measure):

static inline void inner_isqrt(const uint32_t s, uint32_t& val, uint32_t& root) __attribute__((always_inline));
static inline void inner_isqrt(const uint32_t s, uint32_t& val, uint32_t& root)
{
    const uint32_t temp = (root << s) + (1 << ((s << 1) - 2));
    if(val >= temp)
    {
        root += 1 << (s - 1);
        val -= temp;
    }
}

//  Note that I just now changed the name to mcrowne_inline_isqrt, so people can compile my full test.
static uint32_t mcrowne_inline_isqrt(uint32_t val)
{
    uint32_t root = 0;

    if(val >= 0x40000000)
    {
        root = 0x8000; 
        val -= 0x40000000;
    }

    inner_isqrt(15, val, root);
    inner_isqrt(14, val, root);
    inner_isqrt(13, val, root);
    inner_isqrt(12, val, root);
    inner_isqrt(11, val, root);
    inner_isqrt(10, val, root);
    inner_isqrt(9, val, root);
    inner_isqrt(8, val, root);
    inner_isqrt(7, val, root);
    inner_isqrt(6, val, root);
    inner_isqrt(5, val, root);
    inner_isqrt(4, val, root);
    inner_isqrt(3, val, root);
    inner_isqrt(2, val, root);

    const uint32_t temp = root + root + 1;
    if (val >= temp)
        root++;
    return root;
}

No matter what I do, the inline function is always slower than the macro. The macro version commonly times at around 2.92s for (2^28 - 1) iterations with an -O2 build, whereas the inline version commonly times at around 3.25s. EDIT: I said 2^32 - 1 iterations before, but I forgot that I had changed it. They take quite a bit longer for the full gamut.

It's possible that the compiler is just being stupid and refusing to inline it (note again the always_inline attribute!), but if so, that would make the macro version generally preferable anyway. (I tried checking the assembly to see, but it was too complicated as part of a program. The optimizer omitted everything when I tried compiling just the functions of course, and I'm having issues compiling it as a library due to noobishness with GCC.)

In short, is there a way to write this as an inline without a speed hit? (I haven't profiled, but sqrt is one of those fundamental operations that should always be made fast, since I may be using it in many other programs than just the one I'm currently interested in. Besides, I'm just curious.)

I've even tried using templates to "bake in" the constant value, but I get the feeling the other two parameters are more likely to be causing the hit (and the macro can avoid that, since it uses local variables directly)...well, either that or the compiler is stubbornly refusing to inline.
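A template version along the lines described might look like this (my sketch; the post doesn't show the exact code tried). The shift amount becomes a compile-time constant exactly as in the macro, while val and root still go by reference, which is the suspected source of the difference:

```cpp
#include <cstdint>

//  Sketch of the "bake in the constant" idea: S is a template parameter, so
//  (1u << (2 * S - 2)) folds to a compile-time constant just as in the macro.
template <int S>
static inline void inner_isqrt_t(uint32_t& val, uint32_t& root)
{
    const uint32_t temp = (root << S) + (1u << (2 * S - 2));
    if (val >= temp)
    {
        root += 1u << (S - 1);
        val -= temp;
    }
}

static uint32_t template_isqrt(uint32_t val)
{
    uint32_t root = 0;

    if (val >= 0x40000000)
    {
        root = 0x8000;
        val -= 0x40000000;
    }

    inner_isqrt_t<15>(val, root);
    inner_isqrt_t<14>(val, root);
    inner_isqrt_t<13>(val, root);
    inner_isqrt_t<12>(val, root);
    inner_isqrt_t<11>(val, root);
    inner_isqrt_t<10>(val, root);
    inner_isqrt_t< 9>(val, root);
    inner_isqrt_t< 8>(val, root);
    inner_isqrt_t< 7>(val, root);
    inner_isqrt_t< 6>(val, root);
    inner_isqrt_t< 5>(val, root);
    inner_isqrt_t< 4>(val, root);
    inner_isqrt_t< 3>(val, root);
    inner_isqrt_t< 2>(val, root);

    const uint32_t temp = root + root + 1;
    if (val >= temp)
        root++;
    return root;
}
```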


UPDATE: user1034749 below is getting the same assembly output from both functions when he puts them in separate files and compiles them. I tried his exact command line, and I'm getting the same result as him. For all intents and purposes, this question is solved.

However, I'd still like to know why my measurements are coming out differently. Obviously, my measurement code or original build process was causing things to be different. I'll post the code below. Does anyone know what the deal was? Maybe my compiler is actually inlining the whole mcrowne_isqrt() function in the loop of my main() function, but it's not inlining the entirety of the other version?

UPDATE 2 (squeezed in before testing code): Note that if I swap the order of the tests and make the inline version come first, the inline version comes out faster than the macro version by the same amount. Is this a caching issue, or is the compiler inlining one call but not the other, or what?

#include <iostream>
#include <time.h>      //  Linux high-resolution timer
#include <stdint.h>

/*  Functions go here */

timespec timespecdiff(const timespec& start, const timespec& end)
{
    timespec elapsed;
    timespec endmod = end;
    if(endmod.tv_nsec < start.tv_nsec)
    {
        endmod.tv_sec -= 1;
        endmod.tv_nsec += 1000000000;
    }

    elapsed.tv_sec = endmod.tv_sec - start.tv_sec;
    elapsed.tv_nsec = endmod.tv_nsec - start.tv_nsec;
    return elapsed;
}


int main()
{
    uint64_t inputlimit = 4294967295;
    //  Test a wide range of values
    uint64_t widestep = 16;

    timespec start, end;

    //  Time macro version:
    uint32_t sum = 0;
    clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &start);
    for(uint64_t num = (widestep - 1); num <= inputlimit; num += widestep)
    {
        sum += mcrowne_isqrt(uint32_t(num));
    }
    clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &end);
    timespec markcrowntime = timespecdiff(start, end);
    std::cout << "Done timing Mark Crowne's sqrt variant.  Sum of results = " << sum << " (to avoid over-optimization)." << std::endl;


    //  Time inline version:
    sum = 0;
    clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &start);
    for(uint64_t num = (widestep - 1); num <= inputlimit; num += widestep)
    {
        sum += mcrowne_inline_isqrt(uint32_t(num));
    }
    clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &end);
    timespec markcrowninlinetime = timespecdiff(start, end);
    std::cout << "Done timing Mark Crowne's inline sqrt variant.  Sum of results = " << sum << " (to avoid over-optimization)." << std::endl;

    //  Results:
    std::cout << "Mark Crowne sqrt variant time:\t" << markcrowntime.tv_sec << "s, " << markcrowntime.tv_nsec << "ns" << std::endl;
    std::cout << "Mark Crowne inline sqrt variant time:\t" << markcrowninlinetime.tv_sec << "s, " << markcrowninlinetime.tv_nsec << "ns" << std::endl;
    std::cout << std::endl;
}

UPDATE 3: I still have no idea how to reliably compare the timing of different functions without the timing depending on the order of the tests. I'd greatly appreciate any tips!

However, if anyone else reading this is interested in fast sqrt implementations, I should mention: Mark Crowne's code tests faster than any other pure C/C++ version I've tried by a decent margin (despite the reliability issues with testing), but the following SSE code seems like it might be a little faster still for a scalar 32-bit integer sqrt. It can't be generalized to full-blown 64-bit unsigned integer inputs without losing precision, though (and the first signed conversion would also have to be replaced by a load intrinsic to handle values >= 2^63):

#include <emmintrin.h>  //  SSE2 intrinsics

uint32_t sse_sqrt(uint64_t num)
{
    //  Uses 64-bit input, because SSE conversion functions treat all
    //  integers as signed (so conversion from a 32-bit value >= 2^31
    //  will be interpreted as negative).  As it stands, this function
    //  will similarly fail for values >= 2^63.
    //  It can also probably be made faster, since it generates a strange/
    //  useless movsd %xmm0,%xmm0 instruction before the sqrtsd.  It clears
    //  xmm0 first too with xorpd (seems unnecessary, but I could be wrong).
    __m128d result = _mm_setzero_pd();  //  zeroed to avoid reading an uninitialized value
    __m128d num_as_sse_double = _mm_cvtsi64_sd(result, num);
    result = _mm_sqrt_sd(num_as_sse_double, num_as_sse_double);
    return _mm_cvttsd_si32(result);
}
Amperehour answered 22/11, 2011 at 2:34
Just a small point, but temp should be declared const in both functions. – Unpeg

Try passing it using template arguments. – Holbert

Good point, Kerrek SB. I didn't even think about that. Changing that didn't have any effect, but I'm updating the post anyway. – Amperehour

Thanks for the idea Pubby, but I already tried passing the constants using a template parameter, and it had no effect. That's what I meant by my last line about trying to "bake in" the constant value using templates. – Amperehour

inline doesn't guarantee that the compiler makes it inline. The compiler can still choose not to. You can force some compilers to inline with something like __forceinline on VC++ or __attribute__((always_inline)) in GCC. – Regenerative

As you can see, I actually did use __attribute__((always_inline)) in the declaration, just in case. ;) I'm still not sure that the compiler isn't trying to fight me though, which is why I made a feeble attempt at checking the assembly (to no avail). – Amperehour

"The INNER_ISQRT macro isn't too evil, since it's local and immediately undefined after it's no longer needed." #defines aren't scoped... – Mccrory

#defines may not be scoped, but the #undef seems to effectively mimic local scoping in this case. Due to its presence, I can't access INNER_ISQRT from outside the function where it's defined. There might be some subtle exception I'm missing, but that's going off on a tangent anyway. After all, the point of my question is still about finding a way to avoid the macro in the first place (without any performance hit). – Amperehour

(Aside from the #undef though, my comment about the macro being "local" was mostly about the proximity of its usage to its implementation. Macros just seem a bit less evil when their definition is close at hand instead of buried in some arcane header that heavily abuses the preprocessor.) – Amperehour

I am quite surprised that this would be faster than the floor(sqrt(x)) call. I guess the back-and-forth int -> float -> int is the culprit. – Nippy

@Mike S - oops, I didn't scroll right to see the __attribute__((always_inline)). I have seen the macro be faster, but that was with long functions with many variables. Analyzing the assembly, I figured the compiler (VC++ 2010) was juggling the registers differently when I used inline rather than define, although I was a bit surprised there was a difference too! – Regenerative

Matthieu, I'm sure you're right about the conversions being the likely culprit. Load-hit-stores are nasty, and they're the whole reason I went looking for integer-specific square root functions. The odd timing discrepancy between the inline/macro versions of Mark Crowne's isqrt was small in my testing, but floor(sqrt()) takes nearly three times as long. (Note that I haven't timed the SSE sqrtss intrinsic yet, though.) – Amperehour

ARRRRRRGH. Curious, I changed the order of the floor(sqrt()) test to put it first...and now it "only" takes 50% longer than mcrowne_isqrt rather than 2-3 times as long. I REALLY need to figure out a more reliable way of testing these functions! – Amperehour

I tried your code with gcc 4.5.3. I modified your second version to match the first one; for example:

(1 << ((s) * 2 - 2))

vs

(1 << ((s << 1) - 1))

Yes, s * 2 == s << 1, but "-2" vs "-1"?

Also, I modified your types, replacing uint32_t with "unsigned long", because on my 64-bit machine "long" is not a 32-bit number.

And then I run:

g++ -ggdb -O2 -march=native -c -pipe inline.cpp
g++ -ggdb -O2 -march=native -c -pipe macros.cpp
objdump -d inline.o > inline.s
objdump -d macros.o > macros.s

I could have used "-S" instead of "-c" to get assembly directly, but I wanted to see the assembly without additional info.

And you know what?
The assembly is completely the same in the first and in the second version. So I think your time measurements are just wrong.

Fimbriation answered 22/11, 2011 at 5:29
Oops, the -1 was a total typo. I must have done it in a rush, but you're right about the code being flat-out wrong. Fixed! I'll match the types of the two functions just to be sure, but so far I'm getting the same measurements as before using Linux's high-resolution timer (over 2^32 - 1 iterations). – Amperehour

Okay, I modified the types of the original function to be uint32_t (rather than making the inline function match the original), because this code only works for 32-bit inputs anyway. A 64-bit sqrt would require a few changes, so technically speaking the ambiguous "unsigned long" is the wrong type to use for a 64-bit machine. Anyway, I'm measuring them exactly the same as before. I'll post my measurement code above and modify the type of the original function to boot. – Amperehour

Oops, small correction: I've tested the full gamut as well, but the times I gave were for (2^28 - 1) iterations, not (2^32 - 1). Anyway, I fixed the functions and posted my timing code above, but I'm still getting the same timing difference. I'll try with your build options, since you're getting the same assembly code output. – Amperehour

I just compiled each function in a separate file using your options (thank you for those btw; I'm a noob with GCC and usually use the build chain from my IDE), and...you're right. The two functions compile to the exact same code. This is quite different from compiling them together as part of a program. My best guess is that my test program is inlining the mcrowne_isqrt() call in the loop in my main function, but it's not inlining the mcrowne_inline_isqrt() call. Are there any other possibilities? – Amperehour

What version of gcc do you use? – Fimbriation

Sonofa...I think the problem is my measurement code. I just swapped the order of the tests, and now the inline version is faster than the macro version by the same amount. Could this be a caching issue? It seems unlikely given the huge number of iterations though. I guess the only other possibility is that the compiler is inlining one of the sqrt calls but not the other, depending on the order they're called in... – Amperehour

Just a guess: the kernel scheduler may track your CPU usage, and because you hog the CPU it reduces your dynamic priority, so you get less CPU toward the end of your test and the last algorithm, whichever one it is, shows the worse result. – Fimbriation

I set the clock to CLOCK_PROCESS_CPUTIME_ID to head off issues like that, but could they still be creeping in? If they are, it seems like that would wreak havoc with testing of all kinds. I'm searching for a command to force a certain priority from process startup to completion, but I haven't found it yet (everything I've found is about altering the priority of already-running processes). Would you know offhand? – Amperehour

I tried running with the highest possible priority using "nice -n -20 ./infuriatingtest". It didn't affect anything, though, and the discrepancies are still there based on test ordering. It could still be a scheduling issue if certain tests are being interrupted more than others or for longer, but CLOCK_PROCESS_CPUTIME_ID should still be taking care of that AFAIK...hrm. Any other ideas? – Amperehour
