What's the difference between __builtin_popcountll and _mm_popcnt_u64?
C

3

12

I was trying to count how many 1 bits there are in 512 MB of memory, and I found two possible methods: _mm_popcnt_u64() and __builtin_popcountll() from the GCC builtins.

_mm_popcnt_u64() is said to use a CPU instruction introduced with SSE4.2, which seems to be the fastest option, and __builtin_popcountll() is expected to use a table lookup.

So I think __builtin_popcountll() should be a little slower than _mm_popcnt_u64().

However I got a result like this:

Test result

Both methods took almost the same time. I strongly suspect that they work the same way.

I also found this in popcntintrin.h:

/* Calculate a number of bits set to 1. */
extern __inline int __attribute__((__gnu_inline__, __always_inline__, __artificial__))
_mm_popcnt_u32 (unsigned int __X)
{
  return __builtin_popcount (__X);
}

#ifdef __x86_64__
extern __inline long long __attribute__((__gnu_inline__, __always_inline__, __artificial__))
_mm_popcnt_u64 (unsigned long long __X)
{
  return __builtin_popcountll (__X);
}
#endif

So I'm confused about how on earth __builtin_popcountll() actually works.

Conidiophore answered 30/6, 2016 at 3:2 Comment(2)
The timings of both are quite similar (nearly equal) on my Skylake CPU. – Lucialucian
popcount is unlikely to use a table lookup, as it's simpler and faster to implement it with just basic arithmetic. – Claar
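
For illustration, the "basic arithmetic" approach from the comment above can be sketched as a bit-parallel sum. This is only one well-known way to do it, not necessarily what GCC or its runtime library emits; the function name popcount64_swar is made up for this example.

```c
#include <stdint.h>

/* Hypothetical helper: portable popcount with no lookup table and no
 * special instruction. Each step adds neighbouring bit fields in parallel. */
static int popcount64_swar(uint64_t x)
{
    /* Every 2-bit field now holds the number of set bits it contained. */
    x = x - ((x >> 1) & 0x5555555555555555ULL);
    /* Add adjacent 2-bit counts into 4-bit fields. */
    x = (x & 0x3333333333333333ULL) + ((x >> 2) & 0x3333333333333333ULL);
    /* Add adjacent 4-bit counts into bytes (each byte is at most 8). */
    x = (x + (x >> 4)) & 0x0F0F0F0F0F0F0F0FULL;
    /* The multiplication sums all eight bytes into the top byte. */
    return (int)((x * 0x0101010101010101ULL) >> 56);
}
```

Calling popcount64_swar(0xF0F0) returns 8, and the result should agree with __builtin_popcountll() for every input.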
T
20

_mm_popcnt_u64 is part of <nmmintrin.h>, a header devised by Intel for utility functions for accessing SSE 4.2 instructions.

__builtin_popcountll is a GCC extension.

_mm_popcnt_u64 is portable to non-GNU compilers, and __builtin_popcountll is portable to non-SSE-4.2 CPUs. But on systems where both are available, both should compile to the exact same code.
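
A minimal sketch of using both side by side, assuming GCC or Clang. The helper names (count_bits_builtin, count_bits_intrinsic, count_ones) are invented for this example, and the intrinsic is only used when the compiler reports SSE4.2 support on x86-64; otherwise the sketch falls back to the builtin.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__SSE4_2__) && defined(__x86_64__)
#include <nmmintrin.h>   /* declares _mm_popcnt_u64 */
#endif

/* GCC/Clang extension; works with or without a hardware popcnt. */
static int count_bits_builtin(uint64_t x)
{
    return __builtin_popcountll(x);
}

/* Intel intrinsic where available, builtin otherwise. */
static int count_bits_intrinsic(uint64_t x)
{
#if defined(__SSE4_2__) && defined(__x86_64__)
    return (int)_mm_popcnt_u64(x);
#else
    return __builtin_popcountll(x);
#endif
}

/* Count the set bits in a buffer of n 64-bit words. */
static uint64_t count_ones(const uint64_t *buf, size_t n)
{
    uint64_t total = 0;
    for (size_t i = 0; i < n; i++)
        total += (uint64_t)count_bits_builtin(buf[i]);
    return total;
}
```

To check the "same code" claim on your own toolchain, one option is to compile with something like gcc -O2 -msse4.2 -S and compare the assembly of the two one-word helpers.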

Tonl answered 13/6, 2017 at 16:19 Comment(0)
J
1

If you compile without a -march flag, i.e. for the default x86_64 target, the builtin should be slower, because it has to call a function that dispatches between different architectures. That prevents inlining and adds an extra condition.

Jeannajeanne answered 18/11, 2019 at 20:22 Comment(0)
W
0

Besides any other consideration, you cannot just time a loop once and use the raw number as a meaningful benchmark.

As a rule of thumb, if it doesn't have error bars (variance), it is not a proper measurement of anything. Next time, try running each benchmark 10 times (or 1000), compute the mean and standard deviation, and make sure one of the results is better or worse than the other with high statistical confidence, i.e. > 99.9%.

https://en.wikipedia.org/wiki/Standard_deviation#Estimation
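
A rough sketch of the repeated-measurement idea, under a few assumptions: timing uses the standard clock() function (CPU time, coarse resolution), the names run_stats, summarize and time_popcount_run are invented for this example, and the sample variance is returned rather than its square root so that nothing from libm is needed (the standard deviation is sqrt() of that value).

```c
#include <stddef.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical result type: mean and sample variance of the timings. */
typedef struct {
    double mean;
    double variance;   /* standard deviation = sqrt(variance) */
} run_stats;

/* Mean and unbiased (n - 1) sample variance of n measurements. */
static run_stats summarize(const double *samples, size_t n)
{
    run_stats s = {0.0, 0.0};
    if (n == 0)
        return s;

    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += samples[i];
    s.mean = sum / (double)n;

    if (n > 1) {
        double squares = 0.0;
        for (size_t i = 0; i < n; i++) {
            double d = samples[i] - s.mean;
            squares += d * d;
        }
        s.variance = squares / (double)(n - 1);
    }
    return s;
}

/* Time one pass of __builtin_popcountll over the buffer, in seconds. */
static double time_popcount_run(const uint64_t *buf, size_t n)
{
    volatile uint64_t sink;   /* keeps the loop from being optimized away */
    uint64_t total = 0;

    clock_t start = clock();
    for (size_t i = 0; i < n; i++)
        total += (uint64_t)__builtin_popcountll(buf[i]);
    sink = total;
    (void)sink;

    return (double)(clock() - start) / CLOCKS_PER_SEC;
}
```

The intended use is to call time_popcount_run() many times for each variant, store the results in an array, and pass each array to summarize(); two variants only differ meaningfully if their means are separated by clearly more than the spread.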

And as a side note, a 0.1% difference in a benchmark should usually be considered statistical noise, especially if you are measuring CPU intrinsics or any other function that takes under 100 cycles to execute.

Wigwag answered 26/9, 2023 at 22:9 Comment(0)
