How to generate a sse4.2 popcnt machine instruction
C

3

14

Using the C program:

int main(int argc , char** argv)
{

  return  __builtin_popcountll(0xf0f0f0f0f0f0f0f0);

}

and the compiler line (gcc 4.4 - Intel Xeon L3426):

gcc -msse4.2 poptest.c -o poptest

I do NOT get the builtin popcnt instruction; instead, the compiler generates a lookup table and computes the popcount that way. The resulting binary is over 8000 bytes. (Yuk!)

Thanks so much for any assistance.

Cathrine answered 21/6, 2011 at 15:2 Comment(1)
gcc since at least 4.4.7 (oldest on godbolt) enables -mpopcnt as part of -msse4.2, even though they have separate CPUID feature bits. godbolt.org/g/SfcHYh. Also, if you use __builtin_popcountll(argc), your program won't optimize to return 32 when you enable optimization. Or just look at the asm for a function with an int arg, since you just want to look at asm, not run it. However, -march=native is by far the best choice if you're going to run your binary locally, since it sets -mtune as well as enabling instructions. Incompetent
T
26

You have to tell GCC to generate code for an architecture that supports the popcnt instruction:

gcc -march=corei7 popcnt.c

Or just enable support for popcnt:

gcc -mpopcnt popcnt.c

In your example program the parameter to __builtin_popcountll is a constant so the compiler will probably do the calculation at compile time and never emit the popcnt instruction. GCC does this even if not asked to optimize the program.

So try passing it something that it can't know at compile time:

int main (int argc, char** argv)
{
    return  __builtin_popcountll ((long long) argv);
}

$ gcc -march=corei7 -O popcnt.c && objdump -d a.out | grep '<main>' -A 2
0000000000400454 <main>:
  400454:       f3 48 0f b8 c6          popcnt %rsi,%rax
  400459:       c3                      retq
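
To inspect the generated code without running anything, one option is to compile a small function whose argument is not known at compile time. The following is a minimal sketch; the name count_bits is made up for illustration and does not come from the original post.

```c
/* count_bits is a hypothetical helper name used only for this example.
 * Because x is a function parameter, the compiler cannot fold the
 * result into a constant, so the count has to be computed at run time. */
int count_bits(unsigned long long x)
{
    return __builtin_popcountll(x);
}
```

Compiling this with gcc -O2 -mpopcnt -S and reading the assembly output should show a popcnt instruction; without -mpopcnt (or a suitable -march), GCC typically emits a call to __popcountdi2 instead.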
Thorn answered 3/11, 2012 at 18:10 Comment(0)
S
4

You need to do it like this:

#include <stdio.h>
#include <smmintrin.h>

int main(void)
{
    int pop = _mm_popcnt_u64(0xf0f0f0f0f0f0f0f0ULL);
    printf("pop = %d\n", pop);
    return 0;
}

$ gcc -Wall -m64 -msse4.2 popcnt.c -o popcnt
$ ./popcnt 
pop = 32
$ 

EDIT

Oops - I just checked the disassembly output with gcc 4.2 and ICC 11.1 - while ICC 11.1 correctly generates popcntl or popcntq, for some reason gcc does not - it calls ___popcountdi2 instead. Weird. I will try a newer version of gcc when I get a chance and see if it's fixed. I guess the only workaround otherwise is to use ICC instead of gcc.
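
For context, the software fallback mentioned in the question counts bits with a lookup table. The sketch below illustrates that general idea only. It is not libgcc's actual source, and the names nibble_bits and popcount64_table are invented for this example.

```c
#include <stdint.h>

/* Number of set bits in each 4-bit value 0..15. */
static const unsigned char nibble_bits[16] = {
    0, 1, 1, 2, 1, 2, 2, 3,
    1, 2, 2, 3, 2, 3, 3, 4
};

/* Table-driven popcount: add the table entry for each nibble of x. */
int popcount64_table(uint64_t x)
{
    int count = 0;
    while (x != 0) {
        count += nibble_bits[x & 0xF];  /* bits set in the low nibble */
        x >>= 4;                        /* move on to the next nibble */
    }
    return count;
}
```

A routine like this needs several loads, shifts and adds per value, which is why a single popcnt instruction is attractive when the CPU supports it.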

Sang answered 21/6, 2011 at 15:27 Comment(4)
Thanks so much Paul for investigating this. Your code, compiled with gcc (Ubuntu 4.4.3-4ubuntu5) 4.4.3, still generates a large lookup table. I'm going to try installing icc. Great tip there! Cathrine
I just tried gcc 4.4.6 from MacPorts and that seems to generate popcnt instructions OK, so it looks like this may have been fixed somewhere between 4.4.3 and 4.4.6. Sang
So a friend of mine chimed in that I need to use objdump, not x86dis, to look for the popcnt instruction. When using your program and objdump I see: 400533: f3 0f b8 45 f8 popcnt -0x8(%rbp),%eax so I think I am all good now. Thanks again so very much. Cathrine
OK - I was using gcc -S to generate asm source and looking at that. BTW, you might still want to consider ICC though, if your application is performance-critical. Good luck! Sang
O
2

For __builtin_popcountll in GCC, all you need to do is add -mpopcnt

#include <stdlib.h>
int main(int argc, char **argv) {
    return __builtin_popcountll(atoi(argv[1]));
}

with -mpopcnt

$ otool -tvV a.out
a.out:
(__TEXT,__text) section
_main:
0000000100000f66    pushq   %rbp
0000000100000f67    movq    %rsp, %rbp
0000000100000f6a    subq    $0x10, %rsp
0000000100000f6e    movq    %rdi, -0x8(%rbp)
0000000100000f72    movq    -0x8(%rbp), %rax
0000000100000f76    addq    $0x8, %rax
0000000100000f7a    movq    (%rax), %rax
0000000100000f7d    movq    %rax, %rdi
0000000100000f80    callq   0x100000f8e ## symbol stub for: _atoi
0000000100000f85    cltq
0000000100000f87    popcntq %rax, %rax
0000000100000f8c    leave
0000000100000f8d    retq

without -mpopcnt

a.out:
(__TEXT,__text) section
_main:
0000000100000f55    pushq   %rbp
0000000100000f56    movq    %rsp, %rbp
0000000100000f59    subq    $0x10, %rsp
0000000100000f5d    movq    %rdi, -0x8(%rbp)
0000000100000f61    movq    -0x8(%rbp), %rax
0000000100000f65    addq    $0x8, %rax
0000000100000f69    movq    (%rax), %rax
0000000100000f6c    movq    %rax, %rdi
0000000100000f6f    callq   0x100000f86 ## symbol stub for: _atoi
0000000100000f74    cltq
0000000100000f76    movq    %rax, %rdi
0000000100000f79    callq   0x100000f80 ## symbol stub for: ___popcountdi2
0000000100000f7e    leave
0000000100000f7f    retq

Notes

Be sure to check the POPCNT feature bit (bit 23 of ECX from CPUID leaf 1) before using POPCNTQ. The ABM flag is a separate AMD feature bit.
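
A run-time check along these lines could look like the sketch below. It assumes GCC or Clang on x86, where the cpuid.h header provides __get_cpuid, and it reads ECX bit 23 of CPUID leaf 1, which is where POPCNT support is reported. The function names (cpu_has_popcnt, popcount64_hw, popcount64_soft, popcount64) are invented for this example.

```c
#include <stdint.h>

#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>   /* GCC/Clang header providing __get_cpuid */

/* Returns nonzero if CPUID leaf 1 reports POPCNT support (ECX bit 23). */
int cpu_has_popcnt(void)
{
    unsigned int eax, ebx, ecx, edx;
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return 0;                  /* leaf 1 not available */
    return (ecx >> 23) & 1;
}

/* POPCNT code generation is enabled for this one function only. */
__attribute__((target("popcnt")))
static int popcount64_hw(uint64_t x)
{
    return __builtin_popcountll(x);
}
#else
/* Non-x86 builds: report no POPCNT; the stub below is never reached. */
int cpu_has_popcnt(void) { return 0; }
static int popcount64_hw(uint64_t x) { (void)x; return 0; }
#endif

/* Portable fallback: clear the lowest set bit until none remain. */
static int popcount64_soft(uint64_t x)
{
    int count = 0;
    while (x != 0) {
        x &= x - 1;
        count++;
    }
    return count;
}

/* Use the instruction only when the CPU reports support for it. */
int popcount64(uint64_t x)
{
    return cpu_has_popcnt() ? popcount64_hw(x) : popcount64_soft(x);
}
```

Querying CPUID on every call is wasteful; a real program would probably cache the result of cpu_has_popcnt() once at startup.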

Outlet answered 15/9, 2016 at 2:37 Comment(2)
You could just show disassembly for a function taking an int arg; it would be much shorter and clearer than this. Also, I think you forgot to enable optimization, because -O2 enables -fomit-frame-pointer. Incompetent
Oh lord. Perfect is the enemy of the shipped. Outlet
