Performance discrepancy in compiled vs. hand-written assembly

Asked 24/8, 2014 at 11:46 Answered 25/8, 2014 at 2:30

I've been playing around with using assembly language in Go and I've written a Hamming Weight function as an excercise.

I've based a native Go version on this SO answer and the assembly version is based on this doc from AMD (page 180). Upon benchmarking the two functions, I find that the native Go version is around 1.5x - 2x faster than the assembly version, despite the hand-written assembly version being almost identical to the output from go tool 6g -S popcount.go.

output from go test -bench=.

PASS
BenchmarkPopCount       100000000               19.4 ns/op 
BenchmarkPopCount_g     200000000                8.97 ns/op
ok      popcount   4.777s

popcount.go

package popcount

func popCount(i uint32) uint32 // Defined in popcount_amd64.s

func popCount_g(i uint32) uint32 {
    i = i - ((i >> 1) & 0x55555555)
    i = (i & 0x33333333) + ((i >> 2) & 0x33333333)
    return (((i + (i >> 4)) & 0x0F0F0F0F) * 0x01010101) >> 24
}

popcount_test.go

package popcount

import "testing"

func TestPopcount(t *testing.T) {
    for i := uint32(0); i < uint32(100); i++ {
        if popCount(i) != popCount_g(i) {
            t.Fatalf("failed on input = %v", i)
        }
    }
}

func BenchmarkPopCount(b *testing.B) {
    for i := 0; i < b.N; i++ {
        popCount(uint32(i))
    }
}

func BenchmarkPopCount_g(b *testing.B) {
    for i := 0; i < b.N; i++ {
        popCount_g(uint32(i))
    }
}

popcount_amd64.s

// func popCount(i uint32) uint32
TEXT ·popCount(SB),$0
    MOVL i+0(FP), BP        // i
    MOVL BP, BX             // i
    SHRL $1, BX             // i >> 1
    ANDL $0x055555555, BX   // (i >> 1) & 0x55555555
    SUBL BX, BP             // w = i - ((i >> 1) & 0x55555555)
    MOVL BP, AX             // w
    SHRL $2, BP             // w >> 2
    ANDL $0x033333333, AX   // w & 0x33333333
    ANDL $0x033333333, BP   // (w >> 2) & 0x33333333
    ADDL BP, AX             // x = (w & 0x33333333) + ((w >> 2) & 0x33333333)
    MOVL AX, BX             // x
    SHRL $4, BX             // x >> 4
    ADDL AX, BX             // x + (x >> 4)
    ANDL $0x00F0F0F0F, BX   // y = (x + (x >> 4) & 0x0F0F0F0F)
    IMULL $0x001010101, BX  // y * 0x01010101
    SHRL $24, BX            // population count = (y * 0x01010101) >> 24
    MOVL BX, toReturn+8(FP) // Store result.
    RET                     // return

output from go tool 6g -S popcount.go

"".popCount_g t=1 size=64 value=0 args=0x10 locals=0
        000000 00000 (popcount.go:5)    TEXT    "".popCount_g+0(SB),4,$0-16
        000000 00000 (popcount.go:5)    NOP     ,
        000000 00000 (popcount.go:5)    NOP     ,
        000000 00000 (popcount.go:5)    MOVL    "".i+8(FP),BP
        0x0004 00004 (popcount.go:5)    FUNCDATA        $2,gclocals┬À9308e7ef08d2cc2f72ae1228688dacf9+0(SB)
        0x0004 00004 (popcount.go:5)    FUNCDATA        $3,gclocals┬À3280bececceccd33cb74587feedb1f9f+0(SB)
        0x0004 00004 (popcount.go:6)    MOVL    BP,BX
        0x0006 00006 (popcount.go:6)    SHRL    $1,BX
        0x0008 00008 (popcount.go:6)    ANDL    $1431655765,BX
        0x000e 00014 (popcount.go:6)    SUBL    BX,BP
        0x0010 00016 (popcount.go:7)    MOVL    BP,AX
        0x0012 00018 (popcount.go:7)    ANDL    $858993459,AX
        0x0017 00023 (popcount.go:7)    SHRL    $2,BP
        0x001a 00026 (popcount.go:7)    ANDL    $858993459,BP
        0x0020 00032 (popcount.go:7)    ADDL    BP,AX
        0x0022 00034 (popcount.go:8)    MOVL    AX,BX
        0x0024 00036 (popcount.go:8)    SHRL    $4,BX
        0x0027 00039 (popcount.go:8)    ADDL    AX,BX
        0x0029 00041 (popcount.go:8)    ANDL    $252645135,BX
        0x002f 00047 (popcount.go:8)    IMULL   $16843009,BX
        0x0035 00053 (popcount.go:8)    SHRL    $24,BX
        0x0038 00056 (popcount.go:8)    MOVL    BX,"".~r1+16(FP)
        0x003c 00060 (popcount.go:8)    RET     ,

I know from here that the FUNCDATA lines contain information for the garbage collector, but other than that I don't see any glaring differences.

What could be causing this large difference in speed between the 2 functions?

Gavial answered 24/8, 2014 at 11:46 Comment(5)

One thing to bear in mind is that the linker in go is a lot cleverer than a traditional C linker. In particular it can transform instructions, re-order code etc, so I suggest you look at the final binaries with objdump rather than comparing the go assembler output. – Thermostat 24/8, 2014 at 11:50

@NickCraig-Wood objdump shows that the assembly is still very similar, however the hand-written assembly version shows callq 420610 <runtime.morestack00_noctxt> before calling the function. The native go version has the entire assembly in main.main so I'm guessing the difference is the time taken to allocate more memory, as the extra jump shouldn't take that long. Any thoughts? – Gavial 24/8, 2014 at 12:57

@NickCraig-Wood as per peterSO's answer, it looks like the call-overhead can cost a few ns per op. Ah well, now I need to try to learn how to inline the assembly code if I can... Down the rabbit hole I go! – Gavial 24/8, 2014 at 13:57

AFAIK it isn't possible to inline assembly code :-( It has certainly been discussed on the dev list in the past. – Thermostat 24/8, 2014 at 14:4

@NickCraig-Wood Aha, good to know before I dig too deep. Thanks again. – Gavial 24/8, 2014 at 14:5

If you look at the pseudo-assembler for the funnction calls you will see that the .s (assembler) version is called, with call overhead, and the .go version is inlined.

func S() {
    pc := popCount(uint32(0))
    _ = pc
}

"".S t=1 size=48 value=0 args=0x0 locals=0x10
    0x0000 00000 (popcount.go:11)   TEXT    "".S+0(SB),$16-0
    0x0000 00000 (popcount.go:11)   MOVQ    (TLS),CX
    0x0009 00009 (popcount.go:11)   CMPQ    SP,(CX)
    0x000c 00012 (popcount.go:11)   JHI ,21
    0x000e 00014 (popcount.go:11)   CALL    ,runtime.morestack00_noctxt(SB)
    0x0013 00019 (popcount.go:11)   JMP ,0
    0x0015 00021 (popcount.go:11)   SUBQ    $16,SP
    0x0019 00025 (popcount.go:11)   FUNCDATA    $2,gclocals·3280bececceccd33cb74587feedb1f9f+0(SB)
    0x0019 00025 (popcount.go:11)   FUNCDATA    $3,gclocals·3280bececceccd33cb74587feedb1f9f+0(SB)
    0x0019 00025 (popcount.go:12)   MOVL    $0,(SP)
    0x0020 00032 (popcount.go:12)   PCDATA  $1,$0
    0x0020 00032 (popcount.go:12)   CALL    ,"".popCount(SB)
    0x0025 00037 (popcount.go:12)   MOVL    8(SP),BX
    0x0029 00041 (popcount.go:12)   NOP ,
    0x0029 00041 (popcount.go:14)   ADDQ    $16,SP
    0x002d 00045 (popcount.go:14)   RET ,

func S_G() {
    pc := popCount_g(uint32(0))
    _ = pc
}

"".S_G t=1 size=64 value=0 args=0x0 locals=0x8
    0x0000 00000 (popcount.go:16)   TEXT    "".S_G+0(SB),4,$8-0
    0x0000 00000 (popcount.go:16)   SUBQ    $8,SP
    0x0004 00004 (popcount.go:16)   FUNCDATA    $2,gclocals·3280bececceccd33cb74587feedb1f9f+0(SB)
    0x0004 00004 (popcount.go:16)   FUNCDATA    $3,gclocals·3280bececceccd33cb74587feedb1f9f+0(SB)
    0x0004 00004 (popcount.go:17)   MOVL    $0,BP
    0x0006 00006 (popcount.go:17)   MOVL    BP,BX
    0x0008 00008 (popcount.go:17)   SHRL    $1,BX
    0x000a 00010 (popcount.go:17)   ANDL    $1431655765,BX
    0x0010 00016 (popcount.go:17)   SUBL    BX,BP
    0x0012 00018 (popcount.go:17)   MOVL    BP,AX
    0x0014 00020 (popcount.go:17)   ANDL    $858993459,AX
    0x0019 00025 (popcount.go:17)   SHRL    $2,BP
    0x001c 00028 (popcount.go:17)   ANDL    $858993459,BP
    0x0022 00034 (popcount.go:17)   ADDL    BP,AX
    0x0024 00036 (popcount.go:17)   MOVL    AX,BX
    0x0026 00038 (popcount.go:17)   SHRL    $4,BX
    0x0029 00041 (popcount.go:17)   ADDL    AX,BX
    0x002b 00043 (popcount.go:17)   ANDL    $252645135,BX
    0x0031 00049 (popcount.go:17)   IMULL   $16843009,BX
    0x0037 00055 (popcount.go:17)   SHRL    $24,BX
    0x003a 00058 (popcount.go:19)   ADDQ    $8,SP
    0x003e 00062 (popcount.go:19)   RET ,

Astomatous answered 24/8, 2014 at 13:12 Comment(1)

Thanks, you are correct (I discovered the same thing using @NickCraig-Wood 's suggestion of using objdump to examine the final binaries). – Gavial 24/8, 2014 at 13:21

If you actually want a popcnt, it's been a native instruction on X86 processors for a while now. Your compiler may have an intrinsic for it, such as Microsoft's __popcnt() (also __popcnt16() and __popcnt64() ), or GCC's __builtin_popcount() (or __builtin_popcountll() for 64 bit).

Simba answered 25/8, 2014 at 2:30 Comment(2)

Yes, I'm aware of the popcnt instruction on recent Intel and AMD processors. This was really just an exercise in learning the Plan9 based dialect of assembly that Go uses, and I came across this performance difference while testing. I have a feeling that certain instructions (such as popcnt) aren't supported by the Plan9 / Go assembly language, and there doesn't seem to be a popcnt or similar function available. – Gavial 25/8, 2014 at 2:40

It's not supported last I checked, but you can encode the instruction yourself and include the literal bytes in hex format in the assembly, e.g. BYTE $0xf3; BYTE $0x48; BYTE $0x0f; BYTE $0xb8; BYTE $0x16 // POPCNTQ (SI), DX – Schizogenesis 11/7, 2015 at 0:28

Recommended topics

Hot tags