I've been playing around with using assembly language in Go and I've written a Hamming Weight function as an excercise.
I've based a native Go version on this SO answer and the assembly version is based on this doc from AMD (page 180). Upon benchmarking the two functions, I find that the native Go version is around 1.5x - 2x faster than the assembly version, despite the hand-written assembly version being almost identical to the output from go tool 6g -S popcount.go
.
output from go test -bench=.
PASS
BenchmarkPopCount 100000000 19.4 ns/op
BenchmarkPopCount_g 200000000 8.97 ns/op
ok popcount 4.777s
popcount.go
package popcount
func popCount(i uint32) uint32 // Defined in popcount_amd64.s
func popCount_g(i uint32) uint32 {
i = i - ((i >> 1) & 0x55555555)
i = (i & 0x33333333) + ((i >> 2) & 0x33333333)
return (((i + (i >> 4)) & 0x0F0F0F0F) * 0x01010101) >> 24
}
popcount_test.go
package popcount
import "testing"
func TestPopcount(t *testing.T) {
for i := uint32(0); i < uint32(100); i++ {
if popCount(i) != popCount_g(i) {
t.Fatalf("failed on input = %v", i)
}
}
}
func BenchmarkPopCount(b *testing.B) {
for i := 0; i < b.N; i++ {
popCount(uint32(i))
}
}
func BenchmarkPopCount_g(b *testing.B) {
for i := 0; i < b.N; i++ {
popCount_g(uint32(i))
}
}
popcount_amd64.s
// func popCount(i uint32) uint32
TEXT ·popCount(SB),$0
MOVL i+0(FP), BP // i
MOVL BP, BX // i
SHRL $1, BX // i >> 1
ANDL $0x055555555, BX // (i >> 1) & 0x55555555
SUBL BX, BP // w = i - ((i >> 1) & 0x55555555)
MOVL BP, AX // w
SHRL $2, BP // w >> 2
ANDL $0x033333333, AX // w & 0x33333333
ANDL $0x033333333, BP // (w >> 2) & 0x33333333
ADDL BP, AX // x = (w & 0x33333333) + ((w >> 2) & 0x33333333)
MOVL AX, BX // x
SHRL $4, BX // x >> 4
ADDL AX, BX // x + (x >> 4)
ANDL $0x00F0F0F0F, BX // y = (x + (x >> 4) & 0x0F0F0F0F)
IMULL $0x001010101, BX // y * 0x01010101
SHRL $24, BX // population count = (y * 0x01010101) >> 24
MOVL BX, toReturn+8(FP) // Store result.
RET // return
output from go tool 6g -S popcount.go
"".popCount_g t=1 size=64 value=0 args=0x10 locals=0
000000 00000 (popcount.go:5) TEXT "".popCount_g+0(SB),4,$0-16
000000 00000 (popcount.go:5) NOP ,
000000 00000 (popcount.go:5) NOP ,
000000 00000 (popcount.go:5) MOVL "".i+8(FP),BP
0x0004 00004 (popcount.go:5) FUNCDATA $2,gclocals┬À9308e7ef08d2cc2f72ae1228688dacf9+0(SB)
0x0004 00004 (popcount.go:5) FUNCDATA $3,gclocals┬À3280bececceccd33cb74587feedb1f9f+0(SB)
0x0004 00004 (popcount.go:6) MOVL BP,BX
0x0006 00006 (popcount.go:6) SHRL $1,BX
0x0008 00008 (popcount.go:6) ANDL $1431655765,BX
0x000e 00014 (popcount.go:6) SUBL BX,BP
0x0010 00016 (popcount.go:7) MOVL BP,AX
0x0012 00018 (popcount.go:7) ANDL $858993459,AX
0x0017 00023 (popcount.go:7) SHRL $2,BP
0x001a 00026 (popcount.go:7) ANDL $858993459,BP
0x0020 00032 (popcount.go:7) ADDL BP,AX
0x0022 00034 (popcount.go:8) MOVL AX,BX
0x0024 00036 (popcount.go:8) SHRL $4,BX
0x0027 00039 (popcount.go:8) ADDL AX,BX
0x0029 00041 (popcount.go:8) ANDL $252645135,BX
0x002f 00047 (popcount.go:8) IMULL $16843009,BX
0x0035 00053 (popcount.go:8) SHRL $24,BX
0x0038 00056 (popcount.go:8) MOVL BX,"".~r1+16(FP)
0x003c 00060 (popcount.go:8) RET ,
I know from here that the FUNCDATA
lines contain information for the garbage collector, but other than that I don't see any glaring differences.
What could be causing this large difference in speed between the 2 functions?
callq 420610 <runtime.morestack00_noctxt>
before calling the function. The native go version has the entire assembly inmain.main
so I'm guessing the difference is the time taken to allocate more memory, as the extra jump shouldn't take that long. Any thoughts? – Gavial