Array vs Slice: accessing speed

Asked 29/5, 2015 at 8:49 Answered 11/4 at 11:13

Solved arrays performance go benchmarking slice

This question is about the speed of accessing elements of arrays and slices, not about the efficiency of passing them to functions as arguments.

I would expect arrays to be faster than slices in most cases because a slice is a data structure describing a contiguous section of an array and so there may be an extra step involved when accessing elements of a slice (indirectly the elements of its underlying array).

So I wrote a little test to benchmark a batch of simple operations. There are 4 benchmark functions, the first 2 test a global slice and a global array, the other 2 test a local slice and a local array:

var gs = make([]byte, 1000) // Global slice
var ga [1000]byte           // Global array

func BenchmarkSliceGlobal(b *testing.B) {
    for i := 0; i < b.N; i++ {
        for j, v := range gs {
            gs[j]++; gs[j] = gs[j] + v + 10; gs[j] += v
        }
    }
}

func BenchmarkArrayGlobal(b *testing.B) {
    for i := 0; i < b.N; i++ {
        for j, v := range ga {
            ga[j]++; ga[j] = ga[j] + v + 10; ga[j] += v
        }
    }
}

func BenchmarkSliceLocal(b *testing.B) {
    var s = make([]byte, 1000)
    for i := 0; i < b.N; i++ {
        for j, v := range s {
            s[j]++; s[j] = s[j] + v + 10; s[j] += v
        }
    }
}

func BenchmarkArrayLocal(b *testing.B) {
    var a [1000]byte
    for i := 0; i < b.N; i++ {
        for j, v := range a {
            a[j]++; a[j] = a[j] + v + 10; a[j] += v
        }
    }
}

I ran the test multiple times, here is the typical output (go test -bench .*):

BenchmarkSliceGlobal      300000              4210 ns/op
BenchmarkArrayGlobal      300000              4123 ns/op
BenchmarkSliceLocal       500000              3090 ns/op
BenchmarkArrayLocal       500000              3768 ns/op

Analyzing the results:

Accessing the global slice is slightly slower than accessing the global array which is as I expected:
4210 vs 4123 ns/op

But accessing the local slice is significantly faster than accessing the local array:
3090 vs 3768 ns/op

My question is: What is the reason for this?

Notes

I tried varying the following things but none changed the outcome:

the size of the array/slice (tried 100, 1000, 10000)
the order of the benchmark functions
the element type of the array/slice (tried byte and int)

Enameling answered 29/5, 2015 at 8:49 Comment(2)

There are several factors which influence such micro benchmarks: Bounds checking, allocation (heap, stack), code generation, peephole optimizations, etc. Have a look at the generated assembly output under different conditions like disabled bound checking or disabled optimizations. – Penuchle 29/5, 2015 at 9:28

Interestingly enough, if you add a pointer to array to the benchmarks, you'll see that they perform about the same as slices. – Potluck 29/5, 2015 at 10:9

Comparing the amd64 assembly of both BenchmarkArrayLocal and BenchmarkSliceLocal (too long to fit in this post):

The array version loads the address of a from memory multiple times, practically on every array-access operation:

LEAQ    "".a+1000(SP),BX

Whereas the slice version is computing exclusively on registers after loading once from memory:

LEAQ    (DX)(SI*1),BX

This is not conclusive but probably the cause. Reason being that both methods are otherwise virtually identical. One other notable detail is that the array version calls into runtime.duffcopy, which is a quite long assembly routine, whereas the slice version doesn't.

Mcclees answered 29/5, 2015 at 9:37 Comment(7)

But runtime.duffcopy() is only invoked once right? Seeing that the benchmark cycle is executed half a million times (N), this shouldn't have any effect on the result. – Enameling 29/5, 2015 at 9:44

Yes, that's what I think as well. – Mcclees 29/5, 2015 at 9:46

Does the almost same result in case of global array and slice come from the fact that in that case the compiled code does not include the "registry" optimization and does the same address-loading from memory? – Enameling 29/5, 2015 at 9:52

Yes, full assembly here. Both global versions load from memory just like the local array version does. – Mcclees 29/5, 2015 at 9:57

Thanks, very helpful. I wonder why no register optimization is used in case of accessing local array in a loop... – Enameling 29/5, 2015 at 10:7

@Enameling looks like a oversight to me. May be it make sense to raise an issue for Go compiler. – Adagietto 19/9, 2016 at 9:22

Those instructions are both LEAs (ALU shift-and-add), the first one adding a sizable offset relative to the stack pointer, the other just adding two registers. a in that case has to be an assemble-time constant, probably one the compiler defined for human readability of access to locals on the stack. – Hau 11/5, 2022 at 17:6

Go version 1.8 can eliminate some range checks so the difference got bigger.

BenchmarkSliceGlobal-4 500000 3220 ns/op BenchmarkArrayGlobal-4 1000000 1287 ns/op BenchmarkSliceLocal-4 1000000 1267 ns/op BenchmarkArrayLocal-4 1000000 1301 ns/op

For arrays I'd recommend to use sizes from powers of two and include a logical and operation. In that way you're sure the compiler eliminates the check. Thus var ga [1024]byte with ga[j & 1023].

Luttrell answered 12/5, 2017 at 16:44 Comment(0)

on go1.18 and M1 the difference much bigger i was sure that array is faster then slice but now i have proof it's not always the case

goos: darwin
goarch: arm64
BenchmarkSliceGlobal-8        926196          1257.0 ns/op         0 B/op          0 allocs/op
BenchmarkArrayGlobal-8       2110324           567.0 ns/op         0 B/op          0 allocs/op
BenchmarkSliceLocal-8        2275382           535.0 ns/op         0 B/op          0 allocs/op
BenchmarkArrayLocal-8        1802491           647.4 ns/op         0 B/op          0 allocs/op```

Fleetwood answered 11/5, 2022 at 16:40 Comment(0)

-1

my results go1.22.2

goos: windows
goarch: amd64
cpu: 12th Gen Intel(R) Core(TM) i5-12500H
BenchmarkSliceGlobal-16       966373          1240 ns/op
BenchmarkArrayGlobal-16      2321634           511.5 ns/op
BenchmarkSliceLocal-16       4257007           289.0 ns/op
BenchmarkArrayLocal-16       2431124           478.2 ns/op

Prothalamion answered 11/4 at 11:13 Comment(1)

Your answer could be improved with additional supporting information. Please edit to add further details, such as citations or documentation, so that others can confirm that your answer is correct. You can find more information on how to write good answers in the help center. – Iconography 16/4 at 3:52

Recommended topics

Hot tags