System.arraycopy with constant length

# JMH 1.11 (released 17 days ago) # VM version: JDK 1.8.0_05, VM 25.5-b02 # VM invoker: /Library/Java/JavaVirtualMachines/jdk1.8.0_05.jdk/Contents/Home/jre/bin/java # VM options: -Dfile.encoding=UTF-8 -Duser.country=FR -Duser.language=fr -Duser.variant # Warmup: 20 iterations, 1 s each # Measurement: 20 iterations, 1 s each # Timeout: 10 min per iteration # Threads: 1 thread, will synchronize iterations # Benchmark mode: Throughput, ops/time Benchmark Mode Cnt Score Error Units ArrayCopyBenchmark.ArraysCopyOf thrpt 20 67100500,319 ± 455252,537 ops/s ArrayCopyBenchmark.ArraysCopyOf_Class thrpt 20 65246374,290 ± 976481,330 ops/s ArrayCopyBenchmark.ArraysCopyOf_Class_ConstantSize thrpt 20 65068143,162 ± 1597390,531 ops/s ArrayCopyBenchmark.ArraysCopyOf_ConstantSize thrpt 20 64463603,462 ± 953946,811 ops/s ArrayCopyBenchmark.Clone thrpt 20 64837239,393 ± 834353,404 ops/s ArrayCopyBenchmark.Loop thrpt 20 21070422,097 ± 112595,764 ops/s ArrayCopyBenchmark.Loop_ConstantSize thrpt 20 24458867,274 ± 181486,291 ops/s ArrayCopyBenchmark.SystemArrayCopy thrpt 20 66688368,490 ± 582416,954 ops/s ArrayCopyBenchmark.SystemArrayCopy_ConstantSize thrpt 20 48992312,357 ± 298807,039 ops/s

import java.util.Arrays; import java.util.concurrent.TimeUnit; import org.openjdk.jmh.annotations.Benchmark; import org.openjdk.jmh.annotations.BenchmarkMode; import org.openjdk.jmh.annotations.Mode; import org.openjdk.jmh.annotations.OutputTimeUnit; import org.openjdk.jmh.annotations.Scope; import org.openjdk.jmh.annotations.Setup; import org.openjdk.jmh.annotations.State; @State(Scope.Benchmark) @BenchmarkMode(Mode.Throughput) @OutputTimeUnit(TimeUnit.SECONDS) public class ArrayCopyBenchmark { private static final int LENGTH = 32; private Object[] array; @Setup public void before() { array = new Object[LENGTH]; for (int i = 0; i < LENGTH; i++) { array[i] = new Object(); } } @Benchmark public Object[] Clone() { Object[] src = this.array; return src.clone(); } @Benchmark public Object[] ArraysCopyOf() { Object[] src = this.array; return Arrays.copyOf(src, src.length); } @Benchmark public Object[] ArraysCopyOf_ConstantSize() { Object[] src = this.array; return Arrays.copyOf(src, LENGTH); } @Benchmark public Object[] ArraysCopyOf_Class() { Object[] src = this.array; return Arrays.copyOf(src, src.length, Object[].class); } @Benchmark public Object[] ArraysCopyOf_Class_ConstantSize() { Object[] src = this.array; return Arrays.copyOf(src, LENGTH, Object[].class); } @Benchmark public Object[] SystemArrayCopy() { Object[] src = this.array; int length = src.length; Object[] array = new Object[length]; System.arraycopy(src, 0, array, 0, length); return array; } @Benchmark public Object[] SystemArrayCopy_ConstantSize() { Object[] src = this.array; Object[] array = new Object[LENGTH]; System.arraycopy(src, 0, array, 0, LENGTH); return array; } @Benchmark public Object[] Loop() { Object[] src = this.array; int length = src.length; Object[] array = new Object[length]; for (int i = 0; i < length; i++) { array[i] = src[i]; } return array; } @Benchmark public Object[] Loop_ConstantSize() { Object[] src = this.array; Object[] array = new Object[LENGTH]; for (int i = 0; i < LENGTH; i++) { array[i] = src[i]; } return array; } }

As usual, these kind of questions are quickly answered by studying the generated code. JMH provides you with -prof perfasm on Linux, and -prof xperfasm on Windows. If you run the benchmark on JDK 8u40, then you will see (note I used -bm avgt -tu ns to make scores more comprehensible):

Benchmark                         Mode  Cnt   Score   Error  Units
ACB.SystemArrayCopy               avgt   25  13.294 ± 0.052  ns/op
ACB.SystemArrayCopy_ConstantSize  avgt   25  16.413 ± 0.080  ns/op

Why are these benchmarks perform differently? Let's first do -prof perfnorm to dissect (I dropped the lines that do not matter):

Benchmark                                     Mode  Cnt    Score    Error  Units
ACB.SAC                                       avgt   25   13.466 ±  0.070  ns/op
ACB.SAC:·CPI                                  avgt    5    0.602 ±  0.025   #/op
ACB.SAC:·L1-dcache-load-misses                avgt    5    2.346 ±  0.239   #/op
ACB.SAC:·L1-dcache-loads                      avgt    5   24.756 ±  1.438   #/op
ACB.SAC:·L1-dcache-store-misses               avgt    5    2.404 ±  0.129   #/op
ACB.SAC:·L1-dcache-stores                     avgt    5   14.929 ±  0.230   #/op
ACB.SAC:·LLC-loads                            avgt    5    2.151 ±  0.217   #/op
ACB.SAC:·branches                             avgt    5   17.795 ±  1.003   #/op
ACB.SAC:·cycles                               avgt    5   56.677 ±  3.187   #/op
ACB.SAC:·instructions                         avgt    5   94.145 ±  6.442   #/op

ACB.SAC_ConstantSize                          avgt   25   16.447 ±  0.084  ns/op
ACB.SAC_ConstantSize:·CPI                     avgt    5    0.637 ±  0.016   #/op
ACB.SAC_ConstantSize:·L1-dcache-load-misses   avgt    5    2.357 ±  0.206   #/op
ACB.SAC_ConstantSize:·L1-dcache-loads         avgt    5   25.611 ±  1.482   #/op
ACB.SAC_ConstantSize:·L1-dcache-store-misses  avgt    5    2.368 ±  0.123   #/op
ACB.SAC_ConstantSize:·L1-dcache-stores        avgt    5   25.593 ±  1.610   #/op
ACB.SAC_ConstantSize:·LLC-loads               avgt    5    1.050 ±  0.038   #/op
ACB.SAC_ConstantSize:·branches                avgt    5   17.853 ±  0.697   #/op
ACB.SAC_ConstantSize:·cycles                  avgt    5   66.680 ±  2.049   #/op
ACB.SAC_ConstantSize:·instructions            avgt    5  104.759 ±  4.831   #/op

So, ConstantSize somehow does more L1-dcache-stores, but one less LLC-load. Hm, so that's what we are looking for, more stores in the constant case. -prof perfasm conveniently highlights the hot parts in assembly:

default:

  4.32%    6.36%   0x00007f7714bda2dc: movq   $0x1,(%rax)            ; alloc
  0.09%    0.04%   0x00007f7714bda2e3: prefetchnta 0x100(%r9)
  2.95%    1.48%   0x00007f7714bda2eb: movl   $0xf80022a9,0x8(%rax)
  0.38%    0.18%   0x00007f7714bda2f2: mov    %r11d,0xc(%rax)
  1.56%    3.02%   0x00007f7714bda2f6: prefetchnta 0x140(%r9)
  4.73%    2.71%   0x00007f7714bda2fe: prefetchnta 0x180(%r9)

ConstantSize:

  0.58%    1.22%   0x00007facf921132b: movq   $0x1,(%r14)            ; alloc
  0.84%    0.72%   0x00007facf9211332: prefetchnta 0xc0(%r10)
  0.11%    0.13%   0x00007facf921133a: movl   $0xf80022a9,0x8(%r14)
  0.21%    0.68%   0x00007facf9211342: prefetchnta 0x100(%r10)
  0.50%    0.87%   0x00007facf921134a: movl   $0x20,0xc(%r14)
  0.53%    0.82%   0x00007facf9211352: mov    $0x10,%ecx
  0.04%    0.14%   0x00007facf9211357: xor    %rax,%rax
  0.34%    0.76%   0x00007facf921135a: shl    $0x3,%rcx
  0.50%    1.17%   0x00007facf921135e: rex.W rep stos %al,%es:(%rdi) ; zeroing
 29.49%   52.09%   0x00007facf9211361: prefetchnta 0x140(%r10)
  1.03%    0.53%   0x00007facf9211369: prefetchnta 0x180(%r10)

So there is that pesky rex.W rep stos %al,%es:(%rdi) consuming a significant time. This zeroes the newly allocated array. In ConstantSize test, the JVM could not correlate that you are overwriting the entire target array, and so it had to pre-zero it before diving into the actual array copy.

If you look at the generated code on JDK 9b82 (the latest available), then you will see it folds both patterns in non-zeroed copy, as you can see with -prof perfasm, and can also confirm with -prof perfnorm:

Benchmark                                     Mode  Cnt    Score    Error  Units
ACB.SAC                                       avgt   50   14.156 ±  0.492  ns/op
ACB.SAC:·CPI                                  avgt    5    0.612 ±  0.144   #/op
ACB.SAC:·L1-dcache-load-misses                avgt    5    2.363 ±  0.341   #/op
ACB.SAC:·L1-dcache-loads                      avgt    5   28.350 ±  2.181   #/op
ACB.SAC:·L1-dcache-store-misses               avgt    5    2.287 ±  0.607   #/op
ACB.SAC:·L1-dcache-stores                     avgt    5   16.922 ±  3.402   #/op
ACB.SAC:·branches                             avgt    5   21.242 ±  5.914   #/op
ACB.SAC:·cycles                               avgt    5   67.168 ± 20.950   #/op
ACB.SAC:·instructions                         avgt    5  109.931 ± 35.905   #/op

ACB.SAC_ConstantSize                          avgt   50   13.763 ±  0.067  ns/op
ACB.SAC_ConstantSize:·CPI                     avgt    5    0.625 ±  0.024   #/op
ACB.SAC_ConstantSize:·L1-dcache-load-misses   avgt    5    2.376 ±  0.214   #/op
ACB.SAC_ConstantSize:·L1-dcache-loads         avgt    5   28.285 ±  2.127   #/op
ACB.SAC_ConstantSize:·L1-dcache-store-misses  avgt    5    2.335 ±  0.223   #/op
ACB.SAC_ConstantSize:·L1-dcache-stores        avgt    5   16.926 ±  1.467   #/op
ACB.SAC_ConstantSize:·branches                avgt    5   19.469 ±  0.869   #/op
ACB.SAC_ConstantSize:·cycles                  avgt    5   62.395 ±  3.898   #/op
ACB.SAC_ConstantSize:·instructions            avgt    5   99.891 ±  5.435   #/op

Of course, all these nanobenchmarks for arraycopy are susceptible for weird alignment-induced performance differences in the vectorized copying stubs, but that's another (horror) story, that I don't have courage to tell.

Recommended topics

Hot tags