System.arraycopy with constant length
Asked Answered
L

1

14

I'm playing around with JMH ( http://openjdk.java.net/projects/code-tools/jmh/ ) and I just stumbled on a strange result.

I'm benchmarking ways to make a shallow copy of an array and I can observe the expected results (that looping through the array is a bad idea and that there is no significant difference between #clone(), System#arraycopy() and Arrays#copyOf(), performance-wise).

Except that System#arraycopy() is one-quarter slower when the array's length is hard-coded... Wait, what ? How can this be slower ?

Does anyone has an idea of what could be the cause ?

The results (throughput):

# JMH 1.11 (released 17 days ago)
# VM version: JDK 1.8.0_05, VM 25.5-b02
# VM invoker: /Library/Java/JavaVirtualMachines/jdk1.8.0_05.jdk/Contents/Home/jre/bin/java
# VM options: -Dfile.encoding=UTF-8 -Duser.country=FR -Duser.language=fr -Duser.variant
# Warmup: 20 iterations, 1 s each
# Measurement: 20 iterations, 1 s each
# Timeout: 10 min per iteration
# Threads: 1 thread, will synchronize iterations
# Benchmark mode: Throughput, ops/time

Benchmark                                            Mode  Cnt         Score         Error  Units
ArrayCopyBenchmark.ArraysCopyOf                     thrpt   20  67100500,319 ±  455252,537  ops/s
ArrayCopyBenchmark.ArraysCopyOf_Class               thrpt   20  65246374,290 ±  976481,330  ops/s
ArrayCopyBenchmark.ArraysCopyOf_Class_ConstantSize  thrpt   20  65068143,162 ± 1597390,531  ops/s
ArrayCopyBenchmark.ArraysCopyOf_ConstantSize        thrpt   20  64463603,462 ±  953946,811  ops/s
ArrayCopyBenchmark.Clone                            thrpt   20  64837239,393 ±  834353,404  ops/s
ArrayCopyBenchmark.Loop                             thrpt   20  21070422,097 ±  112595,764  ops/s
ArrayCopyBenchmark.Loop_ConstantSize                thrpt   20  24458867,274 ±  181486,291  ops/s
ArrayCopyBenchmark.SystemArrayCopy                  thrpt   20  66688368,490 ±  582416,954  ops/s
ArrayCopyBenchmark.SystemArrayCopy_ConstantSize     thrpt   20  48992312,357 ±  298807,039  ops/s

And the benchmark class:

import java.util.Arrays;
import java.util.concurrent.TimeUnit;

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.OutputTimeUnit;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;

@State(Scope.Benchmark)
@BenchmarkMode(Mode.Throughput)
@OutputTimeUnit(TimeUnit.SECONDS)
public class ArrayCopyBenchmark {

    private static final int LENGTH = 32;

    private Object[] array;

    @Setup
    public void before() {
        array = new Object[LENGTH];
        for (int i = 0; i < LENGTH; i++) {
            array[i] = new Object();
        }
    }

    @Benchmark
    public Object[] Clone() {
        Object[] src = this.array;
        return src.clone();
    }

    @Benchmark
    public Object[] ArraysCopyOf() {
        Object[] src = this.array;
        return Arrays.copyOf(src, src.length);
    }

    @Benchmark
    public Object[] ArraysCopyOf_ConstantSize() {
        Object[] src = this.array;
        return Arrays.copyOf(src, LENGTH);
    }

    @Benchmark
    public Object[] ArraysCopyOf_Class() {
        Object[] src = this.array;
        return Arrays.copyOf(src, src.length, Object[].class);
    }

    @Benchmark
    public Object[] ArraysCopyOf_Class_ConstantSize() {
        Object[] src = this.array;
        return Arrays.copyOf(src, LENGTH, Object[].class);
    }

    @Benchmark
    public Object[] SystemArrayCopy() {
        Object[] src = this.array;
        int length = src.length;
        Object[] array = new Object[length];
        System.arraycopy(src, 0, array, 0, length);
        return array;
    }

    @Benchmark
    public Object[] SystemArrayCopy_ConstantSize() {
        Object[] src = this.array;
        Object[] array = new Object[LENGTH];
        System.arraycopy(src, 0, array, 0, LENGTH);
        return array;
    }

    @Benchmark
    public Object[] Loop() {
        Object[] src = this.array;
        int length = src.length;
        Object[] array = new Object[length];
        for (int i = 0; i < length; i++) {
            array[i] = src[i];
        }
        return array;
    }

    @Benchmark
    public Object[] Loop_ConstantSize() {
        Object[] src = this.array;
        Object[] array = new Object[LENGTH];
        for (int i = 0; i < LENGTH; i++) {
            array[i] = src[i];
        }
        return array;
    }
}
Lucie answered 29/9, 2015 at 2:45 Comment(7)
weird, I thought Arrays.copyof internally called System.arraycopy meaning it should have a lower throughput by definition, how many times did you try running this benchmark?Sagerman
Try increasing your warmup. 20 is not enough to kickstart JIT.Fredericksburg
@Sagerman 5 times. It didn't change the results much.Lucie
@Fredericksburg These are the default settings for JMH. It's 20 iterations of 1s, each resulting in (judging the results) tens of millions of calls.Lucie
I tried changing the order the benchmarks are run, updated the JVM to a more recent (1.8.0_60), removing all benchmarks except for the System#arraycopy() ones. I still get the same results.Lucie
Win7 x64, Java 1.8.0_u60, result is not confirmed (all CopyOf*, Clone and both ArrayCopy* tests show the thoughput like 27M/s on my laptop). Probably the result is Mac-specific...Joachima
@TagirValeev Maybe. I don't have a Win or Linux box to test this on.Lucie
G
10

As usual, these kind of questions are quickly answered by studying the generated code. JMH provides you with -prof perfasm on Linux, and -prof xperfasm on Windows. If you run the benchmark on JDK 8u40, then you will see (note I used -bm avgt -tu ns to make scores more comprehensible):

Benchmark                         Mode  Cnt   Score   Error  Units
ACB.SystemArrayCopy               avgt   25  13.294 ± 0.052  ns/op
ACB.SystemArrayCopy_ConstantSize  avgt   25  16.413 ± 0.080  ns/op

Why are these benchmarks perform differently? Let's first do -prof perfnorm to dissect (I dropped the lines that do not matter):

Benchmark                                     Mode  Cnt    Score    Error  Units
ACB.SAC                                       avgt   25   13.466 ±  0.070  ns/op
ACB.SAC:·CPI                                  avgt    5    0.602 ±  0.025   #/op
ACB.SAC:·L1-dcache-load-misses                avgt    5    2.346 ±  0.239   #/op
ACB.SAC:·L1-dcache-loads                      avgt    5   24.756 ±  1.438   #/op
ACB.SAC:·L1-dcache-store-misses               avgt    5    2.404 ±  0.129   #/op
ACB.SAC:·L1-dcache-stores                     avgt    5   14.929 ±  0.230   #/op
ACB.SAC:·LLC-loads                            avgt    5    2.151 ±  0.217   #/op
ACB.SAC:·branches                             avgt    5   17.795 ±  1.003   #/op
ACB.SAC:·cycles                               avgt    5   56.677 ±  3.187   #/op
ACB.SAC:·instructions                         avgt    5   94.145 ±  6.442   #/op

ACB.SAC_ConstantSize                          avgt   25   16.447 ±  0.084  ns/op
ACB.SAC_ConstantSize:·CPI                     avgt    5    0.637 ±  0.016   #/op
ACB.SAC_ConstantSize:·L1-dcache-load-misses   avgt    5    2.357 ±  0.206   #/op
ACB.SAC_ConstantSize:·L1-dcache-loads         avgt    5   25.611 ±  1.482   #/op
ACB.SAC_ConstantSize:·L1-dcache-store-misses  avgt    5    2.368 ±  0.123   #/op
ACB.SAC_ConstantSize:·L1-dcache-stores        avgt    5   25.593 ±  1.610   #/op
ACB.SAC_ConstantSize:·LLC-loads               avgt    5    1.050 ±  0.038   #/op
ACB.SAC_ConstantSize:·branches                avgt    5   17.853 ±  0.697   #/op
ACB.SAC_ConstantSize:·cycles                  avgt    5   66.680 ±  2.049   #/op
ACB.SAC_ConstantSize:·instructions            avgt    5  104.759 ±  4.831   #/op

So, ConstantSize somehow does more L1-dcache-stores, but one less LLC-load. Hm, so that's what we are looking for, more stores in the constant case. -prof perfasm conveniently highlights the hot parts in assembly:

default:

  4.32%    6.36%   0x00007f7714bda2dc: movq   $0x1,(%rax)            ; alloc
  0.09%    0.04%   0x00007f7714bda2e3: prefetchnta 0x100(%r9)
  2.95%    1.48%   0x00007f7714bda2eb: movl   $0xf80022a9,0x8(%rax)
  0.38%    0.18%   0x00007f7714bda2f2: mov    %r11d,0xc(%rax)
  1.56%    3.02%   0x00007f7714bda2f6: prefetchnta 0x140(%r9)
  4.73%    2.71%   0x00007f7714bda2fe: prefetchnta 0x180(%r9)

ConstantSize:

  0.58%    1.22%   0x00007facf921132b: movq   $0x1,(%r14)            ; alloc
  0.84%    0.72%   0x00007facf9211332: prefetchnta 0xc0(%r10)
  0.11%    0.13%   0x00007facf921133a: movl   $0xf80022a9,0x8(%r14)
  0.21%    0.68%   0x00007facf9211342: prefetchnta 0x100(%r10)
  0.50%    0.87%   0x00007facf921134a: movl   $0x20,0xc(%r14)
  0.53%    0.82%   0x00007facf9211352: mov    $0x10,%ecx
  0.04%    0.14%   0x00007facf9211357: xor    %rax,%rax
  0.34%    0.76%   0x00007facf921135a: shl    $0x3,%rcx
  0.50%    1.17%   0x00007facf921135e: rex.W rep stos %al,%es:(%rdi) ; zeroing
 29.49%   52.09%   0x00007facf9211361: prefetchnta 0x140(%r10)
  1.03%    0.53%   0x00007facf9211369: prefetchnta 0x180(%r10)  

So there is that pesky rex.W rep stos %al,%es:(%rdi) consuming a significant time. This zeroes the newly allocated array. In ConstantSize test, the JVM could not correlate that you are overwriting the entire target array, and so it had to pre-zero it before diving into the actual array copy.

If you look at the generated code on JDK 9b82 (the latest available), then you will see it folds both patterns in non-zeroed copy, as you can see with -prof perfasm, and can also confirm with -prof perfnorm:

Benchmark                                     Mode  Cnt    Score    Error  Units
ACB.SAC                                       avgt   50   14.156 ±  0.492  ns/op
ACB.SAC:·CPI                                  avgt    5    0.612 ±  0.144   #/op
ACB.SAC:·L1-dcache-load-misses                avgt    5    2.363 ±  0.341   #/op
ACB.SAC:·L1-dcache-loads                      avgt    5   28.350 ±  2.181   #/op
ACB.SAC:·L1-dcache-store-misses               avgt    5    2.287 ±  0.607   #/op
ACB.SAC:·L1-dcache-stores                     avgt    5   16.922 ±  3.402   #/op
ACB.SAC:·branches                             avgt    5   21.242 ±  5.914   #/op
ACB.SAC:·cycles                               avgt    5   67.168 ± 20.950   #/op
ACB.SAC:·instructions                         avgt    5  109.931 ± 35.905   #/op

ACB.SAC_ConstantSize                          avgt   50   13.763 ±  0.067  ns/op
ACB.SAC_ConstantSize:·CPI                     avgt    5    0.625 ±  0.024   #/op
ACB.SAC_ConstantSize:·L1-dcache-load-misses   avgt    5    2.376 ±  0.214   #/op
ACB.SAC_ConstantSize:·L1-dcache-loads         avgt    5   28.285 ±  2.127   #/op
ACB.SAC_ConstantSize:·L1-dcache-store-misses  avgt    5    2.335 ±  0.223   #/op
ACB.SAC_ConstantSize:·L1-dcache-stores        avgt    5   16.926 ±  1.467   #/op
ACB.SAC_ConstantSize:·branches                avgt    5   19.469 ±  0.869   #/op
ACB.SAC_ConstantSize:·cycles                  avgt    5   62.395 ±  3.898   #/op
ACB.SAC_ConstantSize:·instructions            avgt    5   99.891 ±  5.435   #/op

Of course, all these nanobenchmarks for arraycopy are susceptible for weird alignment-induced performance differences in the vectorized copying stubs, but that's another (horror) story, that I don't have courage to tell.

Gestation answered 30/9, 2015 at 14:6 Comment(3)
Thank you for this great answer. I suspected that the JVM couldn't infer that I was using the full length of the array, but I didn't know how to prove it.Lucie
I tried to duplicate your results, but I think the perfasm/xperfasm is not available for Mac OSX.Lucie
Yes, perfasm is available on Linux, xperfasm is available on Windows. If there is a way to gain the data which program addresses are hot on Mac OS X, we can implement the same there. Patches welcome.Gestation

© 2022 - 2024 — McMap. All rights reserved.