Why is a direct memory 'array' slower to clear than a regular Java array?

Asked 17/2, 2017 at 13:27 Answered 17/2, 2017 at 13:45

I've set up a JMH benchmark to measure which is faster: Arrays.fill with null, System.arraycopy from an array of nulls, zeroing a DirectByteBuffer, or zeroing an unsafe memory block, in an attempt to answer this question. Let's put aside that zeroing directly allocated memory is a rare case and discuss the results of my benchmark.

Here's the JMH benchmark snippet (full code available via a gist). It includes the unsafe.setMemory case suggested by @apangin in the original post, as well as byteBuffer.put(byte[], offset, length) and longBuffer.put(long[], offset, length) as suggested by @jan-schaefer:

@Benchmark
@BenchmarkMode(Mode.SampleTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
public void arrayFill() {
    Arrays.fill(objectHolderForFill, null);
}

@Benchmark
@BenchmarkMode(Mode.SampleTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
public void arrayCopy() {
    System.arraycopy(nullsArray, 0, objectHolderForArrayCopy, 0, objectHolderForArrayCopy.length);
}

@Benchmark
@BenchmarkMode(Mode.SampleTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
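public void directByteBufferManualLoop() {
    // Relative putLong advances the buffer position, so the loop only does work
    // if the buffer is rewound between invocations (setup code not shown here).
    while (referenceHolderByteBuffer.hasRemaining()) {
        referenceHolderByteBuffer.putLong(0);
    }
}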

@Benchmark
@BenchmarkMode(Mode.SampleTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
public void directByteBufferBatch() {
    referenceHolderByteBuffer.put(nullBytes, 0, nullBytes.length);
}

@Benchmark
@BenchmarkMode(Mode.SampleTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
public void directLongBufferManualLoop() {
    while (referenceHolderLongBuffer.hasRemaining()) {
        referenceHolderLongBuffer.put(0L);
    }
}

@Benchmark
@BenchmarkMode(Mode.SampleTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
public void directLongBufferBatch() {
    referenceHolderLongBuffer.put(nullLongs, 0, nullLongs.length);
}


@Benchmark
@BenchmarkMode(Mode.SampleTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
public void unsafeArrayManualLoop() {
    long addr = referenceHolderUnsafe;
    long pos = 0;
    for (int i = 0; i < size; i++) {
        unsafe.putLong(addr + pos, 0L);
        pos += 1 << 3; // advance by 8 bytes, the size of one long
    }
}

@Benchmark
@BenchmarkMode(Mode.SampleTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
public void unsafeArraySetMemory() {
    unsafe.setMemory(referenceHolderUnsafe, size*8, (byte) 0);
}

Here's what I got (Java 1.8, JMH 1.13, Core i3-6100U 2.30 GHz, Win10):

100 elements
Benchmark                                       Mode      Cnt   Score   Error    Units
ArrayNullFillBench.arrayCopy                   sample  5234029  39,518 ± 0,991    ns/op
ArrayNullFillBench.directByteBufferBatch       sample  6271334  43,646 ± 1,523    ns/op
ArrayNullFillBench.directLongBufferBatch       sample  4615974  45,252 ± 2,352    ns/op
ArrayNullFillBench.arrayFill                   sample  4745406  76,997 ± 3,547    ns/op
ArrayNullFillBench.unsafeArrayManualLoop       sample  5980381  78,811 ± 2,870    ns/op
ArrayNullFillBench.unsafeArraySetMemory        sample  5985884  85,062 ± 2,096    ns/op
ArrayNullFillBench.directLongBufferManualLoop  sample  4697023  116,242 ± 2,579   ns/op WOW
ArrayNullFillBench.directByteBufferManualLoop  sample  7504629  208,440 ± 10,651  ns/op WOW

I skipped all the loop implementations (except arrayFill, kept for scale) in the further tests.

1000 elements
Benchmark                                 Mode      Cnt    Score   Error    Units
ArrayNullFillBench.arrayCopy              sample  6780681  184,516 ± 14,036  ns/op
ArrayNullFillBench.directLongBufferBatch  sample  4018778  293,325 ± 4,074   ns/op
ArrayNullFillBench.directByteBufferBatch  sample  4063969  313,171 ± 4,861   ns/op
ArrayNullFillBench.arrayFill              sample  6862928  518,886 ± 6,372   ns/op

10000 elements
Benchmark                                 Mode      Cnt     Score   Error    Units
ArrayNullFillBench.arrayCopy              sample  2551851  2024,543 ± 12,533  ns/op
ArrayNullFillBench.directLongBufferBatch  sample  2958517  4469,210 ± 10,376  ns/op
ArrayNullFillBench.directByteBufferBatch  sample  2892258  4526,945 ± 33,443  ns/op
ArrayNullFillBench.arrayFill              sample  5689507  5028,592 ± 9,074   ns/op

Could you please help me answer the following questions:

1. Why is `unsafeArraySetMemory` a bit slower than `unsafeArrayManualLoop`?
2. Why is directByteBuffer 2.5X-5X slower than the others?

Vasiliki asked 17/2, 2017 at 13:27 Comment(0)

Why is unsafeArraySetMemory a bit slower than unsafeArrayManualLoop?

My guess is that it is not as well optimised for setting an exact multiple of longs. It has to check whether the length is something that is not quite a multiple of 8.

Why is directByteBuffer an order of magnitude slower than the others?

An order of magnitude would be around 10x; it is about 2.5x slower. It has to bounds-check every access and update a field instead of a local variable.
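To illustrate the "field instead of a local variable" point, here is a minimal sketch (not taken from the benchmark; the class and method names are hypothetical). It contrasts relative putLong, which advances the buffer's position field on every call, with absolute putLong(index, value), which keeps the index in a local variable. Both variants are still bounds-checked, so this only isolates the position update:

```java
import java.nio.ByteBuffer;

public class RelativeVsAbsolutePut {

    /** Zeroes the buffer with relative puts; every call advances the position field. */
    public static int zeroRelative(ByteBuffer buf) {
        buf.clear(); // position = 0, limit = capacity (contents are untouched)
        while (buf.remaining() >= Long.BYTES) {
            buf.putLong(0L);
        }
        return buf.position();
    }

    /** Zeroes the buffer with absolute puts; the index lives in a local variable. */
    public static int zeroAbsolute(ByteBuffer buf) {
        buf.clear();
        int end = buf.limit() - (Long.BYTES - 1); // last index where a full long still fits
        for (int i = 0; i < end; i += Long.BYTES) {
            buf.putLong(i, 0L); // does not touch the position
        }
        return buf.position();
    }

    public static void main(String[] args) {
        ByteBuffer buf = ByteBuffer.allocateDirect(100 * Long.BYTES);
        System.out.println(zeroRelative(buf)); // prints 800: position moved to the end
        System.out.println(zeroAbsolute(buf)); // prints 0: position was never advanced
    }
}
```

Whether the absolute variant is actually faster depends on the JIT and would have to be measured in the same JMH setup.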

NOTE: I have found that the JVM doesn't always unroll loops in code that uses Unsafe. You might try unrolling the loop yourself to see if it helps.
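A minimal sketch of what unrolling by hand could look like. For portability it uses absolute ByteBuffer.putLong on a direct buffer rather than Unsafe.putLong, which is a substitution and not what the benchmark does; the class name, method name and the unroll factor of 4 are hypothetical:

```java
import java.nio.ByteBuffer;

public class UnrolledZeroing {

    /** Zeroes the first longCount longs of buf, four stores per loop iteration. */
    public static void zeroUnrolled(ByteBuffer buf, int longCount) {
        int i = 0;
        int unrolledEnd = longCount - (longCount % 4); // largest multiple of 4 <= longCount
        for (; i < unrolledEnd; i += 4) {
            int base = i * Long.BYTES;
            buf.putLong(base, 0L);
            buf.putLong(base + Long.BYTES, 0L);
            buf.putLong(base + 2 * Long.BYTES, 0L);
            buf.putLong(base + 3 * Long.BYTES, 0L);
        }
        // Tail loop for the 0..3 longs that did not fit into a group of four.
        for (; i < longCount; i++) {
            buf.putLong(i * Long.BYTES, 0L);
        }
    }

    public static void main(String[] args) {
        ByteBuffer buf = ByteBuffer.allocateDirect(10 * Long.BYTES);
        zeroUnrolled(buf, 10);
        System.out.println(buf.getLong(9 * Long.BYTES));
    }
}
```

The same shape applies to an Unsafe-based loop by replacing each putLong with a store at addr + offset; whether it helps is something only the benchmark can tell.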

NOTE: Native code can use 128-bit XMM instructions, and increasingly does, which is why the copy might be so fast. Access to XMM instructions may come in Java 10.

Idona answered 17/2, 2017 at 13:37 Comment(1)

If you don't mind, I've corrected my post to remove this order-of-magnitude mess :-) Thanks for pointing it out, I've marked your answer as useful. – Vasiliki 17/2, 2017 at 13:48

The comparison is a bit unfair. You are using a single operation with Arrays.fill and System.arraycopy, but a loop with multiple invocations of putLong in the DirectByteBuffer case. If you look at the implementation of putLong, you will see that there is a lot going on there, such as checking accessibility. You should try a batch operation like put(long[] src, int srcOffset, int longCount) and see what happens.
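A minimal sketch of such a batch operation, assuming a LongBuffer view over a direct ByteBuffer and a preallocated array of zeros (the class and method names are hypothetical, and the benchmark's own setup lives in the gist). The bulk put is relative, so the buffer is cleared first; otherwise a second call would throw BufferOverflowException:

```java
import java.nio.ByteBuffer;
import java.nio.LongBuffer;

public class BatchZeroing {

    /** Overwrites the whole target with zeros using a single bulk put. */
    public static LongBuffer zeroBatch(LongBuffer target, long[] zeros) {
        target.clear();                     // rewind: position = 0, limit = capacity
        target.put(zeros, 0, zeros.length); // one bulk copy instead of one call per long
        return target;
    }

    public static void main(String[] args) {
        int size = 1000;
        LongBuffer target = ByteBuffer.allocateDirect(size * Long.BYTES).asLongBuffer();
        long[] zeros = new long[size];      // Java arrays are zero-initialised

        zeroBatch(target, zeros);
        zeroBatch(target, zeros);           // safe to repeat because of clear()
        System.out.println(target.position()); // prints 1000
    }
}
```

The zeros array has to be at most as long as the buffer's capacity, which is why it is allocated once, up front, with the same size.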

Unproductive answered 17/2, 2017 at 13:45 Comment(2)

Thanks, I'll add the batch operation case as well. Arrays.fill uses the same kind of loop underneath, though, and unsafe.setMemory is kind of a batch operation too. – Vasiliki 17/2, 2017 at 13:52

I've just added the case you suggested (both ByteBuffer and LongBuffer versions, just in case). It seems that even batch operations on a direct buffer are still slower than System.arraycopy, and they tend to get closer to Arrays.fill at larger array sizes. – Vasiliki 20/2, 2017 at 12:32
