Direct ByteBuffer relative vs absolute read performance

import java.nio.ByteBuffer;
import java.util.concurrent.TimeUnit;

import org.openjdk.jmh.annotations.*;
import org.openjdk.jmh.runner.Runner;
import org.openjdk.jmh.runner.options.Options;
import org.openjdk.jmh.runner.options.OptionsBuilder;

public class DirectByteBufferReadBenchmark {

    // One record is a long, an int and a byte: 13 bytes in total.
    private static final int OBJ_SIZE = 8 + 4 + 1;
    private static final int NUM_ELEM = 10_000_000;

    @State(Scope.Benchmark)
    public static class Data {

        private ByteBuffer directByteBuffer;

        @Setup
        public void setup() {
            directByteBuffer = ByteBuffer.allocateDirect(OBJ_SIZE * NUM_ELEM);
            for (int i = 0; i < NUM_ELEM; i++) {
                directByteBuffer.putLong(i);
                directByteBuffer.putInt(i);
                directByteBuffer.put((byte) (i & 1));
            }
        }
    }

    // Absolute reads: the index is computed explicitly, position is untouched.
    @Benchmark
    @BenchmarkMode(Mode.Throughput)
    @OutputTimeUnit(TimeUnit.SECONDS)
    public long testReadAbsolute(Data d) throws InterruptedException {
        long val = 0L;
        for (int i = 0; i < NUM_ELEM; i++) {
            int index = OBJ_SIZE * i;
            val += d.directByteBuffer.getLong(index);
            d.directByteBuffer.getInt(index + 8);
            d.directByteBuffer.get(index + 12);
        }
        return val;
    }

    // Relative reads: every get advances the buffer's position.
    @Benchmark
    @BenchmarkMode(Mode.Throughput)
    @OutputTimeUnit(TimeUnit.SECONDS)
    public long testReadRelative(Data d) throws InterruptedException {
        d.directByteBuffer.rewind();
        long val = 0L;
        for (int i = 0; i < NUM_ELEM; i++) {
            val += d.directByteBuffer.getLong();
            d.directByteBuffer.getInt();
            d.directByteBuffer.get();
        }
        return val;
    }

    public static void main(String[] args) throws Exception {
        Options opt = new OptionsBuilder()
                .include(DirectByteBufferReadBenchmark.class.getSimpleName())
                .warmupIterations(5)
                .measurementIterations(5)
                .forks(3)
                .threads(1)
                .build();
        new Runner(opt).run();
    }
}

Benchmark                                        Mode  Cnt   Score   Error  Units
DirectByteBufferReadBenchmark.testReadAbsolute  thrpt   15  88.605 ± 9.276  ops/s
DirectByteBufferReadBenchmark.testReadRelative  thrpt   15  42.904 ± 3.018  ops/s

# JMH 1.13 (released 45 days ago)
# VM version: JDK 9-ea, VM 9-ea+134
# VM invoker: /Library/Java/JavaVirtualMachines/jdk-9.jdk/Contents/Home/bin/java
# VM options: <none>

Benchmark                                        Mode  Cnt    Score    Error  Units
DirectByteBufferReadBenchmark.testReadAbsolute  thrpt   15  102.170 ± 10.199  ops/s
DirectByteBufferReadBenchmark.testReadRelative  thrpt   15   45.988 ±  3.896  ops/s

JDK 8 indeed generates worse code for the loop with relative ByteBuffer access.

JMH has a built-in perfasm profiler that prints the generated assembly code for the hottest regions. I used it to compare the compiled testReadAbsolute and testReadRelative; here are the main differences:

Relative getLong / getInt / get update the position field of the ByteBuffer. The VM does not optimize these updates away: there are 3 memory writes on each loop iteration.
The position range check is not eliminated: the conditional branches remain in the compiled code on each loop iteration.
Since the redundant field updates and range checks make the loop body longer, the VM unrolls only 2 iterations of the loop. The compiled version of the loop with absolute access has 16 iterations unrolled.
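To make the first point concrete, here is a minimal sketch of the state difference between the two access styles, using only java.nio. The class and method names (RelativeVsAbsoluteDemo, fill, positionAfterRelativeRead, positionAfterAbsoluteRead) are made up for illustration and are not part of the benchmark.

```java
import java.nio.ByteBuffer;

class RelativeVsAbsoluteDemo {
    // Same record layout as the benchmark: long + int + byte.
    static final int OBJ_SIZE = 8 + 4 + 1;

    // Hypothetical helper: fills a direct buffer with n records.
    static ByteBuffer fill(int n) {
        ByteBuffer buf = ByteBuffer.allocateDirect(OBJ_SIZE * n);
        for (int i = 0; i < n; i++) {
            buf.putLong(i);
            buf.putInt(i);
            buf.put((byte) (i & 1));
        }
        return buf;
    }

    // Reads one record with relative gets; each call advances position.
    static int positionAfterRelativeRead(ByteBuffer buf) {
        buf.rewind();
        buf.getLong();
        buf.getInt();
        buf.get();
        return buf.position();
    }

    // Reads one record with absolute gets; position is left alone.
    static int positionAfterAbsoluteRead(ByteBuffer buf) {
        buf.rewind();
        buf.getLong(0);
        buf.getInt(8);
        buf.get(12);
        return buf.position();
    }

    public static void main(String[] args) {
        ByteBuffer buf = fill(4);
        System.out.println(positionAfterRelativeRead(buf)); // 13
        System.out.println(positionAfterAbsoluteRead(buf)); // 0
    }
}
```

The relative version has to store a new position three times per record (8, 12, 13), which corresponds to the three memory writes per iteration seen in the JDK 8 assembly.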

testReadAbsolute is compiled very well: the main loop just reads 16 longs, sums them up and jumps to the next iteration if index < 10_000_000 - 16. The state of directByteBuffer is not updated. However, the JVM is not that smart for testReadRelative: it seems unable to optimize away the field accesses of an object it reaches from outside.
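As a rough source-level analogue of what unrolling means here, the sketch below unrolls the absolute-read loop by hand. It is not the JIT's output: the factor of 4 is chosen only to keep the example short (the compiled loop described above uses 16), and the names UnrolledAbsoluteSketch, fill and sumUnrolled are hypothetical.

```java
import java.nio.ByteBuffer;

class UnrolledAbsoluteSketch {
    static final int OBJ_SIZE = 8 + 4 + 1;
    // Illustrative unroll factor; an assumption, not what the JIT picks.
    static final int UNROLL = 4;

    // Hypothetical helper: fills a direct buffer with n records.
    static ByteBuffer fill(int n) {
        ByteBuffer buf = ByteBuffer.allocateDirect(OBJ_SIZE * n);
        for (int i = 0; i < n; i++) {
            buf.putLong(i);
            buf.putInt(i);
            buf.put((byte) (i & 1));
        }
        return buf;
    }

    static long sumUnrolled(ByteBuffer buf, int n) {
        long val = 0L;
        int i = 0;
        // Main loop: UNROLL reads per trip, one loop-bound comparison per trip.
        for (; i <= n - UNROLL; i += UNROLL) {
            int base = OBJ_SIZE * i;
            val += buf.getLong(base)
                 + buf.getLong(base + OBJ_SIZE)
                 + buf.getLong(base + 2 * OBJ_SIZE)
                 + buf.getLong(base + 3 * OBJ_SIZE);
        }
        // Tail loop: handles the records left over when n is not a multiple of UNROLL.
        for (; i < n; i++) {
            val += buf.getLong(OBJ_SIZE * i);
        }
        return val;
    }

    public static void main(String[] args) {
        System.out.println(sumUnrolled(fill(10), 10)); // 45
    }
}
```

Because absolute reads carry no state from one call to the next, the body is just loads and adds, which is what makes deep unrolling worthwhile.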

A lot of work went into optimizing ByteBuffer in JDK 9. I ran the same test on JDK 9-ea b134 and verified that testReadRelative no longer has the redundant memory writes and range checks. It now runs almost as fast as testReadAbsolute.

// JDK 1.8.0_92, VM 25.92-b14

Benchmark                                        Mode  Cnt   Score   Error  Units
DirectByteBufferReadBenchmark.testReadAbsolute  thrpt   10  99,727 ± 0,542  ops/s
DirectByteBufferReadBenchmark.testReadRelative  thrpt   10  47,126 ± 0,289  ops/s

// JDK 9-ea, VM 9-ea+134

Benchmark                                        Mode  Cnt    Score   Error  Units
DirectByteBufferReadBenchmark.testReadAbsolute  thrpt   10  109,369 ± 0,403  ops/s
DirectByteBufferReadBenchmark.testReadRelative  thrpt   10   97,140 ± 0,572  ops/s

UPDATE

To help the JIT compiler with optimization, I introduced a local variable

ByteBuffer directByteBuffer = d.directByteBuffer;

in both benchmarks. Without it, the extra level of indirection does not allow the compiler to eliminate the ByteBuffer.position field updates.
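A minimal sketch of how the read loops might look with that local variable, written without JMH so it runs on its own. The Data class here is a simplified stand-in for the benchmark's @State class, and the numElem parameter replaces the NUM_ELEM constant; both are assumptions made for the example.

```java
import java.nio.ByteBuffer;

class LocalBufferSketch {
    static final int OBJ_SIZE = 8 + 4 + 1;

    // Simplified stand-in for the benchmark's Data state class (assumption).
    static class Data {
        final ByteBuffer directByteBuffer;

        Data(int numElem) {
            directByteBuffer = ByteBuffer.allocateDirect(OBJ_SIZE * numElem);
            for (int i = 0; i < numElem; i++) {
                directByteBuffer.putLong(i);
                directByteBuffer.putInt(i);
                directByteBuffer.put((byte) (i & 1));
            }
        }
    }

    static long readRelative(Data d, int numElem) {
        // Load the field once; the loop then works on a local reference.
        ByteBuffer directByteBuffer = d.directByteBuffer;
        directByteBuffer.rewind();
        long val = 0L;
        for (int i = 0; i < numElem; i++) {
            val += directByteBuffer.getLong();
            directByteBuffer.getInt();
            directByteBuffer.get();
        }
        return val;
    }

    static long readAbsolute(Data d, int numElem) {
        ByteBuffer directByteBuffer = d.directByteBuffer;
        long val = 0L;
        for (int i = 0; i < numElem; i++) {
            int index = OBJ_SIZE * i;
            val += directByteBuffer.getLong(index);
            directByteBuffer.getInt(index + 8);
            directByteBuffer.get(index + 12);
        }
        return val;
    }

    public static void main(String[] args) {
        Data d = new Data(5);
        System.out.println(readRelative(d, 5)); // 10
        System.out.println(readAbsolute(d, 5)); // 10
    }
}
```

Both methods return the same sum; the only change from the original benchmark bodies is that d.directByteBuffer is read once before the loop instead of on every access.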
