Java lock-free performance JMH

@State(Scope.Benchmark) @BenchmarkMode(Mode.Throughput) @OutputTimeUnit(TimeUnit.MICROSECONDS) @Fork(value = 1, jvmArgsAppend = { "-Xmx512m", "-server", "-XX:+AggressiveOpts","-XX:+UnlockDiagnosticVMOptions", "-XX:+UnlockExperimentalVMOptions", "-XX:+PrintAssembly", "-XX:PrintAssemblyOptions=intel", "-XX:+PrintSignatureHandlers"}) @Measurement(iterations = 5, time = 5, timeUnit = TimeUnit.SECONDS) @Warmup(iterations = 3, time = 2, timeUnit = TimeUnit.SECONDS) public class LinkedQueueBenchmark { private static final Unsafe unsafe = UnsafeProvider.getUnsafe(); private static final long offsetObject; private static final long offsetNext; private static final int THREADS = 5; private static class Node { private volatile Node next; public Node() {} } static { try { offsetObject = unsafe.objectFieldOffset(LinkedQueueBenchmark.class.getDeclaredField("object")); offsetNext = unsafe.objectFieldOffset(Node.class.getDeclaredField("next")); } catch (Exception ex) { throw new Error(ex); } } protected long t0,t1,t2,t3,t4,t5,t6,t7; private volatile Node object = new Node(null); @Threads(THREADS) @Benchmark public Node doTestCasSmart() { Node current, o = new Node(); for(;;) { current = this.object; if (unsafe.compareAndSwapObject(this, offsetObject, current, o)) { //current.next = o; //Special line: break; } else { LockSupport.parkNanos(1); } } return current; } }

....[Hottest Regions]............................................................................... 25.92% 17.93% [0x7f1d5105fe60:0x7f1d5105fe69] in SpinPause (libjvm.so) 17.53% 20.62% [0x7f1d5119dd88:0x7f1d5119de57] in ParMarkBitMap::live_words_in_range(HeapWord*, oopDesc*) const (libjvm.so) 10.81% 6.30% [0x7f1d5129cff5:0x7f1d5129d0ed] in ParallelTaskTerminator::offer_termination(TerminatorTerminator*) (libjvm.so) 7.99% 9.86% [0x7f1d3c51d280:0x7f1d3c51d3a2] in com.jad.generated.LinkedQueueBenchmark_doTestCasSmart::doTestCasSmart_thrpt_jmhStub

Rule: Instead of looking for "advanced" answers, you should first look for stupid mistakes.

SpinPause, ParMarkBitMap::live_words_in_range(HeapWord*, oopDesc*) and ParallelTaskTerminator::offer_termination(TerminatorTerminator*) are from GC threads. Which may very well mean most of the work benchmark does is GC. Indeed, running the "special line" uncommented with -prof gc yields:

# Run complete. Total time: 00:00:43

Benchmark                      Mode  Cnt      Score    Error   Units
LQB.doTestCasSmart            thrpt    5      5.930 ±  3.867  ops/us
LQB.doTestCasSmart:·gc.time   thrpt    5  29970.000               ms

Therefore, out of 43 seconds for the run, you've spent 30 seconds doing GC. Or, even the plain -verbose:gc shows it:

Iteration   3: [Full GC (Ergonomics)  408188K->1542K(454656K), 0.0043022 secs]
[GC (Allocation Failure)  60422K->60174K(454656K), 0.2061024 secs]
[GC (Allocation Failure)  119054K->118830K(454656K), 0.2314572 secs]
[GC (Allocation Failure)  177710K->177430K(454656K), 0.2268396 secs]
[GC (Allocation Failure)  236310K->236054K(454656K), 0.1718049 secs]
[GC (Allocation Failure)  294934K->294566K(454656K), 0.2265855 secs]
[Full GC (Ergonomics)  294566K->147408K(466432K), 0.7139546 secs]
[GC (Allocation Failure)  206288K->205880K(466432K), 0.2065388 secs]
[GC (Allocation Failure)  264760K->264312K(466432K), 0.2314117 secs]
[GC (Allocation Failure)  323192K->323016K(466432K), 0.2183271 secs]
[Full GC (Ergonomics)  323016K->322663K(466432K), 2.8058725 secs]

2.8s full GCs, that sucks. Around 5s spent in GC, in a iteration that is bounded by 5s of run time. That sucks too.

Why is it? Well, you are building the linked list there. Granted, the head of the queue is unreachable, and everything from the head to your object should be collected. But the collection is not instantaneous. The longer the queue gets, the more memory it consumes, the more work for GC to traverse it. That's a positive feedback loop that cripples down the execution. Since there the queue elements are collectable anyhow, this feedback loop never reaches OOME. Storing an intial object in a new head field will make the test to eventually OOME.

Therefore, your question has nothing to do with either putOrdered, or memory barriers, or queue performance, to be frank. I think you need to reconsider what do you actually test. Designing the test so that the transient memory footprint stays the same on each @Benchmark call is an art in itself.

Recommended topics

Hot tags