Java lock-free performance JMH

I have a JMH multi-thread test:

import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.LockSupport;

import org.openjdk.jmh.annotations.*;

import sun.misc.Unsafe;

@State(Scope.Benchmark)
@BenchmarkMode(Mode.Throughput)
@OutputTimeUnit(TimeUnit.MICROSECONDS)
@Fork(value = 1, jvmArgsAppend = { "-Xmx512m", "-server", "-XX:+AggressiveOpts", "-XX:+UnlockDiagnosticVMOptions",
        "-XX:+UnlockExperimentalVMOptions", "-XX:+PrintAssembly", "-XX:PrintAssemblyOptions=intel",
        "-XX:+PrintSignatureHandlers"})
@Measurement(iterations = 5, time = 5, timeUnit = TimeUnit.SECONDS)
@Warmup(iterations = 3, time = 2, timeUnit = TimeUnit.SECONDS)
public class LinkedQueueBenchmark {
    private static final Unsafe unsafe = UnsafeProvider.getUnsafe(); // reflective Unsafe accessor
    private static final long offsetObject;
    private static final long offsetNext;

    private static final int THREADS = 5;

    private static class Node {
        private volatile Node next;
        public Node() {}
    }

    static {
        try {
            offsetObject = unsafe.objectFieldOffset(LinkedQueueBenchmark.class.getDeclaredField("object"));
            offsetNext = unsafe.objectFieldOffset(Node.class.getDeclaredField("next"));
        } catch (Exception ex) { throw new Error(ex); }
    }

    protected long t0,t1,t2,t3,t4,t5,t6,t7; // padding against false sharing
    private volatile Node object = new Node();

    @Threads(THREADS)
    @Benchmark
    public Node doTestCasSmart() {
        Node current, o = new Node();
        for (;;) {
            current = this.object;
            if (unsafe.compareAndSwapObject(this, offsetObject, current, o)) {
                //current.next = o; // Special line:
                break;
            } else {
                LockSupport.parkNanos(1);
            }
        }
        return current;
    }
}
  1. In the current variant I get ~55 ops/us.
  2. But if I uncomment the "Special line", or replace it with unsafe.putOrderedObject (in either direction: current.next = o or o.next = current), performance drops to ~2 ops/us. The variants are sketched below.
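For clarity, these are the variants of the "special line" being compared (a sketch; unsafe and offsetNext are the fields from the benchmark class):

current.next = o;                                // plain volatile store
unsafe.putOrderedObject(current, offsetNext, o); // ordered store of next in the old head
unsafe.putOrderedObject(o, offsetNext, current); // the same store in the other direction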

As I understand it, something is happening with the CPU caches, perhaps the store buffer being drained. If I replace it with a lock-based method, without CAS, performance is 11-20 ops/us.
I tried LinuxPerfAsmProfiler and PrintAssembly; in the second case I see:

....[Hottest Regions]...............................................................................
 25.92%   17.93%  [0x7f1d5105fe60:0x7f1d5105fe69] in SpinPause (libjvm.so)
 17.53%   20.62%  [0x7f1d5119dd88:0x7f1d5119de57] in ParMarkBitMap::live_words_in_range(HeapWord*, oopDesc*) const (libjvm.so)
 10.81%    6.30%  [0x7f1d5129cff5:0x7f1d5129d0ed] in ParallelTaskTerminator::offer_termination(TerminatorTerminator*) (libjvm.so)
  7.99%    9.86%  [0x7f1d3c51d280:0x7f1d3c51d3a2] in com.jad.generated.LinkedQueueBenchmark_doTestCasSmart::doTestCasSmart_thrpt_jmhStub 

Can someone explain to me what is really happening? Why is it so slow? Where is the store-load barrier here? Why does putOrdered not help? And how can I fix it?

Serpent asked 11/10/2015 at 13:04

Rule: Instead of looking for "advanced" answers, you should first look for stupid mistakes.

SpinPause, ParMarkBitMap::live_words_in_range(HeapWord*, oopDesc*) and ParallelTaskTerminator::offer_termination(TerminatorTerminator*) are from GC threads, which may very well mean most of the work the benchmark does is GC. Indeed, running with the "special line" uncommented and -prof gc yields:

# Run complete. Total time: 00:00:43

Benchmark                      Mode  Cnt      Score    Error   Units
LQB.doTestCasSmart            thrpt    5      5.930 ±  3.867  ops/us
LQB.doTestCasSmart:·gc.time   thrpt    5  29970.000               ms

Therefore, out of the 43 seconds for the run, you spent 30 seconds doing GC. Even the plain -verbose:gc shows it:

Iteration   3: [Full GC (Ergonomics)  408188K->1542K(454656K), 0.0043022 secs]
[GC (Allocation Failure)  60422K->60174K(454656K), 0.2061024 secs]
[GC (Allocation Failure)  119054K->118830K(454656K), 0.2314572 secs]
[GC (Allocation Failure)  177710K->177430K(454656K), 0.2268396 secs]
[GC (Allocation Failure)  236310K->236054K(454656K), 0.1718049 secs]
[GC (Allocation Failure)  294934K->294566K(454656K), 0.2265855 secs]
[Full GC (Ergonomics)  294566K->147408K(466432K), 0.7139546 secs]
[GC (Allocation Failure)  206288K->205880K(466432K), 0.2065388 secs]
[GC (Allocation Failure)  264760K->264312K(466432K), 0.2314117 secs]
[GC (Allocation Failure)  323192K->323016K(466432K), 0.2183271 secs]
[Full GC (Ergonomics)  323016K->322663K(466432K), 2.8058725 secs]

2.8s full GCs, that sucks. Around 5s spent in GC, in an iteration that is bounded by 5s of run time. That sucks too.
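(As an aside, the GC profiler used above can also be attached programmatically through the JMH API; a minimal sketch, assuming the benchmark class from the question:)

import org.openjdk.jmh.profile.GCProfiler;
import org.openjdk.jmh.runner.Runner;
import org.openjdk.jmh.runner.RunnerException;
import org.openjdk.jmh.runner.options.Options;
import org.openjdk.jmh.runner.options.OptionsBuilder;

public class BenchmarkRunner {
    public static void main(String[] args) throws RunnerException {
        Options opt = new OptionsBuilder()
                .include(LinkedQueueBenchmark.class.getSimpleName())
                .addProfiler(GCProfiler.class) // equivalent to -prof gc on the command line
                .build();
        new Runner(opt).run();
    }
}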

Why is it? Well, you are building a linked list there. Granted, the head of the queue is unreachable, and everything from the head to your object should be collected. But the collection is not instantaneous: the longer the queue gets, the more memory it consumes, and the more work GC has to do to traverse it. That is a positive feedback loop that cripples the execution. Since the queue elements are collectable anyhow, this feedback loop never reaches OOME. Storing an initial object in a new head field would make the test eventually OOME.

Therefore, your question has nothing to do with putOrdered, memory barriers, or queue performance, to be frank. I think you need to reconsider what you actually test. Designing the test so that the transient memory footprint stays the same on each @Benchmark call is an art in itself.
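For illustration only, not from the original answer: one way to keep the transient footprint constant is to pre-allocate the nodes once and reuse them, so the hot path allocates nothing. A minimal sketch, with hypothetical names, using AtomicReference in place of the Unsafe CAS:

import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicReference;

import org.openjdk.jmh.annotations.*;

@State(Scope.Benchmark)
@BenchmarkMode(Mode.Throughput)
@OutputTimeUnit(TimeUnit.MICROSECONDS)
public class ReusedNodeBenchmark {
    static final class Node {
        volatile Node next;
    }

    private static final int POOL_SIZE = 1024; // power of two, for cheap masking

    private Node[] pool;
    private final AtomicReference<Node> head = new AtomicReference<>();
    private final AtomicInteger idx = new AtomicInteger();

    @Setup
    public void setup() {
        pool = new Node[POOL_SIZE];
        for (int i = 0; i < POOL_SIZE; i++) {
            pool[i] = new Node();
        }
        head.set(pool[0]);
    }

    @Threads(5)
    @Benchmark
    public Node swapHead() {
        // Reuse a pre-allocated node instead of new Node(): no garbage per call.
        Node o = pool[idx.getAndIncrement() & (POOL_SIZE - 1)];
        for (;;) {
            Node current = head.get();
            if (head.compareAndSet(current, o)) {
                current.next = o; // the link store no longer grows an unbounded chain
                return current;
            }
        }
    }
}

This measures the CAS and the link store without the GC feedback loop, at the cost of occasionally CASing a node onto itself; whether that is still representative of the intended queue is for the author to judge.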

Tonometer answered 11/10/2015 at 13:45
Comment from Serpent: I hadn't really looked into the GC side of it. Yes, this test is a surrogate. I'm trying to build a LIFO based on CAS. I have a simple implementation where offer() always reads the current tail (it is never null), CAS-replaces it with the new tail, and then does an ordered put of next in the previous tail. The profiler said this method is the slowest. I get performance equal to ConcurrentLinkedQueue when writers <= readers, and worse otherwise. Maybe that is GC too. (A sketch of that offer() follows.)
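A hypothetical reconstruction of the offer() the comment describes (not the commenter's actual code; AtomicReference and AtomicReferenceFieldUpdater stand in for the Unsafe calls):

import java.util.concurrent.atomic.AtomicReference;
import java.util.concurrent.atomic.AtomicReferenceFieldUpdater;

final class CasLifo<E> {
    static final class Node<E> {
        final E item;
        volatile Node<E> next;
        Node(E item) { this.item = item; }
    }

    @SuppressWarnings("rawtypes")
    private static final AtomicReferenceFieldUpdater<Node, Node> NEXT =
            AtomicReferenceFieldUpdater.newUpdater(Node.class, Node.class, "next");

    // A sentinel node guarantees the tail is never null.
    private final AtomicReference<Node<E>> tail = new AtomicReference<>(new Node<>(null));

    public void offer(E e) {
        Node<E> n = new Node<>(e);
        for (;;) {
            Node<E> t = tail.get();                // read the current tail
            if (tail.compareAndSet(t, n)) {        // CAS-replace it with the new tail
                NEXT.lazySet(t, n);                // "ordered put" of next in the previous tail
                return;
            }
        }
    }
}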
