Strange behavior in sun.misc.Unsafe.compareAndSwap measurement via JMH

@State(Scope.Benchmark) @BenchmarkMode({Mode.Throughput, Mode.AverageTime}) @OutputTimeUnit(TimeUnit.MICROSECONDS) @Fork(1) @Warmup(iterations = 5) @Measurement(iterations = 5) public class UnsafeCounter_Benchmark { public Counter unsync, syncNoV, syncV, lock, atomic, unsafe, unsafeGA; @Setup(Level.Iteration) public void prepare() { unsync = new UnsyncCounter(); syncNoV = new SyncNoVolatileCounter(); syncV = new SyncVolatileCounter(); lock = new LockCounter(); atomic = new AtomicCounter(); unsafe = new UnsafeCASCounter(); unsafeGA = new UnsafeGACounter(); } @Benchmark public void unsyncCount() { unsyncCounter(); } @CompilerControl(CompilerControl.Mode.DONT_INLINE) public void unsyncCounter() { unsync.increment(); } @Benchmark public void syncNoVCount() { syncNoVCounter(); } @CompilerControl(CompilerControl.Mode.DONT_INLINE) public void syncNoVCounter() { syncNoV.increment(); } @Benchmark public void syncVCount() { syncVCounter(); } @CompilerControl(CompilerControl.Mode.DONT_INLINE) public void syncVCounter() { syncV.increment(); } @Benchmark public void lockCount() { lockCounter(); } @CompilerControl(CompilerControl.Mode.DONT_INLINE) public void lockCounter() { lock.increment(); } @Benchmark public void atomicCount() { atomicCounter(); } @CompilerControl(CompilerControl.Mode.DONT_INLINE) public void atomicCounter() { atomic.increment(); } @Benchmark public void unsafeCount() { unsafeCounter(); } @CompilerControl(CompilerControl.Mode.DONT_INLINE) public void unsafeCounter() { unsafe.increment(); } @Benchmark public void unsafeGACount() { unsafeGACounter(); } @CompilerControl(CompilerControl.Mode.DONT_INLINE) public void unsafeGACounter() { unsafeGA.increment(); } public static void main(String[] args) throws RunnerException { Options baseOpts = new OptionsBuilder() .include(UnsafeCounter_Benchmark.class.getSimpleName()) .threads(100) .jvmArgs("-ea") .build(); new Runner(baseOpts).run(); } }

Benchmark Mode Samples Score Error Units o.k.u.u.UnsafeCounter_Benchmark.atomicCount thrpt 5 42.178 ± 17.643 ops/us o.k.u.u.UnsafeCounter_Benchmark.lockCount thrpt 5 24.044 ± 2.264 ops/us o.k.u.u.UnsafeCounter_Benchmark.syncNoVCount thrpt 5 22.849 ± 1.344 ops/us o.k.u.u.UnsafeCounter_Benchmark.syncVCount thrpt 5 20.235 ± 2.027 ops/us o.k.u.u.UnsafeCounter_Benchmark.unsafeCount thrpt 5 12.460 ± 1.326 ops/us o.k.u.u.UnsafeCounter_Benchmark.unsafeGACount thrpt 5 39.106 ± 2.966 ops/us o.k.u.u.UnsafeCounter_Benchmark.unsyncCount thrpt 5 93.076 ± 9.674 ops/us o.k.u.u.UnsafeCounter_Benchmark.atomicCount avgt 5 2.604 ± 0.133 us/op o.k.u.u.UnsafeCounter_Benchmark.lockCount avgt 5 4.161 ± 0.546 us/op o.k.u.u.UnsafeCounter_Benchmark.syncNoVCount avgt 5 4.440 ± 0.523 us/op o.k.u.u.UnsafeCounter_Benchmark.syncVCount avgt 5 5.073 ± 0.439 us/op o.k.u.u.UnsafeCounter_Benchmark.unsafeCount avgt 5 9.088 ± 5.964 us/op o.k.u.u.UnsafeCounter_Benchmark.unsafeGACount avgt 5 2.611 ± 0.164 us/op o.k.u.u.UnsafeCounter_Benchmark.unsyncCount avgt 5 1.047 ± 0.050 us/op

Counter result: UnsyncCounter 97538676 Time passed in ms:259 Counter result: AtomicCounter 100000000 Time passed in ms:1805 Counter result: LockCounter 100000000 Time passed in ms:3904 Counter result: SyncNoVolatileCounter 100000000 Time passed in ms:14227 Counter result: SyncVolatileCounter 100000000 Time passed in ms:19224 Counter result: UnsafeCASCounter 100000000 Time passed in ms:8077 Counter result: UnsafeGACounter 100000000 Time passed in ms:2549

public class UnsafeCounter_Test { static class CounterClient implements Runnable { private Counter c; private int num; public CounterClient(Counter c, int num) { this.c = c; this.num = num; } @Override public void run() { for (int i = 0; i < num; i++) { c.increment(); } } } public static void makeTest(Counter counter) throws InterruptedException { int NUM_OF_THREADS = 1000; int NUM_OF_INCREMENTS = 100000; ExecutorService service = Executors.newFixedThreadPool(NUM_OF_THREADS); long before = System.currentTimeMillis(); for (int i = 0; i < NUM_OF_THREADS; i++) { service.submit(new CounterClient(counter, NUM_OF_INCREMENTS)); } service.shutdown(); service.awaitTermination(1, TimeUnit.MINUTES); long after = System.currentTimeMillis(); System.out.println("Counter result: " + counter.getClass().getSimpleName() + " " + counter.getCounter()); System.out.println("Time passed in ms:" + (after - before)); } public static void main(String[] args) throws InterruptedException { makeTest(new UnsyncCounter()); makeTest(new AtomicCounter()); makeTest(new LockCounter()); makeTest(new SyncNoVolatileCounter()); makeTest(new SyncVolatileCounter()); makeTest(new UnsafeCASCounter()); makeTest(new UnsafeGACounter()); } }

Thinking out loud: it is remarkable how often people do 90% of the tedious work, and leave the 10% (where the fun begins) for someone else! All right, I'm taking all the fun!

Let me repeat the experiment first on my i7-4790K, 8u40 EA:

Benchmark                                 Mode  Samples    Score    Error   Units
UnsafeCounter_Benchmark.atomicCount      thrpt        5   47.669 ± 18.440  ops/us
UnsafeCounter_Benchmark.lockCount        thrpt        5   14.497 ±  7.815  ops/us
UnsafeCounter_Benchmark.syncNoVCount     thrpt        5   11.618 ±  2.130  ops/us
UnsafeCounter_Benchmark.syncVCount       thrpt        5   11.337 ±  4.532  ops/us
UnsafeCounter_Benchmark.unsafeCount      thrpt        5    7.452 ±  1.042  ops/us
UnsafeCounter_Benchmark.unsafeGACount    thrpt        5   43.332 ±  3.435  ops/us
UnsafeCounter_Benchmark.unsyncCount      thrpt        5  102.773 ± 11.943  ops/us

Truly, something seems fishy about unsafeCount test. Really, you have to presume all data is fishy before you validated it. For nanobenchmarks, you have to validate the generated code to see if you actually measure something you want to measure. In JMH, it is very quickly doable with -prof perfasm. In fact, if you look at the hottest region of unsafeCount there, you will notice a few funny things:

  0.12%    0.04%    0x00007fb45518e7d1: mov    0x10(%r10),%rax    
 17.03%   23.44%    0x00007fb45518e7d5: test   %eax,0x17318825(%rip)
  0.21%    0.07%    0x00007fb45518e7db: mov    0x18(%r10),%r11    ; getfield offset
 30.33%   10.77%    0x00007fb45518e7df: mov    %rax,%r8
  0.00%             0x00007fb45518e7e2: add    $0x1,%r8           
  0.01%             0x00007fb45518e7e6: cmp    0xc(%r10),%r12d    ; typecheck 
                    0x00007fb45518e7ea: je     0x00007fb45518e80b ; bail to v-call
  0.83%    0.48%    0x00007fb45518e7ec: lock cmpxchg %r8,(%r10,%r11,1)
 33.27%   25.52%    0x00007fb45518e7f2: sete   %r8b
  0.12%    0.01%    0x00007fb45518e7f6: movzbl %r8b,%r8d          
  0.03%    0.04%    0x00007fb45518e7fa: test   %r8d,%r8d
                    0x00007fb45518e7fd: je     0x00007fb45518e7d1 ; back branch

Translation: a) offset field gets re-read on each iteration -- because CAS memory effects imply volatile read, and therefore the field needs to be pessimistically re-read; b) the hilarious part is that unsafe field is also being re-read for a typecheck -- for the same reason.

This is why high-performance code should look like this:

--- a/utils bench/src/main/java/org/kirmit/utils/unsafe/concurrency/UnsafeCASCounter.java       
+++ b/utils bench/src/main/java/org/kirmit/utils/unsafe/concurrency/UnsafeCASCounter.java       
@@ -5,13 +5,13 @@ import sun.misc.Unsafe;

 public class UnsafeCASCounter implements Counter {
     private volatile long counter = 0;
-    private final Unsafe unsafe = UnsafeHelper.unsafe;
-    private long offset;
-    {
+    private static final Unsafe unsafe = UnsafeHelper.unsafe;
+    private static final long offset;
+    static {
         try {
             offset = unsafe.objectFieldOffset(UnsafeCASCounter.class.getDeclaredField("counter"));
         } catch (NoSuchFieldException e) {
-            e.printStackTrace();
+            throw new IllegalStateException("Whoops!");
         }
     }

If you do that, the unsafeCount performance boosts right up:

Benchmark                              Mode  Samples   Score    Error   Units
UnsafeCounter_Benchmark.unsafeCount    thrpt        5  9.733 ± 0.673  ops/us

...which is fairly close to synchronized tests now, given the error bounds. If you look at the -prof perfasm now, this is an unsafeCount loop:

  0.08%    0.02%    0x00007f7575191900: mov    0x10(%r10),%rax       
 28.09%   28.64%    0x00007f7575191904: test   %eax,0x161286f6(%rip) 
  0.23%    0.08%    0x00007f757519190a: mov    %rax,%r11
                    0x00007f757519190d: add    $0x1,%r11
                    0x00007f7575191911: lock cmpxchg %r11,0x10(%r10)
 47.27%   23.48%    0x00007f7575191917: sete   %r8b
  0.10%             0x00007f757519191b: movzbl %r8b,%r8d        
  0.02%             0x00007f757519191f: test   %r8d,%r8d
                    0x00007f7575191922: je     0x00007f7575191900

This loop is very tight, and it seems nothing can make it go faster. We spend most of the time loading the "updated" value and actually CAS-ing it. But we contend a lot! To figure out if contention is the leading cause, let's add backoffs:

--- a/utils bench/src/main/java/org/kirmit/utils/unsafe/concurrency/UnsafeCASCounter.java       
+++ b/utils bench/src/main/java/org/kirmit/utils/unsafe/concurrency/UnsafeCASCounter.java       
@@ -20,6 +21,7 @@ public class UnsafeCASCounter implements Counter {
         long before = counter;
         while (!unsafe.compareAndSwapLong(this, offset, before, before + 1L)) {
             before = counter;
+            Blackhole.consumeCPU(1000);
         }
     }

...running:

Benchmark                                 Mode  Samples    Score    Error   Units
UnsafeCounter_Benchmark.unsafeCount      thrpt        5   99.869 ± 107.933  ops/us

Voila. We do more work in the loop, but it saves us from contending a lot. I tried to explain this before in "Nanotrusting the Nanotime", it might be good to go back there and read up more on benchmarking methodology, especially when heavy-weight operations are measured. This highlights the pitfall in the entire experiment, not only with unsafeCount.

Exercise for the OP and interested readers: explain why unsafeGACount and atomicCount perform much faster than other tests. You have the tools now.

P.S. Running N threads on machine having C (C < N) threads is silly: you might think you have "contention" with N threads, but instead you are running and "contending" C threads only. It is especially amusing when people do 1000 threads on 4 core machine...

P.P.S. Time check: 10 minutes to do the profiling and additional experiments, 20 minutes to write it up. And how much time you wasted replicating the result by hand? ;)

Recommended topics

Hot tags