Emulating a memory barrier in Java to get rid of volatile reads
Assume I have a field that's accessed concurrently, read many times, and seldom written to.

public Object myRef = new Object();

Let's say a thread T1 will set myRef to another value once a minute, while N other threads will read myRef billions of times, continuously and concurrently. I only need the new value of myRef to eventually become visible to all threads.

A simple solution would be to use an AtomicReference or simply volatile like this:

public volatile Object myRef = new Object();

However, as far as I know, volatile reads do incur a performance cost. I know it's minuscule, and this is more something I wonder about than something I actually need. So let's not be concerned with performance and assume this is a purely theoretical question.

So the question boils down to: is there a way to safely bypass volatile reads for references that are seldom written to, by doing something at the write site?

After some reading, it looks like memory barriers could be what I need. So if a construct like this existed, my problem would be solved:

  • Write
  • Invoke Barrier (sync)
  • Everything is synced and all threads will see the new value. (Without a permanent cost at read sites: a read can be stale or incur a one-time cost as the caches are synced, but after that it's all back to regular field gets until the next write.)

Is there such a construct in Java, or in general? At this point I can't help but think that if something like this existed, it would already have been incorporated into the atomic packages by the much smarter people maintaining those. (Maybe a disproportionately frequent read-to-write ratio wasn't a case worth caring about?) So maybe there is something wrong in my thinking, and such a construct is not possible at all?

I have seen some code samples use volatile for a similar purpose, exploiting its happens-before contract, with a separate sync field, e.g.:

public Object myRef = new Object();
public volatile int sync = 0;

and at writing thread/site:

myRef = new Object();
sync += 1; // volatile write to emulate a barrier

I am not sure this works, and some argue it works on the x86 architecture only. After reading the related sections of the JMM (Java Memory Model), I think it's only guaranteed to work if that volatile write is coupled with a volatile read from the threads that need to see the new value of myRef. (So it doesn't get rid of the volatile read.)
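
To make that concrete, my understanding is that the guarantee only kicks in with a pairing like this on the reader side (a sketch):

    int s = sync;          // volatile read; pairs with the writer's sync += 1
    Object ref = myRef;    // if s reflects the increment, this read is guaranteed
                           // to see the write to myRef that preceded it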

Returning to my original question: is this possible at all? Is it possible in Java, or with one of the new Java 9 APIs such as VarHandles?
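
For concreteness, the write-site-only construct I'm imagining would look something like this with the Java 9 fence API (a sketch; whether it actually guarantees visibility to readers doing plain reads is exactly what I'm asking):

    import java.lang.invoke.VarHandle;

    public Object myRef = new Object();   // plain, non-volatile field

    public void update(Object newRef) {
        myRef = newRef;                   // plain write
        VarHandle.fullFence();            // fence at the write site only
    }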

Rattlebrain answered 11/10, 2019 at 7:56 Comment(5)
To me it sounds like you're well into the territory where you need to write and run some actual benchmarks simulating your workloads. – Lauranlaurance
The JMM states that if your writer thread does sync += 1; and your reader threads read sync and observe the new value, they will see the myRef update, too. Because you only need the readers to see the update eventually, you could use this to your advantage and only read sync on every 1000th iteration of the reader thread, or something similar. But you can do a similar trick with volatile, too: just cache the myRef field in the readers for 1000 iterations, then read it again using volatile... – Pip
@PetrJaneček But doesn't he have to synchronize access to the counter variable, which is shared between threads? Won't that be a bottleneck? In my opinion that would be even more costly. – Prefatory
@RavindraRanwala Every reader will have its own counter, if you mean counting to the 1000 iterations or so. If you meant the sync field, no, readers would not touch the sync field on every iteration; they'd do it opportunistically, when they want to check whether there has been an update. That said, a simpler solution would be to cache myRef for 1000 rounds, then reread it... – Pip
@PetrJaneček Thanks, I have thought about it as a possible solution. But I'm wondering if this is possible using a generic, solid implementation. – Rattlebrain

So basically you want the semantics of a volatile without the runtime cost.

I don't think it is possible.

The problem is that the runtime cost of volatile is due to the instructions that implement the memory barriers in the writer and reader code. If you "optimize" the reader by getting rid of its memory barrier, then you are no longer guaranteed that the reader will see the seldom-written new value when it is actually written.

FWIW, some versions of the sun.misc.Unsafe class provide explicit loadFence, storeFence and fullFence methods, but I don't think that using them will give any performance benefit over using a volatile.
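
For illustration, here is roughly how you would reach those fences (a sketch, assuming a JDK where sun.misc.Unsafe is still accessible via reflection):

    import java.lang.reflect.Field;
    import sun.misc.Unsafe;

    public class Fences {
        private static final Unsafe UNSAFE;

        static {
            try {
                Field f = Unsafe.class.getDeclaredField("theUnsafe");
                f.setAccessible(true);
                UNSAFE = (Unsafe) f.get(null);
            } catch (ReflectiveOperationException e) {
                throw new ExceptionInInitializerError(e);
            }
        }

        Object ref = new Object();        // plain, non-volatile field

        void write(Object newRef) {
            ref = newRef;
            UNSAFE.storeFence();          // orders the write before later stores
        }

        Object read() {
            Object r = ref;
            UNSAFE.loadFence();           // the reader still needs its own fence
            return r;
        }
    }

Note that the reader still needs its own fence for the guarantee to hold, which is precisely the cost you are trying to avoid.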


Hypothetically ...

what you want is for one processor in a multi-processor system to be able to tell all of the other processors:

"Hey! Whatever you are doing, invalidate your memory cache for address XYZ, and do it now."

Unfortunately, modern ISAs don't support this.

In practice, each processor controls its own cache.

Gonococcus answered 18/10, 2019 at 5:20 Comment(1)
I see, that hypothetical part in your answer is what I was after. Thanks. – Rattlebrain

I'm not quite sure if this is correct, but I might solve this using a queue.

Create a class that wraps an ArrayBlockingQueue attribute. The class has an update method and a read method. The update method posts the new value onto the queue and then removes all values except the last. The read method returns the result of a peek operation on the queue, i.e. it reads but does not remove. Threads peeking at the element at the front of the queue do so unimpeded, and threads updating the queue do so cleanly.
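
A minimal sketch of what I mean (class and method names are made up):

    import java.util.concurrent.ArrayBlockingQueue;

    public class QueueBackedRef<T> {
        private final ArrayBlockingQueue<T> queue = new ArrayBlockingQueue<>(2);

        public QueueBackedRef(T initial) {
            queue.offer(initial);
        }

        // Post the new value, then drop everything in front of it,
        // so peek() ends up seeing the latest value.
        public synchronized void update(T newValue) {
            queue.offer(newValue);
            while (queue.size() > 1) {
                queue.poll();
            }
        }

        // Read without removing.
        public T read() {
            return queue.peek();
        }
    }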

Henning answered 18/10, 2019 at 2:57 Comment(0)
  • You can use ReentrantReadWriteLock, which is designed for the few-writes-many-reads scenario (a minimal sketch follows after this list).
  • You can use StampedLock, which is designed for the same few-writes-many-reads case, but also lets reads be attempted optimistically. Example:

    private final StampedLock lock = new StampedLock();

    public void modify() {              // write method
        long stamp = lock.writeLock();
        try {
            modifyStateHere();          // your actual state mutation
        } finally {
            lock.unlockWrite(stamp);
        }
    }

    public Object read() {              // read method
        long stamp = lock.tryOptimisticRead();
        Object result = doRead();       // try without a lock; this method should be fast
        if (!lock.validate(stamp)) {    // optimistic read failed
            stamp = lock.readLock();    // acquire a read lock and repeat the read
            try {
                result = doRead();
            } finally {
                lock.unlockRead(stamp);
            }
        }
        return result;
    }
    
  • Make your state immutable and allow controlled modifications only by cloning the existing object and altering only the necessary properties via the constructor. Once the new state is constructed, you assign it to the reference being read by the many reading threads. This way the reading threads incur zero cost.
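
For the first option, a minimal ReentrantReadWriteLock sketch (names are illustrative):

    import java.util.concurrent.locks.ReentrantReadWriteLock;

    public class GuardedRef<T> {
        private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
        private T value;

        public GuardedRef(T initial) {
            this.value = initial;
        }

        public void set(T newValue) {
            lock.writeLock().lock();
            try {
                value = newValue;
            } finally {
                lock.writeLock().unlock();
            }
        }

        public T get() {
            lock.readLock().lock();     // many readers may hold this concurrently
            try {
                return value;
            } finally {
                lock.readLock().unlock();
            }
        }
    }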

Pilchard answered 18/10, 2019 at 8:37 Comment(2)
If you feel like downvoting, please state why, so that the author and the community can learn. – Pilchard
Making it immutable is not possible in my scenario. And I would be quite surprised if the StampedLock case incurs less cost than a simple volatile read. However, I will try it, thanks. – Rattlebrain

X86 provides TSO; you get [LoadLoad][LoadStore][StoreStore] fences for free.

A volatile read requires acquire semantics.

r1=Y
[LoadLoad]
[LoadStore]
...

And as you can see, this is already provided by the X86 for free.

In your case most of the calls are reads, and the cache line will already be in the local cache.

There is a price to pay in terms of compiler-level optimizations (e.g. the compiler can't cache the value in a register or hoist the read out of a loop), but on the hardware level a volatile read is just as expensive as a regular read.

On the other hand, the volatile write is more expensive because it requires a [StoreLoad] to guarantee sequential consistency (on the JVM this is done using a lock addl $0, (%rsp) or an MFENCE). Since writes are very rare in your situation, this isn't an issue.

I would be careful with optimizations at this level because it is very easy to make the code more complex than it actually needs to be. It's best to guide your development efforts with benchmarks, e.g. using JMH, and preferably to test on real hardware. There could also be other nasty creatures hiding in there, like false sharing.
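
For example, a minimal JMH sketch comparing a plain read with a volatile read could look like this (names are illustrative; run it with multiple threads to match your scenario):

    import org.openjdk.jmh.annotations.Benchmark;
    import org.openjdk.jmh.annotations.Scope;
    import org.openjdk.jmh.annotations.State;

    @State(Scope.Benchmark)
    public class ReadBenchmark {
        Object plainRef = new Object();
        volatile Object volatileRef = new Object();

        @Benchmark
        public Object plainRead() {
            return plainRef;            // returning the value prevents dead-code elimination
        }

        @Benchmark
        public Object volatileRead() {
            return volatileRef;
        }
    }

On x86, the reasoning above predicts the two will be close, since the hardware cost is the same; any difference comes from the lost compiler optimizations.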

Dol answered 18/5, 2020 at 14:1 Comment(0)
