Implementing an acquire for a release from Unsafe.putOrdered*()?

What do you think is the best correct way for implementing the acquire part of a release/acquire pair in Java?

I'm trying to model some of the actions in an application of mine using classic release/acquire semantics (without StoreLoad and without sequential consistency across threads).

There are a couple of ways to achieve the rough equivalent of a store-release in the JDK. java.util.concurrent.atomic.Atomic*.lazySet() and the underlying sun.misc.Unsafe.putOrdered*() are the most often cited approaches. However, there's no obvious way to implement a load-acquire.

  • The JDK APIs which offer lazySet() mostly use volatile variables internally, so their store-releases end up paired with volatile loads. In theory volatile loads are more expensive than load-acquires, yet in the context of a preceding store-release they provide nothing beyond pure load-acquire semantics.

  • sun.misc.Unsafe does not provide getAcquire*() equivalents of the putOrdered*() methods, even though such acquire methods are planned for the upcoming VarHandles API.

  • Something that sounds like it would work is a plain load followed by sun.misc.Unsafe.loadFence() (a sketch of this pairing follows the list). It's somewhat disconcerting that I haven't seen this anywhere else, which may be related to the fact that it's a pretty ugly hack.
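
For concreteness, here is a minimal sketch of that pairing in a single-producer/single-consumer handoff. All the names (Handoff, publishedIndex, poll) are illustrative, not from my application, and obtaining theUnsafe via reflection is the usual JDK 8 hack:

import java.lang.reflect.Field;
import sun.misc.Unsafe;

// Illustrative single-producer/single-consumer handoff; all names are made up.
class Handoff {
    private static final Unsafe U;
    private static final long INDEX_OFFSET;
    static {
        try {
            Field f = Unsafe.class.getDeclaredField("theUnsafe");
            f.setAccessible(true);
            U = (Unsafe) f.get(null);
            INDEX_OFFSET = U.objectFieldOffset(
                    Handoff.class.getDeclaredField("publishedIndex"));
        } catch (ReflectiveOperationException e) {
            throw new Error(e);
        }
    }

    private long publishedIndex = -1;           // accessed only via Unsafe
    private final long[] data = new long[1024];

    // Producer: write the payload, then store-release the index.
    void publish(long index, long value) {
        data[(int) (index & 1023)] = value;
        U.putOrderedLong(this, INDEX_OFFSET, index); // store-release
    }

    // Consumer: plain load of the index, then loadFence() as the acquire half.
    // Returns Long.MIN_VALUE as an illustrative "not yet published" sentinel.
    long poll(long index) {
        long seen = U.getLong(this, INDEX_OFFSET); // plain load
        U.loadFence(); // subsequent reads must not float above this point
        return seen >= index ? data[(int) (index & 1023)] : Long.MIN_VALUE;
    }
}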

P.S. I understand well that these mechanisms are not covered by the JMM, that they are not sufficient for maintaining sequential consistency, and that the actions they create are not synchronization actions (e.g. they break IRIW). I also understand that the store-releases provided by Atomic*/Unsafe are most often used either for eagerly nulling out references (sketched below) or in producer/consumer scenarios, as an optimized message-passing mechanism for some important index (as in the sketch above).
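
For the reference-nulling use, a minimal sketch (the Slots class and its layout are assumptions for illustration):

import java.util.concurrent.atomic.AtomicReferenceArray;

// Eagerly nulling out a consumed slot: a store-release via lazySet() suffices,
// so the trailing StoreLoad of a full volatile write buys nothing here.
class Slots<E> {
    private final AtomicReferenceArray<E> slots = new AtomicReferenceArray<>(64);

    E take(int i) {
        E e = slots.get(i);
        slots.lazySet(i, null); // release the slot for the GC without a full fence
        return e;
    }
}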

Bookbinder answered 8/5, 2016 at 22:14 Comment(0)

Volatile read is exactly what you are looking for.

In fact, the corresponding volatile operations already have release/acquire semantics (otherwise happens-before would be impossible for a paired volatile write/read). But paired volatile operations must not only be sequentially consistent (~happens-before); they must also appear in a total synchronization order. That is why a StoreLoad barrier is inserted after a volatile write: it guarantees a total order of volatile writes to different locations, so all threads see those values in the same order.
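
To see what that total order buys, consider the classic IRIW (independent reads of independent writes) shape, sketched below; the class and field names are illustrative:

// x and y are independent volatile variables, initially 0.
class IRIW {
    volatile int x, y;
    int r1, r2, r3, r4;

    void writer1() { x = 1; }          // runs in thread 1
    void writer2() { y = 1; }          // runs in thread 2
    void reader1() { r1 = x; r2 = y; } // runs in thread 3
    void reader2() { r3 = y; r4 = x; } // runs in thread 4
    // With volatile (sequentially consistent) accesses the outcome
    // r1 == 1, r2 == 0, r3 == 1, r4 == 0 is forbidden: both readers must
    // agree on the order of the two independent writes. With only
    // release/acquire (e.g. putOrdered* plus loadFence()) it is allowed.
}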

A volatile read has acquire semantics: proof from the HotSpot codebase; there is also a direct recommendation by Doug Lea in the JSR-133 Cookbook (LoadLoad and LoadStore barriers after each volatile read).

Unsafe.loadFence() also has acquire semantics (proof), but it is used not to read a value (a plain volatile read can do that), but to prevent preceding plain reads from being reordered with a subsequent volatile read. This is how StampedLock implements optimistic reading (see the StampedLock#validate method implementation and its usages); the canonical idiom is sketched below.
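
This is essentially the optimistic-read example from the StampedLock Javadoc; validate() issues the loadFence() so the plain reads of x and y cannot drift past the re-check of the lock state:

import java.util.concurrent.locks.StampedLock;

class Point {
    private final StampedLock sl = new StampedLock();
    private double x, y;

    double distanceFromOrigin() {
        long stamp = sl.tryOptimisticRead(); // no lock taken, just a stamp
        double currentX = x, currentY = y;   // plain reads
        if (!sl.validate(stamp)) {           // loadFence() happens in here
            stamp = sl.readLock();           // fall back to a real read lock
            try {
                currentX = x;
                currentY = y;
            } finally {
                sl.unlockRead(stamp);
            }
        }
        return Math.sqrt(currentX * currentX + currentY * currentY);
    }
}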

Update after discussion in comments.

Let's check whether Unsafe#loadFence() and volatile read are the same and have acquire semantics.

I'm looking at the HotSpot C1 compiler source code to avoid reading through all the optimizations in C2. It transforms bytecode (in fact, not bytecode itself, but its interpreter representation) into LIR (Low-level Intermediate Representation) and then translates the graph into actual opcodes depending on the target microarchitecture.

Unsafe#loadFence is an intrinsic with the alias _loadFence. In the C1 LIR generator it produces this:

case vmIntrinsics::_loadFence :
    if (os::is_MP()) __ membar_acquire();
    break;

where __ is a macro for LIR generation.

Now let's look at the volatile read implementation in the same LIR generator. It inserts null checks, checks for IRIW, checks whether we are on 32-bit x86 trying to read a 64-bit value (which requires some magic with SSE/FPU registers) and, finally, leads us to the same code:

if (is_volatile && os::is_MP()) {
    __ membar_acquire();
}

Assembler generator then inserts platform-specific acquire instruction(s) here.

Looking at the specific implementations (no links here, but they can all be found in src/cpu/{$cpu_model}/vm/c1_LIRAssembler_{$cpu_model}.cpp):

  • SPARC

    void LIR_Assembler::membar_acquire() {
        // no-op on TSO
    }
    
  • x86

    void LIR_Assembler::membar_acquire() {
        // No x86 machines currently require load fences
    }
    
  • AArch64 (weakly ordered memory model, so barriers must be present)

    void LIR_Assembler::membar_acquire() {
        __ membar(Assembler::LoadLoad|Assembler::LoadStore);
    }
    

    According to the AArch64 architecture description, such a membar is compiled into a dmb ishld instruction after the load.

  • PowerPC (also a weakly ordered memory model)

    void LIR_Assembler::membar_acquire() {
        __ acquire();
    }
    

    which is then transformed into the specific PowerPC instruction lwsync. According to the comments in the source:

    lwsync orders Store|Store, Load|Store, Load|Load, but not Store|Load

    But since PowerPC has no weaker barrier, lwsync is the only way to implement acquire semantics on PowerPC.

Conclusions

Volatile reads and Unsafe#loadFence() are equal in terms of memory ordering (though perhaps not in terms of possible compiler optimizations). On the most popular architecture, x86, the acquire barrier is a no-op, and PowerPC is the only supported architecture that has no precise acquire barrier (lwsync is slightly stronger than acquire requires).
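
For completeness, the VarHandles API mentioned in the question (which shipped in Java 9) exposes exactly this pairing directly; a minimal sketch, with illustrative names:

import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;

class Flag {
    private static final VarHandle READY;
    static {
        try {
            READY = MethodHandles.lookup()
                    .findVarHandle(Flag.class, "ready", int.class);
        } catch (ReflectiveOperationException e) {
            throw new Error(e);
        }
    }

    private int ready;

    void publish()   { READY.setRelease(this, 1); }               // ~ putOrdered*
    boolean isReady() { return (int) READY.getAcquire(this) == 1; } // load-acquire
}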

Ochre answered 8/5, 2016 at 23:29 Comment(13)
Thanks for taking the time to write an answer! It's filled with nice general info, but while that's useful for some folks, it doesn't provide any new insights beyond the question, or any justification for using specifically a volatile load (as opposed to loadFence(), or something else).Bookbinder
Maybe I don't get what exactly you need. Why doesn't it answer? There is nothing to use except vread/loadFence(). Use volatile read if you want a/r semantics on specific variable and loadFence() if you need barriers only e.g. for set of variables or before reading variable to avoid specific reordering. I will edit my answer as soon as I understand you :) Guess I can remove part about no-ops and VH and add comparison of loadFence() and vread if you want itOchre
You get acquire semantics from both volatile read and loadFence(), and in theory, volatile load may be more expensive on some architectures because it needs to be part of the synchronization actions. So it's not as simple as "just use volatile read", at least not without slightly more substantial argumentation. I assumed that volatile read and loadFence() are the only viable options, but I still hope that someone will bring up yet another interesting option.Bookbinder
@DimitarDimitrov From a concurrency point of view they are the same: to provide a synchronizes-with relationship, only an additional barrier on the write is needed. The compiler (probably) can't reorder two consecutive volatile reads, but I can't imagine a situation where such instruction reordering would be a huge win. As long as acquire semantics require LoadLoad and LoadStore only, there is no "lighter" version of such barriers than vread/loadFence()Ochre
While they may be equivalent in the HotSpot/OpenJDK implementation for x86, and while StoreLoad after write (as opposed to before read) may be the most straightforward approach to implement volatile semantics, this still doesn't automatically mean that volatile read and loadFence() are fully equivalent IMHO. They may very well be right now though. In order not to fall into further pointless theorizing I think we should check if they are for the OpenJDK implementation for x86, Power, SPARC and Aarch64 (some JMH benchmarks may come in handy here), and call it a day if they are.Bookbinder
Ok, here it is. JMH won't help in benchmarking the same code on different architectures, I think, because there will be too much noise and the barriers will be indistinguishable from other CPU effects (power saving/frequency capping/turbo boost/whatever else)Ochre
Thanks for going the extra mile! If you want to be super-exact, you can refine some details here and there. E.g. explain how, assuming StoreLoad is put after a volatile write (based on the expectation of fewer writes than reads), a volatile read is the same as loadFence(); link to the OpenJDK .ad files; mention that AArch64 transforms membar_acquire into dmb ishld, etc. When I wrote my question, I got my research to roughly your answer's conclusions (hoped to see a surprise answer). But seeing how no other alternatives for guaranteed acquire semantics were given, you get the accepted answer.Bookbinder
@DimitarDimitrov Added the .ad reference with the implementation detail for AArch64 (guess there's no need to do this for the no-op implementations). StoreLoad is put after a volatile write not based on an assumption about read/write probability; without it after the write it's easy to violate the synchronization order: assume you write (all ops are volatile here) x = 1, y = 1 in thread 1. Without StoreLoad it's possible for thread 2 to see that x = 0 and y = 1 (it's a valid reordering both in terms of happens-before and in processor semantics) => the global ordering for volatiles is violated.Ochre
Let me clarify. As an alternative to putting StoreLoad after each volatile write, I meant putting it before each volatile read. If that's done, loadFence() stops being an equivalent of a volatile read. However, no actual compiler/JVM implementation does that, due to the assumption that you write to volatile variables much less often than you read from them.Bookbinder
@DimitarDimitrov The JMM is a tricky part :) Such an "optimisation" is not valid: if you remove StoreLoad after the write and place it before the read, then you simply break the JMM guarantees about the volatile SO, so from the perspective of JVM implementors such an implementation just doesn't satisfy the JLS.Ochre
@DimitarDimitrov if it's still unclear to you why it's illegal, you can ask another question or write me on mail (@gmail.com), because comments are not a very convenient way to discuss thisOchre
I don't agree - can you show an example of a breakage caused by StoreLoad fencing before a volatile load? The JMM cookbook also doesn't agree - Note that you could instead issue one before each volatile load, but this would be slower for typical programs using volatiles in which reads greatly outnumber writes.Bookbinder
@DimitarDimitrov I wrote up some possible code traces, and yes, such an approach is legal and doesn't break anything, so it was my misunderstanding, sorry for the misguidance. I will edit my answer a bit later to clarify that theoretically a volatile read can be not semantically equal to loadFence(). Thank you for pointing it out!Ochre

Depending on your exact requirements, doing a non-volatile load, conditionally followed by a volatile load, is the best you can get in Java.

You can do this with a combination of

int permits = theUnsafe.getInt(object, offset);         // cheap plain load
if (!enough(permits))                                   // only when needed,
    permits = theUnsafe.getIntVolatile(object, offset); // pay for the volatile load

This pattern can be used in ring buffers to minimise churn of cache lines, for example in a consumer loop like the sketch below.
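
A hedged sketch of such a loop, reusing the theUnsafe reference from the snippet above and assuming object/offset locate a plain int field holding a published permit count (all names are illustrative):

// theUnsafe obtained via the usual reflection hack, as in the snippet above.
int awaitPermits(Object object, long offset, int needed) {
    for (;;) {
        int permits = theUnsafe.getInt(object, offset);         // plain read
        if (permits < needed)
            permits = theUnsafe.getIntVolatile(object, offset); // refresh
        if (permits >= needed)
            return permits;
        Thread.yield(); // back off; real code might spin or park instead
    }
}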

Saran answered 8/5, 2016 at 22:21 Comment(10)
This snippet is not very useful: a volatile load and a plain load are literally the same here, because x86 has total store order and most of the barriers are no-ops; the only difference is in compiler barriers (not the case for this snippet), so in fact this is just two plain reads :)Ochre
@Ochre You are assuming that specifically x86 will be used :) That may be a safe bet most of the time, but it's not the case here (I'm running on Aarch64, Raspberry Pi 3 to be specific). Also while I am/we are already in dangerous waters with sun.misc.Unsafe and the underspecified release/acquire, it's nice to reason in general terms whenever possible. @PeterLawrey can't the two loads be speculatively collapsed into one volatile load? I guess that won't be that big of a problem, I'm just wondering.Bookbinder
@DimitarDimitrov they could be combined but we have seen performance advantages in using non-volatile loads to avoid taking ownership of the cache line.Saran
@Ochre they behave very differently when measured empirically on x64. Whether this is a feature of the JVM or of the test we used, I couldn't say.Saran
@PeterLawrey it's the compiler, probably. You may find this article interesting: brooker.co.za/blog/2012/09/10/volatile.htmlOchre
@Ochre Just a note. I'm afraid "volatile load and plain load are literally the same" holds with TSO only if another core doesn't modify that address and thereby invalidate your cached valueMeasure
@PeterLawrey nice trick. The question is only how to implement the "enough" function without any inter-thread communication. For now I see only the "value is the same as the previous one/nothing happened" rule. I mean, if, for a circular buffer, for example, a reader sees with a plain read that there are no new items in the buffer, it re-reads the buffer's state with a volatile read.Measure
@Measure but for a plain load (from memory, not from a register) the effect will be the same, won't it? For the register case I've mentioned the compiler barrierOchre
@Ochre yes, but is there a way to set a compiler-only barrier to prevent effects like the third experiment on brooker.co.za/blog/2012/09/10/volatile.html ? I don't know of one. So, I'd say that Peter's snippet is quite interesting and there is no way to get rid of the volatile load in thereMeasure
@Measure right now there isn't, but in Java 9 VarHandle#getOpaque appears with exactly these semantics. I'm speculating now (and without looking at the assembly I can't be sure), but in Peter's case there will be a compiler barrier anyway, otherwise a conditional compiler barrier would blow up the code: "if (!enough(permits)) then read everything from memory, else read from registers". But I'm happy to learn of cases where such a trick will helpOchre
