I torched my earlier attempt to answer this because significant parts were wrong. Thanks to Peter Cordes for sending me to one resource that ballooned into a fascinating journey.
Overall answer: the order of the instructions and the single `lock` come from a combination of requirements imposed by the Java memory model, the memory ordering of x86, memory coherence, and optimizations. Different platforms (e.g. ARM instead of x86) have different memory ordering and instructions, and thus might need different instructions in different locations. The JSR-133 Cookbook, admittedly known for being quite conservative and also outdated, has some good discussion of the four {Load,Store}{Load,Store} barrier combinations, why they exist, and on which platforms they're needed.
1) Does what I have written so far make sense? Or am I totally misinterpreting something?
What you've written is all true, except for the `StoreLoad` making the writes visible to other threads. Counterintuitively, it turns out the `StoreLoad` is there to ensure the latest writes from other threads are seen by later loads in the current thread.
What I'll add is that Java's memory model, in excruciating graph-theoretic detail, defines constraints that must be satisfied for the results of a program to be deemed "correct." It's a tough read, with a lot of statements that sound like one thing but really mean something else (I'm looking at you, 'happens-before'!). Implementations of Java, in this case the instructions from a compiler plus hardware guarantees, must enforce those constraints, and are otherwise free to do whatever they want.
Reads and writes to volatiles are examples of what the model defines to be synchronization actions. A valid execution must have the following properties:

- there is a single total order over all synchronization actions across all threads, called the synchronization order (so)
- all synchronization actions of a given thread must occur in program order in the so (called so-po consistency)
- all reads of a volatile variable `v` must read the latest value written to `v` in the so
There's another set of rules called happens-before. Among other things:

- on a thread `t`, if action `x` occurs before action `y` in program order, then `x` happens-before `y`; the effect of `x` is visible to `y`. Despite being called "happens-before", it doesn't actually have to have happened before (e.g. reordering is allowed); it's just that the results have to be equivalent to it having happened before
- if `x` is a synchronization action that is observed by another synchronization action `y` (possibly on a different thread), then `x` happens-before `y`
- and very importantly, if `x` happens-before `y` and `y` happens-before `z`, then the common-sense `x` happens-before `z` is true.

The last part is more fancily stated as "happens-before is the transitive closure of program order and synchronizes-with."
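Transitivity is what makes the classic publish-via-volatile-flag pattern work. Here's a small sketch (my own illustration, not code from the question; the class and field names are made up):

```java
// Safe publication through a volatile flag, relying on transitivity:
// the plain write (1) happens-before the volatile write (2) by program
// order, and (2) happens-before any read that observes ready == true
// via synchronizes-with, so such a read is guaranteed to see data == 42.
class Publisher {
    int data;               // plain field
    volatile boolean ready; // volatile flag

    void publish() {
        data = 42;    // (1)
        ready = true; // (2)
    }

    int consume() {
        return ready ? data : -1; // -1 means "not published yet"
    }
}
```

Run from another thread, `consume()` can only return -1 or 42, never a half-published 0, because observing `ready == true` pulls the `data` write into view through the happens-before chain.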
So in the example with `field1` and `field6` volatile, and the rest non-volatile:

- `field1` must be written to memory before `field6`: so-po consistency
- `field2` through `field5` must be written to memory before `field6`: these writes happen-before the write to `field6`, any other thread that reads the new value of `field6` will synchronize-with that write, and thus the writes to `field2` through `field5` must be visible to all later operations
- any later read of the volatile `field1` or `field6` on the current thread must read the latest value, because 1. the synchronization order is a total order across all threads and 2. a volatile read of a variable `v` must read the most recent volatile write to `v`
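To make this concrete, here's a minimal sketch of the kind of class under discussion (the original class isn't reproduced in this answer; the field names and volatile markings are taken from the question, the values are placeholders):

```java
// Hypothetical reconstruction: field1 and field6 are volatile,
// field2 through field5 are plain fields, matching the question.
class Example {
    volatile int field1;
    int field2, field3, field4, field5;
    volatile int field6;

    void writer() {
        field1 = 1; // volatile write: a synchronization action
        field2 = 1; // plain writes: independent, may be reordered
        field3 = 1; //   among themselves by the compiler
        field4 = 1;
        field5 = 1;
        field6 = 1; // volatile write: must be the last store
    }
}
```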
These requirements translate to the given x86 code as follows:

- we must write `field1` before `field6` to satisfy the so-po requirement; x86 does not reorder store instructions (thanks Peter!), so no extra instruction is necessary to ensure this. Another ISA may require special instructions, though
- while the write to `field2` happens-before the write to `field3` and so on (program order), the writes are independent. Any other order produces equivalent intra-thread results, so the compiler is free to reorder them (the memory model would call this "equivalent to intra-thread semantics"). Apparently that wonky order is more optimal than writing 2..5 in order
- if another thread `t2` reads thread `t1`'s write to `field6`, then the write to `field6` synchronizes-with and happens-before the read. Happens-before is transitive: the preceding writes to `field2` through `field5` must happen-before the write to `field6`, and that write happens-before the later read of `field6`, so the `field2` through `field6` writes must all happen-before the read. All those writes in `t1` must be visible to `t2`. Therefore all the writes to `field2` through `field5` must be done before writing to `field6`, and this is why `field6` must come last. Again, x86's total store order means the write to `field6` really will come last because it is the last store instruction
- and finally, we need to make sure later reads of `field1` and `field6` read the latest value in the synchronization order over all threads. Without a barrier, a later read of `field1` or `field6` could see a write to `field1` or `field6` sitting in the core's store buffer instead of the latest value in the global order. Therefore later reads of `field1` and `field6` must read from memory instead of the store buffer. The compiler implements that with the `lock`ed instruction, which prevents that from happening. It could use another instruction like `mfence`, but `lock add[l]` has been benchmarked as being generally faster.
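As an aside, since Java 9 you can request these barriers directly at the Java level via `VarHandle` fences; `VarHandle.fullFence()` is the one that includes `StoreLoad`. A tiny sketch (my own example; the mapping to a `lock`ed instruction or `mfence` on x86 is what JITs typically emit, not something this code can verify):

```java
import java.lang.invoke.VarHandle;

class FenceDemo {
    static int value;

    static int writeThenRead() {
        value = 1;             // store
        VarHandle.fullFence(); // full barrier, includes StoreLoad;
                               // typically a locked instruction or
                               // mfence on x86
        return value;          // this load cannot be satisfied from a
                               // pre-fence store-buffer entry
    }
}
```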
So to recap we get:

- write to `field1`
- write to `field2` through `field5` in some arbitrary order
- write to `field6`
- a `lock`ed instruction to prevent later reads of `field1` and `field6` in instruction order from reading possibly stale value(s) from the store buffer
2) Why is the compiler omitting a `StoreLoad` after `field1` volatile assignment? Is this an optimization?
I believe it is. The `lock` is there to prevent later volatile reads of `field1` and `field6` in program order from seeing a value in the store buffer instead of the latest value in the synchronization order / memory. My understanding is that a `lock`ed instruction prevents reordering of all later reads past the `lock`. Therefore a single `lock` after `field6` suffices. A second `lock` right after `field1` wouldn't do anything helpful, because there are no reads between the writes to `field1` and `field6`, and all later reads are already prevented from seeing a stale value by the `lock` after the write to `field6`.
But has it some drawbacks? For example, another thread kicking in after the `field1` assignment might still read an old value for `field1`, even if it has actually been changed?
I don't think this is possible, because of memory consistency and cache coherence. Once the store to `field1`'s memory (not register) location commits, it will be visible to all cores.
However, it does have another drawback: `lock` is overkill. As far as I know, it acts as a barrier to almost all reordering. But according to the JSR-133 Cookbook, it's technically only needed to act as a barrier to later reads specifically of `field1` and `field6`; it would be acceptable for later normal reads of `field2` through `field5` to get values from the store buffer. Another ISA might allow for finer-grained control. For example, technically we only need to:

- make sure the write to `field1` is visible to other threads at the same time as, or before, the write to `field6`
- make sure the writes to `field2` through `field5` are visible to all threads at the same time as, or before, the write to `field6`
For example, we could write `field4` before `field1` if we really wanted to, per the memory model; search for "lock coarsening" in this excellent presentation by Aleksey Shipilёv. The idea is that:

- if another thread `t2` reads the old value of `field1`, then reads `field4`, it can expect to see either 0 (`field1` and/or `field4` not written yet) or 1 even with sequential consistency... and the Java memory model explicitly allows intra-thread actions to appear out of order to other threads as long as all the requirements are met
- if the other thread reads the new value of `field1`, it can still expect to read `field4` as either 0 or 1, because there's a data race between `t1` writing the new value of `field4` and `t2` reading `field4` (in technical terms, the happens-before closure does not impose an order between `t1`'s write to `field4` and `t2`'s read)

Therefore all four combinations are legal executions. Implementations are only required to produce a subset of all legal executions, so an implementation could always write `field4` before `field1`. That would make the possible executions of `(read field1, read field4)` by `t2` be one of `(0, 0)`, `(0, 1)`, `(1, 1)`.
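Those three outcomes can be sketched with a single-threaded simulation that samples the reader's view at each possible point relative to the (reordered) writes. This is only an illustration of which snapshots are legal, not a real two-thread race:

```java
import java.util.ArrayList;
import java.util.List;

// Snapshots a reader could take if the writer stores field4 before
// field1. Note that (1, 0) never appears among them.
class Snapshots {
    static List<int[]> legal() {
        int field1 = 0, field4 = 0;
        List<int[]> snaps = new ArrayList<>();
        snaps.add(new int[]{field1, field4}); // before any write: (0, 0)
        field4 = 1;                           // reordered write lands first
        snaps.add(new int[]{field1, field4}); // between the writes: (0, 1)
        field1 = 1;
        snaps.add(new int[]{field1, field4}); // after both writes: (1, 1)
        return snaps;
    }
}
```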
For further reading and viewing, respectively, I suggest this other great set of slides and conference presentation by Aleksey Shipilёv. Also check out the JSR-133 Cookbook, which while somewhat outdated and conservative, does a good job explaining why the various barriers are needed in terms of the JMM, what instructions are needed on various platforms, and why. Note in particular the variety of ISAs, those ISAs' memory ordering guarantees, and the variety of instructions needed.
`volatile` write semantics isn't just visibility, it's to get sequential consistency (for data-race-free programs), like C++ `foo.store(val, seq_cst)`. A plain store in x86 asm is like `foo.store(val, release)`. – Bloodstock