Volatile variable and flushes to/reads from main memory

The official notes say that

Writing to a volatile field has the same memory effect as a monitor release, and reading from a volatile field has the same memory effect as a monitor acquire.

and

Effectively, the semantics of volatile have been strengthened substantially, almost to the level of synchronization. Each read or write of a volatile field acts like "half" a synchronization, for purposes of visibility.

from here.

Does that mean that any write to a volatile variable makes the executing thread flush its cache to main memory, and that every read from a volatile field makes the thread re-read its variables from main memory?

I am asking because the very same text contains this statement:

Important Note: Note that it is important for both threads to access the same volatile variable in order to properly set up the happens-before relationship. It is not the case that everything visible to thread A when it writes volatile field f becomes visible to thread B after it reads volatile field g. The release and acquire have to "match" (i.e., be performed on the same volatile field) to have the right semantics.

And this statement makes me very confused. I know for sure that it's not true for regular lock acquire and release with a synchronized statement: if some thread releases any monitor, then all changes it made become visible to all other threads (update: actually not true, see the accepted answer). There was even a question about it on Stack Overflow. Yet it is stated that, for whatever reason, this is not the case for volatile fields. I can't imagine an implementation of the happens-before guarantee that would hide changes from threads that don't read the same volatile variable, at least not one that doesn't contradict the first two quotes.
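
To make it concrete, here is a minimal sketch of the situation the note describes (the class and field names are mine):

public class Matching {
    static int data = 0;                  // plain, non-volatile field
    static volatile boolean f = false;
    static volatile boolean g = false;

    // thread A
    static void writer() {
        data = 42;  // plain write
        f = true;   // volatile write, the "release" half
    }

    // thread B: reads the SAME volatile field f
    static void readerF() {
        if (f) {
            System.out.println(data);  // guaranteed to print 42
        }
    }

    // thread C: reads a DIFFERENT volatile field g
    // (suppose some other thread has set g to true)
    static void readerG() {
        if (g) {
            System.out.println(data);  // per the note, 42 is NOT guaranteed here
        }
    }
}

If I read the note correctly, only thread B is guaranteed to see data == 42, and that is exactly what confuses me.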

Moreover, before posting this question I did some research, and there is, for example, this article, which contains this sentence:

After executing these instructions, all writes are visible to all other threads through cache subsystem or main memory.

The instructions mentioned are the ones executed when a write to a volatile field is made.

So what is that important note supposed to mean? Am I missing something? Or is the note just plain wrong?

Answer?

After doing some more research, I was only able to find this statement in the official documentation about volatile fields and their effect on changes to non-volatile fields:

Using volatile variables reduces the risk of memory consistency errors, because any write to a volatile variable establishes a happens-before relationship with subsequent reads of that same variable. This means that changes to a volatile variable are always visible to other threads. What's more, it also means that when a thread reads a volatile variable, it sees not just the latest change to the volatile, but also the side effects of the code that led up to the change.

from here.

I don't know if that is enough to conclude that the happens-before relation is guaranteed only for threads reading the same volatile. So for now I can only say that the results are inconclusive.

But in practice I would recommend assuming that the changes made by thread A before it writes to a volatile field are guaranteed to be visible to thread B only if thread B reads the same volatile field. The above quote from the official source strongly implies that.
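
In code terms, the only pattern I would treat as safe is the one where the reader and the writer use the same volatile field (a minimal sketch, names are mine):

public class SafePublication {
    static int payload;             // plain field
    static volatile boolean ready;  // the shared volatile

    // thread A
    static void publish() {
        payload = 42;  // plain write, made before the volatile write
        ready = true;  // volatile write ("release")
    }

    // thread B
    static void consume() {
        if (ready) {                      // volatile read of the SAME field ("acquire")
            System.out.println(payload);  // guaranteed to see 42
        }
    }
}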

Candracandy answered 5/8, 2018 at 10:26 Comment(5)
What is guaranteed if two threads synchronize on two different mutexes?Agglutinogen
After we exit a synchronized block, we release the monitor, which has the effect of flushing the cache to main memory, so that writes made by this thread can be visible to other threads. Same link. So this guarantees that after a thread releases a lock, all other threads will see all changes made by that thread. ALL changes, not just the ones made inside the lock.Candracandy
I'm pretty sure nothing is "flushed" outside the CPU cache in 99% of implementations.Agglutinogen
OK, what am I even quoting? This link is being quoted all over the site, but is it an official JMM description, or a description made by some smart yet random people?Candracandy
@NikKotovski You're quoting something totally valid. Bill Pugh was the lead writer for the JMM re-design in 2004 for Java 5. It was written at a time when cache flushing to main memory was still a thing. curiousguy isn't wrong when saying most implementations no longer flush straight to memory. You can read up on cache coherence - a volatile store may simply notify the other CPUs of an update and share the writes with other CPUs, in which case it may never make it directly to memory if the field updates again.Teleutospore

You are looking at this from an entirely wrong angle. First you are quoting the JLS, and then talking about a flush, which would be an implementation detail of that specification. The only thing you need to rely on is the JLS; anything else may be good to know, but it does not prove the specification right or wrong in any shape or form.

And the fundamental place where you are wrong is this:

I know for sure that it's not true for regular lock acquire...

In practice, on x86, you might be right, but the JLS and the official Oracle tutorial mandate that:

When a thread releases an intrinsic lock, a happens-before relationship is established between that action and any subsequent acquisition of the same lock.

Happens-before is established between subsequent actions (if you like, think of it as two actions). One thread releases the lock and the other acquires it; these are subsequent actions (release-acquire semantics).

The same thing happens for a volatile: some thread writes to it, and when some other thread observes that write via a subsequent read, happens-before is established.
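
A minimal sketch (class and names are mine) of what "subsequent actions on the same lock" means in code:

public class Counter {
    private static final Object lock = new Object();
    private static int count = 0;  // plain field, guarded by lock

    static void increment() {
        synchronized (lock) {  // acquire
            count++;
        }                      // release
    }

    static int current() {
        synchronized (lock) {  // acquiring the SAME lock: the release at the end
            return count;      // of increment() happens-before this acquire
        }
    }
}

If current() synchronized on some other object, no happens-before would be established with increment() - exactly the same situation as reading a different volatile field.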

Fullblooded answered 12/8, 2018 at 12:34 Comment(6)
I agree, we should definitely not rely on such low-level implementation details.Wille
Well, thanks for your answer. That quote is a piece of common knowledge, and such behavior is stated everywhere, even in a lot of guides, so I never even thought about checking the specs. Thanks again.Candracandy
Also, your quote from the specification strongly implies that volatile works just the same: changes become visible to everyone upon a volatile write. Not 100% accurate, but very likely. That also means volatile reads and writes INDEED likely work as a half of synchronization.Candracandy
Even on x86, you cannot rely on the global nature of a thread leaving a synchronized block. While a memory barrier/cache flush is globally visible, relying on it without acquiring the lock may simply fail because of a) timing: the other thread hasn't passed the barrier yet, b) lock elimination: the JVM has detected that no other thread will ever synchronize on that object, e.g. after Escape Analysis, and hence removed the memory barrier completely, or c) some other not-so-obvious reason.Curbing
@Curbing But on x86 we can rely on the ordering. So if one thread loads, then stores, it is impossible for another thread to observe these operations in a different order (even without fences or lock-prefixed instructions)Wille
@Wille you are assuming that there has to be a load followed by a store on the CPU level, but this is not the case. A JVM's optimizer is free to rearrange the code as long as the JVM specification is fulfilled, regardless of the CPU architecture. So if you're not synchronizing on the same object or accessing the same volatile variable, there is no guarantee. The already mentioned lock elimination is a practical example of how a failure to synchronize on the right object can void any assumption about ordering. It's not about the lock prefix; the access may not happen at all…Curbing

Does that mean that any write to a volatile variable makes the executing thread flush its cache to main memory, and that every read from a volatile field makes the thread re-read its variables from main memory?

No, it does not mean that, and it is a common mistake to think that way. All it means is what is specified in the Java Memory Model.

On Intel CPUs there are instructions to flush a cache line, clflush and clflushopt, and it would be extremely inefficient to perform that kind of flush every time a volatile write occurs.

To provide an example, let's take a look at how volatile variables are implemented (in this example) by

Java(TM) SE Runtime Environment (build 1.8.0_171-b11)
Java HotSpot(TM) 64-Bit Server VM (build 25.171-b11, mixed mode)

on my Haswell CPU. Let's write this simple example:

package com.test;

public class Volatlee {
    public static volatile long a = 0;

    public static void main(String[] args) {
        Thread t1 = new Thread(() -> {
            while (true) {
                // contrived check, just to keep the read of "a" from
                // being removed by dead-code elimination (DCE)
                if (String.valueOf(String.valueOf(a).hashCode())
                        .equals(String.valueOf(System.nanoTime()))) {
                    System.out.print(a);
                }
            }
        });

        Thread t2 = new Thread(() -> {
            while (true) {
                inc();
            }
        });

        t1.start();
        t2.start();
    }

    public static void inc() {
        a++;
    }
}

I disabled tiered compilation and ran it with the C2 compiler as follows:

java -server -XX:-TieredCompilation -XX:+UnlockDiagnosticVMOptions -XX:CompileCommand=print,*Volatile.inc -jar target/test-0.0.1.jar

The output is the following:

  # {method} {0x00007f87d87c6620} 'inc' '()V' in 'com/test/Volatlee'
  #           [sp+0x20]  (sp of caller)
  0x00007f87d1085860: sub     $0x18,%rsp
  0x00007f87d1085867: mov     %rbp,0x10(%rsp)   ;*synchronization entry
                                                ; - com.test.Volatlee::inc@-1 (line 26)

  0x00007f87d108586c: movabs  $0x7191fab68,%r10  ;   {oop(a 'java/lang/Class' = 'com/test/Volatlee')}
  0x00007f87d1085876: mov     0x68(%r10),%r11
  0x00007f87d108587a: add     $0x1,%r11
  0x00007f87d108587e: mov     %r11,0x68(%r10)
  0x00007f87d1085882: lock addl $0x0,(%rsp)     ;*putstatic a
                                                ; - com.test.Volatlee::inc@5 (line 26)

  0x00007f87d1085887: add     $0x10,%rsp
  0x00007f87d108588b: pop     %rbp
  0x00007f87d108588c: test    %eax,0xca8376e(%rip)  ;   {poll_return}
  0x00007f87d1085892: retq
  ;tons of hlt omitted

So in this simple example the volatile write compiles to a lock-prefixed instruction, which requires the cache line to be in the Exclusive state to execute (probably sending a read-invalidate signal to other cores if it is not).
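
For comparison, here is a minimal sketch using AtomicLong instead of a volatile increment; on x86, HotSpot typically compiles incrementAndGet to a single lock xadd, so the read-modify-write is atomic and doubles as the full barrier (see also the comments below):

import java.util.concurrent.atomic.AtomicLong;

public class AtomicCounter {
    static final AtomicLong a = new AtomicLong();

    public static void inc() {
        // typically a single "lock xadd" on x86: an atomic increment with
        // full-barrier semantics in one instruction, instead of a plain
        // load/add/store followed by a dummy locked add to the stack
        a.incrementAndGet();
    }
}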

Wille answered 5/8, 2018 at 12:29 Comment(27)
Could you explain the last part in more detail? Not everyone (including me) is proficient in reading disassembly listings. Also, one additional question: will other threads see the changes made by t1 and t2?Candracandy
@NikKotovski The last part is the C2 version of incrementing the volatile variable. mov 0x68(%r10),%r11 reads the field a into the register r11; now r11 contains the actual value of the field a. add $0x1,%r11 is the increment operation that you can see in the code. mov %r11,0x68(%r10) puts the incremented value back into main memory (actually the cache, transferring the line into the Exclusive state if it was not already and sending a read-invalidate signal). lock addl $0x0,(%rsp) actually adds 0x0; it acts as a write barrier, but is maybe more efficient...Wille
@NikKotovski add $0x10,%rsp, pop %rbp is the standard assembly boilerplate restoring the stack to its state before the function call.Wille
So does putting the line into the Exclusive state and sending a read-invalidate signal affect all threads, not just t1 and t2? I presume these actions make t1 re-read the value instead of using the one stored in its cache. But if we had another thread that had access to a, would it have received the very same signal for re-evaluation?Candracandy
@NikKotovski Yes, everyone who has a in its cache receives the read-invalidate signal. So the writer will have it in the Modified/Exclusive state and the others will have it in the Invalid state. Minor: I would not say "all threads", but "all cores/CPUs".Wille
@Wille That's a very good answer. I'm removing mine. Yours is better.Peirsen
Yes, thank you very much for your answer @Wille. Just one more question: so, implementation-wise, only threads that read a volatile are guaranteed to see the changes to non-volatiles located before that volatile write in the code, right?Candracandy
@NikKotovski Are you talking about reordering volatile/non-volatile writes and reads? Then I would say no, because lock has weaker "fencing guarantees" than mfence. In particular, I read this and did not find whether it does any kind of serialization (related to the CPU store buffer, for example), whereas that is explicitly specified for mfence.Wille
@NikKotovski Actually a bit of googling gave me that lock performs a bit better than mfence. Take a look at this. But as I already mentioned, mfence serializes; lock probably does not...Wille
@Wille I mean visibility of non-volatiles. Reads/writes to non-volatiles located before a volatile write are guaranteed to be seen by another thread that is using that volatile. But will other threads, which don't access that volatile, see the reads/writes to those non-volatiles when some thread writes to that volatile? That's what my question is about.Candracandy
@Wille I suspect that these changes HAVE to be made visible to every thread, because at any time a new thread with access to the volatile field might get started, and the new thread needs to see all the changes too.Candracandy
@NikKotovski Ah, now I see what you mean. That's a good question actually. And I think the answer is no, because the non-volatiles may be stored in different cache lines. Let's read the Intel manual: In a multiprocessor environment, the LOCK# signal ensures that the processor has exclusive use of any shared memory while the signal is asserted. and The LOCK prefix can be prepended only to the following instructions and only to those forms of the instructions where the destination operand is a memory operand. So it relates to the cache lines containing the memory referred to in the instruction.Wille
@NikKotovski We are talking about the generated assembly code, right?Wille
Let us continue this discussion in chat.Wille
@Wille very good answer overall; nitpick: -server is useless... by the way, you also might want to specify -XX:CICompilerCount=1Fullblooded
@Wille I admit I very rarely look at (and understand) the assembly, and this from you was very nice... I've tried to answer too...Fullblooded
@Fullblooded What's still not clear to me is that it compiles to a locked instruction on the stack pointer, which is not related to the field the volatile write occurred to. AFAIK, on Intel CPUs this lock addl $0x0, (%rsp) is useless from the memory visibility standpoint in this specific case.Wille
@Wille not very sure I follow you here; lock addl is a replacement for a StoreLoad barrier, and the JVM will insert these either after the volatile store or before the volatile load... no idea if this answers your concerns though. If not, can you be a little more explicit please - this may be interesting (probably). thank you!Fullblooded
@Fullblooded I meant that x86 already has strong memory ordering guarantees. In particular, it is impossible for a store (mov %r11,0x68(%r10) in our case) to be reordered with an earlier load (mov 0x68(%r10),%r11 in our case). Check out the Intel manual, example 8.2.3.3. This is the reason the locked instruction on the stack pointer is not clear to me in this particular case.Wille
@Fullblooded Btw, the volatile read in this example compiles to no locked instructions or mfences. And this is clear, since x86 already prevents that reordering.Wille
@Wille AFAIK that would be a valid strategy on x86, but that is a choice the JIT makes - emitting the StoreLoad after each volatile storeFullblooded
@Fullblooded Do you mean LoadStore, not StoreLoad? It would make sense if the lock appeared in between the load and the store. I just consulted the cookbook, and there was an example similar to the one I used in the answer.Wille
@Fullblooded Ah.. sorry, I misunderstood what you were talking about the first time.Wille
@Eugene: yup, the JVM is using a locked instruction (accessing the stack) as a full barrier cheaper than mfence, to make the mov-store into a sequential-release. In C++ terminology, it promotes the operation from acq_rel to seq_cst, given the standard mapping from ordering to x86 barriers (with seq_cst draining the store buffer after stores, instead of before loads because cheap loads are more important). This answer implies that the lock applies to the cache line holding the volatile but that's not the case.Shu
All stores require the cache line to be in the Exclusive state to commit; it's the store buffer that the JVM is worried about here (for StoreLoad ordering). If it wanted to hold exclusive ownership of the cache line containing a for the whole operation, it would use lock addl $1, 0x68(%r11). In fact, it might be more efficient to do that anyway instead of a non-atomic increment and then a barrier. Or at least do the seq-cst store with xchg instead of mov + a dummy lock add to the stack. Or at least do the non-atomic increment with a memory-destination add (without lock), saving a uop vs mov/add/mov.Shu
@PeterCordes Since Reads or writes cannot be reordered with I/O instructions, locked instructions, or serializing instructions. and For the P6 family processors, locked operations serialize all outstanding load and store operations (that is, wait for them to complete), is there any case where a fake lock add [rsp], 0 cannot be used instead of mfence? In the JVM the locked add is used for performance reasons (as stated in the comment)Wille
@St.Antario: No cases worth caring about. Maybe some corner cases for drivers that do weakly-ordered NT loads from WC memory (e.g. video RAM), on CPUs with bugs. (If that is a bug, instead of an intentional difference between the on-paper guarantees of lock and mfence.) Does lock xchg have the same behavior as mfence? has more about the real-world situation vs. the docs and published errata, and why mfence is so slow, especially on SKL. If you don't use WC memory (i.e. in every normal program, including a JVM), they're equivalent.Shu
