I torched my earlier attempt to answer this because significant parts were wrong. Thanks to Peter Cordes for sending me to one resource that ballooned into a fascinating journey.
Overall answer: the order of the instructions and the single `lock` come from a combination of requirements imposed by the Java memory model, the memory ordering of x86, memory coherence, and optimizations. Different platforms (e.g. ARM instead of x86) have different memory ordering and instructions, and thus might need different instructions in different locations. The JSR-133 Cookbook, admittedly known for being quite conservative and also outdated, has some good discussion of the four {Load,Store}{Load,Store} barrier combinations, why they exist, and on which platforms they're needed.
1) Does what I have written so far make sense? Or am I totally misinterpreting something?
What you've written is all true, except for the `StoreLoad` making the writes visible to other threads. Counterintuitively, it turns out the `StoreLoad` is there to ensure the latest writes from other threads are seen by later loads in the current thread.
What I'll add is that Java's memory model, in excruciating graph-theoretic detail, defines constraints that must be satisfied for the results of a program to be deemed "correct." It's a tough read, with a lot of statements that sound like one thing but really mean something else (I'm looking at you, 'happens-before'!). Implementations of Java, in this case the instructions from a compiler plus hardware guarantees, must enforce those constraints, and are otherwise free to do whatever they want.
Reads and writes to volatiles are examples of what the model defines to be synchronization actions. A valid execution must have the following properties:

- there is a single total order over all synchronization actions across all threads, called the synchronization order (so)
- all synchronization actions of a given thread must occur in program order in the so (called so-po consistency)
- all reads of a volatile variable `v` must read the latest value written to `v` in the so
There's another set of rules called happens-before. Among other things:

- on a thread `t`, if action `x` occurs before action `y` in program order, then `x` happens-before `y`; the effect of `x` is visible to `y`. Despite being called "happens-before", it doesn't actually have to have happened before (e.g. reordering is allowed); it's just that the results have to be equivalent to it having happened before
- if `x` is a synchronization action that is observed by another synchronization action `y` (possibly on a different thread), then `x` happens-before `y`
- and very importantly, if `x` happens-before `y` and `y` happens-before `z`, then the common-sense `x` happens-before `z` is true.

The last part is more fancily stated as "happens-before is the transitive closure of program order and synchronizes-with."
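Transitivity is what makes the classic publish-via-volatile-flag pattern work. Here's a small sketch (my own illustration, not code from the question; the class and field names are made up):

```java
// Safe publication through a volatile flag, relying on transitivity:
// the plain write (1) happens-before the volatile write (2) by program
// order, and (2) happens-before any read that observes ready == true
// via synchronizes-with, so such a read is guaranteed to see data == 42.
class Publisher {
    int data;               // plain field
    volatile boolean ready; // volatile flag

    void publish() {
        data = 42;    // (1)
        ready = true; // (2)
    }

    int consume() {
        return ready ? data : -1; // -1 means "not published yet"
    }
}
```

Run from another thread, `consume()` can only return -1 or 42, never a half-published 0, because observing `ready == true` pulls the `data` write into view through the happens-before chain.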
So in the example with `field1` and `field6` volatile, and the rest non-volatile:

- `field1` must be written to memory before `field6`: so-po consistency
- `field2` through `field5` must be written to memory before `field6`: these writes happen-before the write to `field6`, any other thread that reads the new value of `field6` will synchronize-with that write, and thus the writes to `field2` through `field5` must be visible to all later operations
- any later read of the volatile `field1` or `field6` on the current thread must read the latest value, because 1. the synchronization order is a total order across all threads and 2. a volatile read of a variable `v` must read the most recent volatile write to `v`
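To make this concrete, here's a minimal sketch of the kind of class under discussion (the original class isn't reproduced in this answer; the field names and volatile markings are taken from the question, the values are placeholders):

```java
// Hypothetical reconstruction: field1 and field6 are volatile,
// field2 through field5 are plain fields, matching the question.
class Example {
    volatile int field1;
    int field2, field3, field4, field5;
    volatile int field6;

    void writer() {
        field1 = 1; // volatile write: a synchronization action
        field2 = 1; // plain writes: independent, may be reordered
        field3 = 1; //   among themselves by the compiler
        field4 = 1;
        field5 = 1;
        field6 = 1; // volatile write: must be the last store
    }
}
```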
These requirements translate to the given x86 code as follows:

- we must write `field1` before `field6` to satisfy the so-po requirement; x86 does not reorder store instructions (thanks Peter!), so no extra instruction is necessary to ensure this. Another ISA may require special instructions, though
- while the write to `field2` happens-before the write to `field3` and so on (program order), the writes are independent. Any other order produces equivalent intra-thread results, so the compiler is free to reorder them (the memory model would call this "equivalent to intra-thread semantics"). Apparently that wonky order is more optimal than writing 2..5 in order
- if another thread `t2` reads thread `t1`'s write to `field6`, then the write to `field6` synchronizes-with and happens-before the read. Happens-before is transitive: the preceding writes to `field2` through `field5` must happen-before the write to `field6`, and that write happens-before the later read of `field6`, so the `field2` through `field6` writes must all happen-before the read. All those writes in `t1` must be visible to `t2`. Therefore all the writes to `field2` through `field5` must be done before writing to `field6`, and this is why `field6` must come last. Again, x86's total store order means the write to `field6` really will come last because it is the last store instruction
- and finally, we need to make sure later reads of `field1` and `field6` read the latest value in the synchronization order over all threads. Without a barrier, a later read of `field1` or `field6` could see a write to `field1` or `field6` sitting in the core's store buffer instead of the latest value in the global order. Therefore later reads of `field1` and `field6` must read from memory instead of the store buffer. The compiler implements that with the `lock`ed instruction, which prevents that from happening. It could use another instruction like `mfence`, but `lock add[l]` has been benchmarked as being generally faster.
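As an aside, since Java 9 you can request these barriers directly at the Java level via `VarHandle` fences; `VarHandle.fullFence()` is the one that includes `StoreLoad`. A tiny sketch (my own example; the mapping to a `lock`ed instruction or `mfence` on x86 is what JITs typically emit, not something this code can verify):

```java
import java.lang.invoke.VarHandle;

class FenceDemo {
    static int value;

    static int writeThenRead() {
        value = 1;             // store
        VarHandle.fullFence(); // full barrier, includes StoreLoad;
                               // typically a locked instruction or
                               // mfence on x86
        return value;          // this load cannot be satisfied from a
                               // pre-fence store-buffer entry
    }
}
```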
So to recap we get:

- write to `field1`
- write to `field2` through `field5` in some arbitrary order
- write to `field6`
- a `lock`ed instruction to prevent later reads of `field1` and `field6` in instruction order from reading possibly stale value(s) from the store buffer
2) Why is the compiler omitting a `StoreLoad` after `field1` volatile assignment? Is this an optimization?
I believe it is. The `lock` is there to prevent later volatile reads of `field1` and `field6` in program order from seeing a value in the store buffer instead of the latest value in the synchronization order / memory. My understanding is that a `lock`ed instruction prevents reordering of all later reads past the `lock`. Therefore a single `lock` after `field6` suffices. A second `lock` right after `field1` wouldn't do anything helpful, because there are no reads between the writes to `field1` and `field6`, and all later reads are already prevented from seeing a stale value by the `lock` after the write to `field6`.
But has it some drawbacks? For example, another thread kicking in after the `field1` assignment might still read an old value for `field1`, even if it has actually been changed?
I don't think this is possible, because of memory consistency and cache coherence. Once the store to `field1`'s memory (not register) location commits, it will be visible to all cores.
However, it does have another drawback: `lock` is overkill. As far as I know, it acts as a barrier to almost all reordering. But according to the JSR-133 Cookbook, it's technically only needed to act as a barrier to later reads specifically of `field1` and `field6`; it would be acceptable for later normal reads of `field2` through `field5` to get values from the store buffer. Another ISA might allow for finer-grained control. For example, technically we only need to:

- make sure the write to `field1` is visible to other threads at the same time as, or before, the write to `field6`
- make sure the writes to `field2` through `field5` are visible to all threads at the same time as, or before, the write to `field6`
For example, we could write `field4` before `field1` if we really wanted to, per the memory model; search for "lock coarsening" in this excellent presentation by Aleksey Shipilёv. The idea is that:

- if another thread `t2` reads the old value of `field1`, then reads `field4`, it can expect to see either 0 (`field1` and/or `field4` not written yet) or 1 even with sequential consistency... and the Java memory model explicitly allows intra-thread actions to appear out of order to other threads as long as all the requirements are met
- if the other thread reads the new value of `field1`, it can still expect to read `field4` as either 0 or 1, because there's a data race between `t1` writing the new value of `field4` and `t2` reading `field4` (in technical terms, the happens-before closure does not impose an order between `t1`'s write to `field4` and `t2`'s read)

Therefore all four combinations are legal executions. Implementations are only required to produce a subset of all legal executions, so an implementation could always write `field4` before `field1`. That would make the possible executions of `(read field1, read field4)` by `t2` be one of `(0, 0)`, `(0, 1)`, `(1, 1)`.
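Those three outcomes can be sketched with a single-threaded simulation that samples the reader's view at each possible point relative to the (reordered) writes. This is only an illustration of which snapshots are legal, not a real two-thread race:

```java
import java.util.ArrayList;
import java.util.List;

// Snapshots a reader could take if the writer stores field4 before
// field1. Note that (1, 0) never appears among them.
class Snapshots {
    static List<int[]> legal() {
        int field1 = 0, field4 = 0;
        List<int[]> snaps = new ArrayList<>();
        snaps.add(new int[]{field1, field4}); // before any write: (0, 0)
        field4 = 1;                           // reordered write lands first
        snaps.add(new int[]{field1, field4}); // between the writes: (0, 1)
        field1 = 1;
        snaps.add(new int[]{field1, field4}); // after both writes: (1, 1)
        return snaps;
    }
}
```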
For further reading and viewing, respectively, I suggest this other great set of slides and conference presentation by Aleksey Shipilёv. Also check out the JSR-133 Cookbook, which while somewhat outdated and conservative, does a good job explaining why the various barriers are needed in terms of the JMM, what instructions are needed on various platforms, and why. Note in particular the variety of ISAs, those ISAs' memory ordering guarantees, and the variety of instructions needed.
`volatile` write semantics isn't just visibility, it's to get sequential consistency (for data-race-free programs), like C++ `foo.store(val, seq_cst)`. A plain store in x86 asm is like `foo.store(val, release)`. – Bloodstock