What JVM synchronization practices can I ignore assuming I know I will run on x64 CPUs?

I know that the JVM memory model is made for the lowest common denominator of CPUs, so it has to assume the weakest possible memory model of a CPU on which the JVM can run (e.g. ARM).

Now, considering that x64 has a fairly strong memory model, what synchronization practices can I ignore, assuming I know my program will only run on 64-bit x86 CPUs? Also, does this apply when my program is run under virtualization?

Example:
It is known that the JVM's memory model requires synchronizing read/write access to longs and doubles, but one can assume that reads/writes of 32-bit primitives like int, float, etc. are atomic.

However, if I know that I am running on a 64-bit x86 machine, can I skip using locks on longs/doubles, knowing that the CPU will atomically read/write 64-bit values, and just keep them volatile (like I would with ints/floats)?
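
For concreteness, a minimal sketch of the pattern I have in mind (class and field names are made up):

```java
// Illustrative sketch of the pattern in question (names are made up):
class Position {
    volatile long timestamp; // is volatile alone enough on x64?

    void update(long now) {
        timestamp = now;     // plain write, no lock
    }

    long read() {
        return timestamp;    // plain read, no lock
    }
}
```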

Untinged answered 23/7, 2014 at 18:33 Comment(0)

I know that the JVM memory model is made for the lowest common denominator of CPUs, so it has to assume the weakest possible memory model of a CPU on which the JVM can run (e.g. ARM).

That's not correct. The JMM resulted from a compromise among a variety of competing forces: the desire for a weaker memory model so that programs can run faster on hardware that has a weak memory model; the desire of compiler writers, who want certain optimizations to be allowed; and the desire for the results of parallel Java programs to be correct and predictable, and if possible(!) understandable to Java programmers. See Sarita Adve's CACM article for a general overview of memory model issues.

Considering that x64 has a fairly strong memory model, what synchronization practices can I ignore assuming I know my program will only run on [x64] CPUs?

None. The issue is that the memory model applies not only to the underlying hardware; it also applies to the JVM that's executing your program, and in practice mostly to the JVM's JIT compiler. The compiler might decide to apply certain optimizations that are allowed within the memory model, but if your program makes unwarranted assumptions about memory behavior based on the underlying hardware, your program will break.

You asked about x64 and atomic 64-bit writes. It may be that no word tearing will ever occur on an x64 machine. I doubt that any JIT compiler would tear a 64-bit value into 32-bit writes as an optimization, but you never know. However, it seems unlikely that you could use this feature to avoid synchronization or volatile fields in your program. Without these, writes to these variables might never become visible to other threads, or they could arbitrarily be re-ordered with respect to other writes, possibly leading to bugs in your program.
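
To illustrate word tearing, here is a hypothetical stress test; on typical x64 JVMs it may well never report anything, but the JMM permits a torn read of a non-volatile long:

```java
// Hypothetical stress test for word tearing on non-volatile longs.
// The JMM permits a torn read here; on typical x64 JVMs it may never occur.
class TearingDemo {
    static long value; // deliberately NOT volatile

    public static void main(String[] args) {
        Thread writer = new Thread(() -> {
            for (;;) {
                value = 0L;  // all bits zero
                value = -1L; // all bits one
            }
        });
        writer.setDaemon(true);
        writer.start();

        for (;;) {
            long v = value;
            if (v != 0L && v != -1L) { // a torn read mixes halves of the two values
                System.out.printf("torn read: %016x%n", v);
                return;
            }
        }
    }
}
```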

My advice is first to apply synchronization properly to get your program correct. You might be pleasantly surprised. The synchronization operations have been heavily optimized and can be very fast in the common case. If you find there are bottlenecks, consider using optimizations like lock splitting, the use of volatiles, or converting to non-blocking algorithms.

UPDATE

The OP has updated the question to be a bit more specific about using volatile instead of locks and synchronization.

It turns out that volatile provides more than memory visibility semantics. It also makes long and double access atomic, which is not the case for non-volatile variables of those types. See JLS section 17.7. You should be able to rely on volatile to provide atomicity on any hardware, not just x64.
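
As a sketch of what JLS 17.7 buys you (a volatile long is read and written atomically on any conforming JVM):

```java
// Per JLS 17.7, volatile long/double reads and writes are atomic on any
// conforming JVM, regardless of the underlying hardware.
class LastSeen {
    private volatile long last; // single reads/writes are atomic and visible

    void record(long v) { last = v; }    // atomic write
    long latest()       { return last; } // atomic read

    // Caveat: compound operations like last++ are still NOT atomic;
    // use java.util.concurrent.atomic.AtomicLong for read-modify-write.
}
```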

While I'm at it, for additional information about the Java Memory Model, see Aleksey Shipilev's JMM Pragmatics talk transcript. (Aleksey is also the JMH guy.) There's lots of detail in this talk, and some interesting exercises to test one's understanding. One overall takeaway of the talk is that it's often a mistake to rely on one's intuition about how the memory model works, e.g. in terms of cache lines or write buffers. The JMM is a formalism about memory operations and the various constraints (synchronizes-with, happens-before, etc.) that determine the ordering of those operations. This can have quite counterintuitive results. It's unwise to try to outsmart the JMM by thinking about specific hardware properties. It'll come back to bite you.

Cockerham answered 26/7, 2014 at 17:55 Comment(8)
Great answer! It might be worth noting that most of those synchronization practices are nearly free... unless they are necessary.Subtitle
@G.BlakeMeike Thanks! Yes, synchronization has been optimized considerably. I've added a note to this effect. But I never say "free" or "nearly free"... then somebody will say, "OK, can I have a billion of them?" :-)Cockerham
A JIT compiler would not tear a 64-bit value into 32-bit writes intentionally on x64 but afaik the simple 64-bit access is atomic only if the storage location is properly aligned. So for non-volatile fields a JVM might use a more relaxed alignment leading to word tearing even on x64. That would not be an optimization but rather the absence of it. Or, well, it could be part of a memory-saving strategy…Disney
Actually, I am not sure what problem the OP is getting at. If you are concerned with tearing 64bit writes to main memory, we are looking at a scenario where one core writes that value and the other one reads. If those cores share the same L1 cache, and Holger is correct, unaligned access may fail. If they are on different caches, the other core will observe only the result as it is written back to main memory (or the highest common cache level). Not sure how write-back operations work here.Matheny
@Disney Sheer speculation here: a JIT compiler might peel some iterations from a for-loop over longs and only write the low-order 32 bits since it can presume the high-order 32 bits don't change. If another thread were to write a full 64-bit value, the result might be a 64-bit value that was never actually written by any thread. I have no idea if this has any practical value, but it would seem to be legal for a JIT to do this for a non-volatile long.Cockerham
@RalfH Yeah I'm not sure what OP is after either. OP seems to want to make assumptions for Java programs based on the hardware it's running on, but it's hard for me to understand what advantage could be gained from doing that. The disadvantage is that the JIT might violate assumptions the OP is trying to make simply because it doesn't happen in hardware.Cockerham
@Stuart Marks: it could have a practical use if the optimizer is capable of utilizing SSE. Then calculating in 32-bit rather than 64-bit means doubling the number of possible parallel computations.Disney
@StuartMarks the advantage, if there is possibly one, is that one could write code that is cleaner/has fewer lines. Of course, the whole point of the post is to understand what, if any, such advantage is possible.Untinged

You would still need to handle thread safety, so volatile semantics and memory fences will still matter.

What I mean here is that, e.g. in Oracle's Java, most low-level sync operations end up in Unsafe (docjar.com/docs/api/sun/misc/Unsafe.html#getUnsafe), which in turn has a long list of native methods. So in the end, those synchronization practices and lots of other low-level operations are encapsulated by the JVM, where they belong. x64 does not use the same JVM as x86.
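
For illustration, a HotSpot-specific sketch (Java 8 era; the `theUnsafe` field name and the sun.misc.Unsafe API are not portable) of how a compare-and-swap on a long bottoms out in one of those native methods. This is roughly what java.util.concurrent.atomic.AtomicLong builds on:

```java
import java.lang.reflect.Field;
import sun.misc.Unsafe;

// HotSpot-specific sketch (Java 8 era): a compare-and-swap on a long field
// via sun.misc.Unsafe, the native layer that AtomicLong builds on.
class CasSketch {
    volatile long value;

    public static void main(String[] args) throws Exception {
        Field f = Unsafe.class.getDeclaredField("theUnsafe"); // HotSpot's singleton field
        f.setAccessible(true);
        Unsafe unsafe = (Unsafe) f.get(null);

        CasSketch s = new CasSketch();
        long offset = unsafe.objectFieldOffset(CasSketch.class.getDeclaredField("value"));
        boolean swapped = unsafe.compareAndSwapLong(s, offset, 0L, 42L); // native method
        System.out.println(swapped + " -> " + s.value); // true -> 42
    }
}
```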

After reading your edited question again: the atomicity of your load/store operations is the topic here. So no, you don't have to worry about atomic 64-bit loads/stores on x64. But since this is not the end of all sync issues, see the other answers.

Matheny answered 23/7, 2014 at 18:38 Comment(0)

Always include the memory barriers where the JVM memory model states that they are needed and then let the JVM optimize them when it can for different platforms.

Knowing that you run only on x86 CPUs does not mean that you can drop memory barriers. Unless perhaps you know that you will only run on single-core x86 CPUs ;) Which, in today's multi-core world, nobody really knows.

Why? Because the Java memory model has two main concerns:

  1. visibility of data between cores, and
  2. happens-before guarantees, a.k.a. reordering.

Without a memory barrier in play, the order in which operations become visible to other cores can be very confusing, and that is even with the stronger guarantees offered by x86. x86 only ensures consistency once the data makes it to the CPU caches, and while its ordering guarantees are very strong, they only kick in once HotSpot has told the CPU to write out to the cache.
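
A sketch of why even x86's strong ordering doesn't save you: x86 permits a store to be delayed in the store buffer past a later load, so without volatile both threads below can read 0, a result that looks impossible from reading the code:

```java
// Store buffering: even under x86-TSO, without volatile both r1 and r2
// may end up 0. A sketch, not a reliable reproducer (needs many iterations).
class StoreLoadDemo {
    static int x, y, r1, r2; // NOT volatile: no StoreLoad barrier is emitted

    public static void main(String[] args) throws InterruptedException {
        Thread t1 = new Thread(() -> { x = 1; r1 = y; });
        Thread t2 = new Thread(() -> { y = 1; r2 = x; });
        t1.start(); t2.start();
        t1.join();  t2.join();
        System.out.println("r1=" + r1 + " r2=" + r2); // (0, 0) is a legal outcome
    }
}
```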

Without volatile/synchronized, it is up to the compilers (javac and HotSpot) to decide when they do those writes and in what order. It is perfectly valid for them to keep data in registers for extended periods. Once a volatile or synchronized memory barrier is crossed, the JVM knows to tell the CPU to send the data out to the cache.
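
For example, a sketch of the classic hoisting bug this permits:

```java
// Classic visibility bug: the JIT may hoist the read of `running` into a
// register, turning the loop into while (true). Declaring it volatile fixes it.
class StopFlag {
    static boolean running = true; // fix: make this volatile

    public static void main(String[] args) throws InterruptedException {
        Thread worker = new Thread(() -> {
            while (running) { }             // may never observe the update
            System.out.println("stopped");
        });
        worker.start();
        Thread.sleep(100);
        running = false;                    // write may stay invisible to worker
        worker.join(1000);
        System.out.println("worker still alive: " + worker.isAlive());
    }
}
```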

As Doug Lea documents in the JSR-133 Cookbook, most of the x86 barriers are reduced to no-op instructions, since the hardware already guarantees the ordering. Thus the JVM will make the instructions as efficient as possible for us. Code to the Java Memory Model, and let HotSpot work its magic. If HotSpot can prove that synchronized is not required, it can drop it entirely.

Lastly, the double-checked locking pattern was proven to be broken on multi-core x86 too, despite its stronger memory guarantees. Some nice detail of this was written by Bartosz Milewski on his C++ blog, and again, this time specific to Java, here.
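
For reference, a sketch of the pattern and its accepted fix (volatile on the field, Java 5+):

```java
// Double-checked locking: broken without volatile even on x86, because the
// reference can become visible before the object's fields are fully written.
class Singleton {
    private static volatile Singleton instance; // volatile is the Java 5+ fix

    static Singleton getInstance() {
        Singleton local = instance;        // first check, no lock
        if (local == null) {
            synchronized (Singleton.class) {
                local = instance;          // second check, under the lock
                if (local == null) {
                    instance = local = new Singleton();
                }
            }
        }
        return local;
    }
}
```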

Phio answered 1/8, 2014 at 8:51 Comment(0)

Compiler writers have taken care of what you want to do. Many of the volatile read/write barriers eventually become no-ops on x64. Also note that reordering may be induced by compiler optimizations and need not depend on the hardware at all. Benign data races are an example, e.g. String's hashCode. See: http://jeremymanson.blogspot.com/2008/12/benign-data-races-in-java.html
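
A sketch of that String.hashCode idiom (simplified from the JDK): the race is benign because every thread computes the same value and int writes are atomic:

```java
// Benign data race, simplified from String.hashCode: threads may race on the
// cache field, but all compute the same value, and int writes are atomic.
class RacyHash {
    private final byte[] data;
    private int hash; // NOT volatile; racy reads/writes are harmless here

    RacyHash(byte[] data) { this.data = data; }

    @Override
    public int hashCode() {
        int h = hash;                       // racy read of the cache
        if (h == 0) {                       // 0 means "not yet computed"
            for (byte b : data) h = 31 * h + b;
            hash = h;                       // racy write; result is idempotent
        }
        return h;
    }
}
```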

Also, please refer to the JSR-133 Cookbook for which instructions may be no-ops on x64: http://gee.cs.oswego.edu/dl/jmm/cookbook.html (see the Multiprocessors section).

I would advise against doing any hardware-specific optimizations. You may end up writing unmaintainable code. Compiler writers have already put in sufficient hard work.

Rave answered 23/7, 2014 at 22:48 Comment(2)
Actually, reordering is done in the CPU after decoding the instructions from memory and building µops. It may be supported by proper compiler optimizations, though. These have to take into consideration the number of registers available, certain instruction timings and latencies, and lots of other CPU-specific information.Matheny
Please see 3rd point Control Flow Optimizations publib.boulder.ibm.com/infocenter/javasdk/v1r4m2/…Rave

It depends not only on the CPU, but also on the JVM, the operating system, etc.

One thing you can be sure of: don't assume anything when it comes to thread synchronization.

Narda answered 30/7, 2014 at 6:56 Comment(0)
