Java 8 Unsafe: xxxFence() instructions
Asked Answered
H

4

52

In Java 8 three memory barrier instructions were added to Unsafe class (source):

/**
 * Ensures lack of reordering of loads before the fence
 * with loads or stores after the fence.
 */
void loadFence();

/**
 * Ensures lack of reordering of stores before the fence
 * with loads or stores after the fence.
 */
void storeFence();

/**
 * Ensures lack of reordering of loads or stores before the fence
 * with loads or stores after the fence.
 */
void fullFence();

If we define memory barrier with the following way (which I consider more or less easy to understand):

Consider X and Y to be operation types/classes that are subject for reordering,

X_YFence() is a memory barrier instruction that ensures that all operations of type X before the barrier completed before any operation of type Y after the barrier is started.

We can now "map" barrier names from Unsafe to this terminology:

  • loadFence() becomes load_loadstoreFence();
  • storeFence() becomes store_loadStoreFence();
  • fullFence() becomes loadstore_loadstoreFence();

Finally, my question is - why don't we have load_storeFence(), store_loadFence(), store_storeFence() and load_loadFence()?

My guess would be - they are not really neccesary, but I do not understand why at the moment. So, I'd like to know reasons for not adding them. Guesses about that are welcome too (hope this doesn't cause this question to be offtopic as opinion-based, though).

Thanks in advance.

Hernandez answered 12/5, 2014 at 7:28 Comment(3)
Not a c++ specialist but the unsafe instructions probably just map what is available in c++ which probably just maps what is available in standard assembly.Bagpipes
@Bagpipes Not ready to prove that, but I think c++ instructions can be quite flexible and allow different types of barriers.. Following the common logic, it should easier to forbid only loads, then both loads and stores..Hernandez
In the end it is linked to what instructions are available at the processor level. See for example: gee.cs.oswego.edu/dl/jmm/cookbook.htmlBagpipes
T
65

Summary

CPU cores have special memory ordering buffers to assist them with out-of-order execution. These can be (and typically are) separate for loading and storing: LOBs for load-order buffers and SOBs for store-order buffers.

The fencing operations chosen for the Unsafe API are were selected based on the following assumption: underlying processors will have separate load-order buffers (for reordering loads), store-order buffers (for reordering stores).

Therefore, based on this assumption, from a software point of view, you can request one of three things from the CPU:

  1. Empty the LOBs (loadFence): means that no other instructions will start executing on this core, until ALL entries the LOBs have been processed. In x86 this is an LFENCE.
  2. Empty the SOBs (storeFence): means that no other instructions will start executing on this core, until ALL entries in the SOBs have been processed. In x86 this is an SFENCE.
  3. Empty both LOBs and SOBs(fullFence): means both of the above. In x86 this is an MFENCE.

In reality, each specific processor architecture provides different memory ordering guarantees, which may be more stringent, or more flexible than the above. For example, SPARC architecture can reorder load-store and store-load sequences, whereas x86 will not do that. Furthermore, architectures exist where LOBs and SOBs cannot be controlled individually (i.e. only full-fence is possible). In both cases however:

  • when the architecture is more flexible, the API simply does not provide access to the "laxer" sequencing combinations as a matter of choise

  • when the architecture is more stringent, the API simply implements the more stringent sequencing guarantee in all cases (e.g. all 3 calls actually and up being implemented as a full fence)

The reason for the particular API choices is explained in the JEP as per the answer assylias provides which is 100% on-the-spot. If you know about memory ordering and cache coherence, assylias' answer should suffice. I think the fact that they match the standardized instruction in the C++ API was a major factor (simplifies JVM implementation a lot): http://en.cppreference.com/w/cpp/atomic/memory_order In all likelihood, actual implementation will call into the respective C++ API instead of using some special instruction.

Below I have a detailed explanation with x86-based examples, which will provide all the context necessary to understand these things. In fact, the demarcated (section below answers another question: "Can you provide basic examples of how memory fences work to control cache coherence in the x86 architecture?"

The reason for this is that I myself (coming from a software developer and not hardware designer) had trouble understanding what memory reordering is, until I learned specific examples of how cache coherence actually works in x86. This provides invaluable context for discussing memory fences in general (for other architectures as well). At the end I discuss SPARC a bit using the knowledge gained from the x86 examples

The reference [1] is an even more detailed explanation and has a separate section for discussing each of: x86, SPARC, ARM and PowerPC, so it is an excellent read if you are interested in more details.


x86 architecture example

x86 provides 3 types of fencing instructions: LFENCE (load fence), SFENCE (store fence) and MFENCE (load-store fence), so it maps 100% to the Java API.

This is because x86 has separate load-order buffers (LOBs) and store-order buffers (SOBs), so indeed LFENCE/SFENCE instructions apply to the respective buffer, whereas MFENCE applies to both.

SOBs are used to store an outgoing value (from processor to cache system) while the cache coherence protocol works to acquire permission to write to the cache line. LOBs are used to store invalidation requests so that invalidation can execute asynchronously (reduces stalling on the receiving side in the hope that the code executing there will not actually need that value).

Out-of-order stores and SFENCE

Suppose you have a dual processor system with its two CPUs, 0 and 1, executing the routines below. Consider the case where the cache line holding failure is initially owned by CPU 1, whereas the cache line holding shutdown is initially owned by CPU 0.

// CPU 0:
void shutDownWithFailure(void)
{
  failure = 1; // must use SOB as this is owned by CPU 1
  shutdown = 1; // can execute immediately as it is owned be CPU 0
}
// CPU1:
void workLoop(void)
{
  while (shutdown == 0) { ... }
  if (failure) { ...}
}

In the absence of a store fence, CPU 0 may signal a shutdown due to failure, but CPU 1 will exit the loop and NOT got into the failure-handling if block.

This is because CPU0 will write the value 1 for failure to a store-order buffer, also sending out a cache coherence message to acquire exclusive access to the cache line. It will then proceed to the next instruction (while waiting for exclusive access) and update the shutdown flag immediately (this cache line is owned exclusively by CPU0 already so no need to negotiate with other cores). Finally, when it later receives an invalidation confirmation message from CPU1 (regarding failure) it will proceed to process the SOB for failure and write the value to the cache (but the order is by now reversed).

Inserting a storeFence() will fix things:

// CPU 0:
void shutDownWithFailure(void)
{
  failure = 1; // must use SOB as this is owned by CPU 1
  SFENCE // next instruction will execute after all SOBs are processed
  shutdown = 1; // can execute immediately as it is owned be CPU 0
}
// CPU1:
void workLoop(void)
{
  while (shutdown == 0) { ... }
  if (failure) { ...}
}

A final aspect that deserves mention is that x86 has store-forwarding: when a CPU writes a value which gets stuck in an SOB (due to cache coherence), it may subsequently attempt to execute a load instruction for the same address BEFORE the SOB is processed and delivered to the cache. CPUs will therefore consult the SOBs PRIOR to accessing the cache, so the value retrieved in this case is the last-written value from the SOB. this means that stores from THIS core can never be reordered with subsequent loads from THIS core no matter what.

Out-of-order loads and LFENCE

Now, assume you have the store fence in place and are happy that shutdown cannot overtake failure on its way to CPU 1, and focus on the other side. Even in the presence of the store fence, there are scenarios where the wrong thing happens. Consider the case where failure is in both caches (shared) whereas shutdown is only present in and owned exclusively by the cache of CPU0. Bad things can happen as follows:

  1. CPU0 writes 1 to failure; It also sends a message to CPU1 to invalidate its copy of the shared cache line as part of the cache coherence protocol.
  2. CPU0 executes the SFENCE and stalls, waiting for the SOB used for failure to commit.
  3. CPU1 checks shutdown due to the while loop and (realizing it is missing the value) sends a cache coherence message to read the value.
  4. CPU1 receives the message from CPU0 in step 1 to invalidate failure, sending an immediate acknowledgement for it. NOTE: this is implemented using the invalidation queue, so in fact it simply enters a note (allocates an entry in its LOB) to later do the invalidation, but does not actually perform it before sending out the acknowledgement.
  5. CPU0 receives the acknowledgement for failure and proceeds past the SFENCE to the next instruction
  6. CPU0 writes 1 to shutdown without using a SOB, because it already owns the cache line exclusively. no extra message for invalidation is sent as the cache line is exclusive to CPU0
  7. CPU1 receives the shutdown value and commits it to its local cache, proceeding to the next line.
  8. CPU1 checks the failure value for the if statement, but since the invalidate queue (LOB note) is not yet processed, it uses the value 0 from its local cache (does not enter if block).
  9. CPU1 processes the invalidate queue and update failure to 1, but it is already too late...

What we refer to as load order buffers, is actaully the queueing of invalidation requests, and the above can be fixed with:

// CPU 0:
void shutDownWithFailure(void)
{
  failure = 1; // must use SOB as this is owned by CPU 1
  SFENCE // next instruction will execute after all SOBs are processed
  shutdown = 1; // can execute immediately as it is owned be CPU 0
}
// CPU1:
void workLoop(void)
{
  while (shutdown == 0) { ... }
  LFENCE // next instruction will execute after all LOBs are processed
  if (failure) { ...}
}

Your question on x86

Now that you know what SOBs/LOBs do, think about the combinations you mentioned:

loadFence() becomes load_loadstoreFence();

No, a load fence waits for LOBs to be processed, essentially emptying the invalidation queue. This means that all subsequent loads will see up-to-date data (no re-ordering), as they will be fetched from the cache sub-system (which is coherent). Stores CANNNOT be reordered with subsequent loads, because they do not go through the LOB. (and furthermore store forwarding takes care of locally-modified cachce lines) From the perspective of THIS particular core (the one executing the load fence), a store that follows the load fence will execute AFTER all registers have the data loaded. There is no way around it.

load_storeFence() becomes ???

There is no need for a load_storeFence as it does not make sense. To store something you must calculate it using input. To fetch input you must execute loads. The stores will occur using the data fetched from loads. If you want to make sure you see up-to-date values from all OTHER processors when loading use a loadFence. For loads after the fence store-forwarding takes care of consistent ordering.

All other cases are similar.


SPARC

SPARC is even more flexible and can reorder stores with subsequent loads (and loads with subsequent stores). I was not as familiar with SPARC, so my GUESS was that there is no store-forwarding (SOBs are not consulted when reloading an address) so "dirty reads" are possible. In fact I was wrong: I found the SPARC architecture in [3] and the reality is that store-forwarding is threaded. From section 5.3.4:

All loads check the store buffer (same thread only) for read after write (RAW) hazards. A full RAW occurs when the dword address of the load matches that of a store in the STB and all bytes of the load are valid in the store buffer. A partial RAW occurs when the dword addresses match, but all bytes are not valid in the store buffer. (Ex., a ST (word store) followed by an LDX (dword load) to the same address results in a partial RAW, because the full dword is not in the store buffer entry.)

So, different threads consult different store-order buffers hence the possibility for dirty reads after stores.


References

[1] Memory Barriers: a Hardware View for Software Hackers, Linux Technology Center, IBM Beaverton http://www.rdrop.com/users/paulmck/scalability/paper/whymb.2010.07.23a.pdf

[2] Intel® 64 and IA-32 ArchitecturesSoftware Developer’s Manual, Volume 3A http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-software-developer-vol-3a-part-1-manual.pdf

[3] OpenSPARC T2 Core Microarchitecture Specification http://www.oracle.com/technetwork/systems/opensparc/t2-06-opensparct2-core-microarch-1537749.html

Tehee answered 20/5, 2014 at 22:11 Comment(17)
Thanks for such a detailed answer, really appreciated it. I will read it with high attention tomorrow with fresh brain :)Hernandez
I strongly suggest you read the first reference until you fully understand this paragraph on page 13: "The effect of this is that a read memory barrier orders ONLY loads on the CPU that executes it, so that ..."Tehee
@Tehee Your demonstration is based on the x86 memory model which is somewhat stronger than alternative architectures, say ARM, Alpha or PowerPC. I believe some of these archtecture do have finer instructions and that what the op proposed would make sense on those.Bagpipes
@Bagpipes I wrote about x86, because that is the architecture I am most familiar with (and most other people). With so many architectures, a separate question would be needed for each. The reference [1] explains arch-specific stuff. Please remember that Java needs to be implemented across MOST architectures. Where separate LFENCE/SFENCE are not supported (ARM is probably like this), they are all implemented as MFENCE. Where even more fine-grained control exists (SPARC does have StoreLoad and LoadStore) then the primitive is not available in Java (portability is rated more important).Tehee
@Tehee Actually on x86 some of the Java synchronization operations are translated into no-ops because the processor gives stronger guarantees than required by the language. You could imagine that Java could have StoreLoad or LoadStore instructions and that for those cpus where these are not supported, it would be translated into the closest available instruction that provides at least the required semantics. What I'm just trying to say is that they could have gone for a finer grained implementation but there must be good reasons why they did not.Bagpipes
I get your point and I agree. That is what I meant with "Where even more fine-grained control exists (SPARC does have StoreLoad and LoadStore) then the primitive is not available in Java (portability is rated more important)" in the comment above. In fact, after that comment I edited the post to reflect this in the first 2 paragraphs. SPARC is such a case (again, reference [1] explains most of these), but providing an example for each architecture would take forever and make the answer 10 times longer. A separate question about "how does cache coherence work in arch X would be required".Tehee
Ok, now I managed to read this. Great explanation indeed and most likely I'll give you the bounty (but in five days). Although, I have some questions yet: you wrote that SPARC architecture has LoadStore and StoreLoad fences. Later, in the bottom of the answer, you wrote that stores cannot be reordered with loads. Can you please add some explanation about the meaning of this fences? Second: you added a disclaimer that this is mostly about x86 architecture. If you can reference somewhere like "hey, for other arch this particular fact may not be true" I'd also be very grateful. Thanks :)Hernandez
Great explanation. Very helpfulBury
You wrote no other instructions will start executing on this core, until ALL entries in the SOBs have been processed - but that's very much not what SFENCE does. That description would make it a full barrier, draining the store buffer before later loads (or stores) can even execute. Only MFENCE or a locked operation drain the store buffer before letting later loads execute. x86's asm memory model is program order + a store buffer with store forwarding so SFENCE and LFENCE are useless for memory ordering except for movnt weakly-ordered stores.Travesty
Store barriers only block later stores from committing to cache before earlier stores are visible. On every architecture AFAIK, any barrier that includes blocking StoreLoad reordering (draining the store buffer before later loads), that barrier is also a full barrier that blocks everything else. StoreLoad is much more expensive than anything else. preshing.com/20120710/…Travesty
Anyway, the key point here is that a Java storeFence() or loadFence() only has to block compile-time reordering when targeting x86. Just like C++ std::atomic_thread_fence(std::memory_order_acq_rel) compiles to zero asm instructions, only a "compiler barrier" that blocks reordering at compile time, Java can only has to do the same thing to get the underlying HW's memory-ordering model. (Related: When should I use _mm_sfence _mm_lfence and _mm_mfence covers the diff between compile-time vs. runtime ordering.)Travesty
preshing.com/20120625/memory-ordering-at-compile-timeTravesty
You say for SPARC: So, different threads consult different store-order buffers hence the possibility for dirty reads after stores. that's not unique to SPARC; every ISA has a different store buffer for each logical core (hardware thread). This is the reason / mechanism behind x86's reordering stores with later loads (StoreLoad). And BTW, it would be very strange not to have store-forwarding (or at a stall in a low-performance design) for a RAW hazard where you reload a recent store. The cardinal rule of OoO / pipeline / superscalar execution is not to break single-threaded code.Travesty
@PeterCordes: Not a SPARC expert, but from [3] my understanding is this not about SMT and logical cores. In SMT when a logical core switches to another thread, that thread will uses the same register set that the previous one was using on that same logical core (the OS saves/restores the values in the registers as part of context switch). SPARC has multiple distinct register-sets and flips the "active window" so the other thread resumes on a distinct set of registers. This is to speed up context switching (OS does not need to save/restore register values).Tehee
@Alexandros: I'm not a SPARC expert either, but a hardware-assisted context switch that's still done by software would have a different thread running on the same core, so it should "see" earlier stores. (i.e. share a store buffer for the purposes of store-forwarding.) i.e. the banked registers are part of the architectural state of one core. None of this depends on whether the core is one of multiple that share the same physical core or just a simple physical core with no SMT. SMT is transparent to the ISA model of what a core is.Travesty
I don't think it's plausible for kernel code to be able to do anything that tells the hardware "don't bother forwarding from any in-flight stores; just let them appear in L1d cache at some point". IDK why you'd ever want that, and it wouldn't be likely to speed up execution unless you somehow still have un-executed instructions as well so the store-data isn't ready to forward from. But that would require renaming the privilege level, except for switching between kernel tasks. It would be hard to use safely, like you'd probably need a memory barrier to drain the SB to be safe, defeating itTravesty
Anyway, UltraSPARC T2 (aka Niagara 2) does have 8 hardware threads per physical core. T1 was a "barrel processor, rotating the front-end between 4 threads (each with their own architectural state). This is fairly fine-grained SMT, and yes each logical core aka thread needs its own store buffer, I think even for in-order cores like Niagara. By this point they may have wanted to support the SPARC TSO memory ordering model which like x86 requires stores to commit in program order, so they can't just let graduated stores become visible to allTravesty
B
8

A good source of information is the JEP 171 itself.

Rationale:

The three methods provide the three different kinds of memory fences that some compilers and processors need to ensure that particular accesses (loads and stores) do not become reordered.

Implementation (extract):

for the C++ runtime versions (in prims/unsafe.cpp), implementing via the existing OrderAccess methods:

    loadFence:  { OrderAccess::acquire(); }
    storeFence: { OrderAccess::release(); }
    fullFence:  { OrderAccess::fence(); }

In other words the new methods closely relate to how memory fences are implemented at the JVM and CPU levels. They also match the memory barrier instructions available in C++, the language in which hotspot is implemented.

A finer grained approach would probably have been feasible but the benefits are not obvious.

For example if you look at the table of cpu instructions in the JSR 133 Cookbook, you will see that LoadStore and LoadLoad map to the same instructions on most architectures, i.e. both are effectively Load_LoadStore instructions. So having a single Load_LoadStore (loadFence) instruction at the JVM level seems like a reasonable design decision.

Bagpipes answered 21/5, 2014 at 11:48 Comment(0)
A
6

The doc for storeFence() is wrong. See https://bugs.openjdk.java.net/browse/JDK-8038978

loadFence() is LoadLoad plus LoadStore, so useful often called acquire fence.

storeFence() is StoreStore plus LoadStore, so useful often called release fence.

LoadLoad LoadStore StoreStore are cheap fences (nop on x86 or Sparc, cheap on Power, maybe expensive on ARM).

IA64 has different instructions for acquire and release semantics.

fullFence() is LoadLoad LoadStore StoreStore plus StoreLoad.

StordLoad fence is expensive (on almost all CPU), almost as expensive as full fence.

That justifies the API design.

Arytenoid answered 2/6, 2015 at 18:30 Comment(1)
See also preshing.com/20120930/weak-vs-strong-memory-modelsTravesty
B
0

It looks that storeFence() should be mapped to "loadStore_storeFence" according to the comments from source code:

https://github.com/openjdk/jdk/blob/master/src/java.base/share/classes/jdk/internal/misc/Unsafe.java#L3422

/**
 * Ensures that loads and stores before the fence will not be reordered with
 * stores after the fence; a "StoreStore plus LoadStore barrier".
 * ...
 * /
Beaverbrook answered 5/5, 2020 at 14:34 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.