Alignment requirements for atomic x86 instructions vs. MS's InterlockedCompareExchange documentation?

Microsoft offers the InterlockedCompareExchange function for performing atomic compare-and-swap operations. There is also an _InterlockedCompareExchange intrinsic.

On x86 these are implemented using the lock cmpxchg instruction.
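For reference, here is a minimal C11 sketch of what such a compare-and-swap does. The name `cas32` is hypothetical and this is not Microsoft's implementation; it only mirrors the documented behaviour of returning the value the destination held before the call. On x86, compilers typically lower `atomic_compare_exchange_strong` to `lock cmpxchg`.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for InterlockedCompareExchange (not Microsoft's code).
 * If *dest == comparand, store exchange into *dest.
 * Either way, return the value *dest held before the operation. */
static int32_t cas32(_Atomic int32_t *dest, int32_t exchange, int32_t comparand)
{
    assert(dest != NULL);
    int32_t expected = comparand;
    /* On failure, the current value of *dest is written back into expected;
     * on success, expected still equals the old value (the comparand). */
    atomic_compare_exchange_strong(dest, &expected, exchange);
    return expected;
}
```

A caller can tell whether the swap happened by comparing the return value with the comparand it passed in.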

However, reading through the documentation on these three approaches, they don't seem to agree on the alignment requirements.

Intel's reference manual says nothing about alignment (other than that if alignment checking is enabled and an unaligned memory reference is made, an exception is generated).

I also looked up the lock prefix, which specifically states that

The integrity of the LOCK prefix is not affected by the alignment of the memory field.

(emphasis mine)

So Intel seems to say that alignment is irrelevant. The operation will be atomic no matter what.

The _InterlockedCompareExchange intrinsic documentation also says nothing about alignment. However, the documentation for the InterlockedCompareExchange function states that

The parameters for this function must be aligned on a 32-bit boundary; otherwise, the function will behave unpredictably on multiprocessor x86 systems and any non-x86 systems.

So what gives? Are the alignment requirements for InterlockedCompareExchange just there to make sure the function will work even on pre-486 CPUs, where the cmpxchg instruction isn't available? That seems likely based on the above information, but I'd like to be sure before I rely on it. :)

Or is alignment required by the ISA to guarantee atomicity, and I'm just looking in the wrong places in Intel's reference manuals?

Porridge answered 12/9, 2009 at 14:24 Comment(2)
Yes lock op works on misaligned addresses, but it's potentially much slower. And pure-load / pure-store (mov) on a misaligned variable wouldn't be atomic, and you couldn't make them atomic except by replacing them with xchg or lock cmpxchg: Why is integer assignment on a naturally aligned variable atomic on x86?Hylo
I'm asking myself what this discussion is good for? Who will ever not align an atomic operation?Lyublin
A
10

The PDF you are quoting from is from 1999 and CLEARLY outdated.

The up-to-date Intel documentation, specifically Volume-3A tells a different story.

For example, on a Core i7 processor, you STILL have to make sure your data does not span cache lines, or else the operation is NOT guaranteed to be atomic.

In Volume 3A (System Programming Guide) for x86/x64, Intel clearly states:

8.1.1 Guaranteed Atomic Operations

The Intel486 processor (and newer processors since) guarantees that the following basic memory operations will always be carried out atomically:

  • Reading or writing a byte
  • Reading or writing a word aligned on a 16-bit boundary
  • Reading or writing a doubleword aligned on a 32-bit boundary

The Pentium processor (and newer processors since) guarantees that the following additional memory operations will always be carried out atomically:

  • Reading or writing a quadword aligned on a 64-bit boundary
  • 16-bit accesses to uncached memory locations that fit within a 32-bit data bus

The P6 family processors (and newer processors since) guarantee that the following additional memory operation will always be carried out atomically:

  • Unaligned 16-, 32-, and 64-bit accesses to cached memory that fit within a cache line

Accesses to cacheable memory that are split across cache lines and page boundaries are not guaranteed to be atomic by the Intel Core 2 Duo, Intel® Atom™, Intel Core Duo, Pentium M, Pentium 4, Intel Xeon, P6 family, Pentium, and Intel486 processors. The Intel Core 2 Duo, Intel Atom, Intel Core Duo, Pentium M, Pentium 4, Intel Xeon, and P6 family processors provide bus control signals that permit external memory subsystems to make split accesses atomic; however, nonaligned data accesses will seriously impact the performance of the processor and should be avoided
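To make "fit within a cache line" concrete, here is a small C sketch of the arithmetic. The helper name `crosses_cache_line` and the 64-byte line size are assumptions for illustration; the line size is common on current x86 parts but is not fixed by the architecture.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: 64-byte cache lines. Query the real value on the target CPU
 * if it matters. */
#define ASSUMED_CACHE_LINE 64u

/* True if the size-byte access starting at addr touches two cache lines. */
static bool crosses_cache_line(uintptr_t addr, size_t size)
{
    assert(size > 0 && size <= ASSUMED_CACHE_LINE);
    uintptr_t first = addr / ASSUMED_CACHE_LINE;              /* line of first byte */
    uintptr_t last  = (addr + size - 1) / ASSUMED_CACHE_LINE; /* line of last byte  */
    return first != last;
}
```

Under that assumption, a 4-byte access at offset 60 stays inside one line, while the same access at offset 62 is split across two.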

Assassin answered 3/3, 2011 at 9:37 Comment(6)
The text I quoted above IS from the Intel manuals, and it clearly states the different alignment requirements per processor family. I should probably have used different wording to express that the UPDATED Intel information is very clear. I guess that's what you get for reading a .pdf from 1999.Assassin
-1: You got the wrong section from the right manual. Basic memory operations and locked atomic operations are different things.Adnate
@Assassin -1: As MackieMesser rightly points out, your quote talks about basic memory operations and not atomic operations, ie operations prefixed with a LOCK, which was what the OP asked, as LOCK is what's used in case of atomic x86 operations.Debutant
@MackieMesser, I don't follow completely your down vote, the PRM explicitly stresses that certain operations on aligned memory are ATOMIC as if using the LOCK prefix - "Certain basic memory transactions (such as reading or writing a byte in system memory) are always guaranteed to be handled atomically. That is, once started, the processor guarantees that the operation will be completed before another processor or bus agent is allowed access to the memory location."Carnes
@ShmilTheCat Because this is incorrect for CMPXCHG: "you have to make sure data doesn't span several cache lines to be atomic". The quoted section applies to basic memory operations but not locked atomic operations. CMPXCHG works just fine with unaligned addresses. The problem is that it is slow to do so, and that's why aligned addresses are recommended, but not required.Adnate
@MackieMesser: Note that plain cmpxchg itself is not atomic. Your previous comment is true for lock cmpxchg, like the question asked about. (But unlike a pure-load or pure-store as the poorly-chosen quote in this answer is talking about, non-locked cmpxchg on a multi-core system is a non-atomic RMW even on an aligned address. Using it without lock only makes sense on a unicore, since like most instructions it's atomic wrt. interrupts. Is x86 CMPXCHG atomic, if so why does it need LOCK?)Hylo
A
11

x86 does not require alignment for a lock cmpxchg instruction to be atomic. However, alignment is necessary for good performance.

This should be no surprise: backward compatibility means that software written with a manual from 14 years ago will still run on today's processors. Modern CPUs even have a performance counter specifically for split-lock detection because it's so expensive. (The core can't just hold onto exclusive access to a single cache line for the duration of the operation; it does have to do something like a traditional bus lock.)

Why exactly Microsoft documents an alignment requirement is not clear. It's certainly necessary for supporting RISC architectures, but the specific claim of unpredictable behaviour on multiprocessor x86 might not even be valid. (Unless they mean unpredictable performance, rather than a correctness problem.)

Your guess of applying only to pre-486 systems without lock cmpxchg might be right; a different mechanism would be needed there which might have required some kind of locking around pure loads or pure stores. (Also note that 486 cmpxchg has a different and currently-undocumented opcode (0f a7) from modern cmpxchg (0f b1) which was new with 586 Pentium; Windows might have only used cmpxchg on P5 Pentium and later, I don't know.) That could maybe explain weirdness on some x86, without implying weirdness on modern x86.

Intel® 64 and IA-32 Architectures Software Developer’s Manual
Volume 3 (3A): System Programming Guide
January 2013

8.1.2.2 Software Controlled Bus Locking

To explicitly force the LOCK semantics, software can use the LOCK prefix with the following instructions when they are used to modify a memory location. [...]

• The exchange instructions (XADD, CMPXCHG, and CMPXCHG8B).
• The LOCK prefix is automatically assumed for XCHG instruction.
• [...]

[...] The integrity of a bus lock is not affected by the alignment of the memory field. The LOCK semantics are followed for as many bus cycles as necessary to update the entire operand. However, it is recommended that locked accesses be aligned on their natural boundaries for better system performance:

• Any boundary for an 8-bit access (locked or otherwise).
• 16-bit boundary for locked word accesses.
• 32-bit boundary for locked doubleword accesses.
• 64-bit boundary for locked quadword accesses.
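The natural-boundary recommendation in that list amounts to "the address is a multiple of the operand size". A minimal C sketch of that check follows; the helper name `is_naturally_aligned` is hypothetical and not part of any Intel or Microsoft API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* True if addr is a multiple of size, for a power-of-two size (1, 2, 4, 8). */
static bool is_naturally_aligned(uintptr_t addr, size_t size)
{
    assert(size != 0 && (size & (size - 1)) == 0); /* power of two only */
    return (addr & (size - 1)) == 0;
}
```

For a real variable one would call it as `is_naturally_aligned((uintptr_t)&counter, sizeof counter)`. Ordinary C objects of integer type already satisfy this unless they were placed through packed structs or pointer casts.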


Fun fact: cmpxchg without a lock prefix is still atomic wrt. context switches, so is usable for multi-threading on a single-core system.

Even misaligned it's still atomic wrt. interrupts (either completely before or completely after), and only memory reads by other devices (e.g. DMA) could see tearing. But such accesses could also see the separation between load and store, so even if old Windows did use that for a more efficient InterlockedCompareExchange on single-core systems, it still wouldn't require alignment for correctness, only performance. If this can be used for hardware access, Windows probably wouldn't do that.

If the library function needed to do a pure load separate from the lock cmpxchg this might make sense, but it doesn't need to do that. (If not inlined, the 32-bit version would have to load its args from the stack, but that's private, not access to the shared variable.)

Adnate answered 20/3, 2013 at 12:11 Comment(0)
D
4

See this SO question: natural alignment is important for performance, and is required on the x64 architecture. So it's not just PRE-x86 systems, but POST-x86 ones too (x64 may still be a bit of a niche case, but it's growing in popularity after all ;-). That may be why Microsoft documents it as required. It's hard to find docs on whether MS has decided to FORCE the alignment issue by enabling alignment checking, and that may vary by Windows version. By claiming in the docs that alignment is required, MS keeps the freedom to force it in some versions of Windows even if they did not force it in others.

Dissonancy answered 12/9, 2009 at 15:15 Comment(5)
Thanks. And bah, of course someone else had asked this before. I shouldn't be surprised... :p About x64, does it require alignment for all atomic instructions, even the ones that didn't require it in 32-bit mode? Not that I don't believe you, but it seems a bit surprising if they're breaking backwards compatibility like that. Got a source for that?Porridge
I have no info on x64's alignment issues except Microsoft's (see also forum.winimage.com/viewtopic.php?t=137 for other discussions and pointers about x64 alignment, beyond atomicity). BTW, what backwards compatibility? x64 is a new architecture (chips that run it also run x86 for old 32-bit code) so there's no "backwards" -- machine code that runs in x64 (rather than x86 legacy mode) has to have been written/compiled/generated specifically for it, not for x86!-)Dissonancy
That link seems to say that alignment is only required on Itanium, not x64, which is what I'd expect. It obviously still has a (major) impact on performance, but it would be odd if x64 suddenly required alignment for instructions that didn't require it in x86. And never mind the backwards compatibility thing. It was half brainfart, and half irrelevant to the question. ;) (The instruction set is basically the same though, as far as I know. The changes mainly consist of adding new instructions, and adding another optional prefix byte to allow you to specify one of the new registers.)Porridge
Yes, but you can control whether mis-alignment causes exceptions on both x64 AND itanium -- the one difference is that it defaults to off on x64, to on in itanium (where the perf hit if you disable the exceptions is HUGE -- 10 times, vs 2/3 times on x64). Looks like win64 doesn't supply an intrinsic to enable exceptions (it does supply one to DISABLE them!-), but you can do it in machine code.Dissonancy
@jalf: x86-64 does not require alignment for atomicity of lock cmpxchg. It's identical to 32-bit mode. Setting x86(-64)'s AC (alignment checking) flag will lead to exceptions in most memcpy and so on library implementations so is not viable under mainstream OSes. This answer doesn't make any sense to me. Your guess in the question about pre-486 or pre-586 is the only plausible theory I've seen that could explain a real correctness problem. You should probably accept Mackie's answer on this question, not the one about pure-load / pure-store that's currently accepted but irrelevant.Hylo
K
3

Microsoft's Interlocked APIs also applied to ia64 (while it still existed). There was no lock prefix on ia64, only the cmpxchg.acq and cmpxchg.rel instructions (or fetchadd and other similar beasties), and these all required alignment if I recall correctly.
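As a loose illustration of the acquire/release split those ia64 instructions expose, here is a C11 sketch using explicit memory orders. The helper names `try_acquire` and `release_flag` are hypothetical, and the mapping to `cmpxchg.acq` and a release store is an analogy, not a statement of what any compiler emits for ia64. On x86 a `lock`-prefixed instruction is already a full barrier, so the distinction does not change the locked instruction there.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: try to take a flag (0 -> 1) with acquire ordering,
 * roughly the role the answer describes for cmpxchg.acq. */
static bool try_acquire(atomic_int *flag)
{
    assert(flag != NULL);
    int expected = 0;
    return atomic_compare_exchange_strong_explicit(
        flag, &expected, 1, memory_order_acquire, memory_order_relaxed);
}

/* Hypothetical helper: drop the flag (store 0) with release ordering. */
static void release_flag(atomic_int *flag)
{
    assert(flag != NULL);
    atomic_store_explicit(flag, 0, memory_order_release);
}
```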

Knighthood answered 25/11, 2009 at 23:1 Comment(4)
ia64 still exists, and Windows still runs on it, as far as I know. Anyway, my question was specifically about x86. :)Porridge
re: ia64. I'm pretty sure Windows stopped shipping an os for ia64 after intel released their version of amd64 (x86-64), and no vista nor win7 for ia64. Now there is only the x86 and x86-64 versions of the os. As far as I'm concerned that effectively kills ia64 (unless you count HPUX IPF). re: interlocked and x86. It is my recollection that we first started seeing the Interlocked APIs when microsoft released their ia64 version of the platform SDK. If that is accurate, it would likely explain why the Interlocked documentation does not rely on the LOCK prefix semantics of x86.Knighthood
IA64 Windows existed up to Server 2008 R2, which was released just before the previous comment was posted.Helenehelenka
It is my recollection that Microsoft had quietly dropped itanium from their roadmap of new windows versions very quickly after intel announced a 'intel64' version of amd64. Microsoft may have continued support for the already released product versions, but they had killed all the non-server versions, and itanium was effectively recognized as good and dead by anybody who had been developing for it.Knighthood
