InterlockedExchange and memory alignment

Asked 19/5, 2009 at 9:34 Answered 3/3, 2011 at 9:39

I am confused: Microsoft says memory alignment is required for InterlockedExchange, but the Intel documentation says memory alignment is not required for LOCK. Am I missing something? Thanks.

from Microsoft MSDN Library

Platform SDK: DLLs, Processes, and Threads InterlockedExchange

The variable pointed to by the Target parameter must be aligned on a 32-bit boundary; otherwise, this function will behave unpredictably on multiprocessor x86 systems and any non-x86 systems.
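As a rough illustration of the MSDN requirement, the sketch below checks for a 32-bit boundary before doing an atomic exchange. It uses C11 `atomic_exchange` as a portable stand-in, not the Windows `InterlockedExchange` itself; the helper names `is_aligned` and `checked_exchange` are hypothetical.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: true if p is a multiple of `alignment` bytes.
   `alignment` must be a power of two. */
static bool is_aligned(const void *p, size_t alignment)
{
    return ((uintptr_t)p & (alignment - 1)) == 0;
}

/* Hypothetical stand-in for InterlockedExchange, built on C11 atomics.
   Stores `value` and writes the old contents to *previous.
   Returns false without touching memory if `target` is not on a
   4-byte (32-bit) boundary. */
static bool checked_exchange(_Atomic int32_t *target, int32_t value,
                             int32_t *previous)
{
    if (!is_aligned(target, 4))
        return false;
    *previous = atomic_exchange(target, value);
    return true;
}
```

A variable declared as `_Atomic int32_t` is normally placed on a 4-byte boundary by the compiler, so the check mainly matters for pointers computed by hand, for example into a packed buffer.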

from Intel Software Developer’s Manual;

LOCK instruction Causes the processor’s LOCK# signal to be asserted during execution of the accompanying instruction (turns the instruction into an atomic instruction). In a multiprocessor environment, the LOCK# signal insures that the processor has exclusive use of any shared memory while the signal is asserted.

The integrity of the LOCK prefix is not affected by the alignment of the memory field. Memory locking is observed for arbitrarily misaligned fields.
Memory Ordering in P6 and More Recent Processor Families

Locked instructions have a total order.
Software Controlled Bus Locking

The integrity of a bus lock is not affected by the alignment of the memory field. The LOCK semantics are followed for as many bus cycles as necessary to update the entire operand. However, it is recommended that locked accesses be aligned on their natural boundaries for better system performance:
• Any boundary for an 8-bit access (locked or otherwise).
• 16-bit boundary for locked word accesses.
• 32-bit boundary for locked doubleword accesses.
• 64-bit boundary for locked quadword accesses.
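The natural-boundary recommendation above can be expressed as a small address check. This is only a sketch: the function names are hypothetical, and the 64-byte cache-line size is an assumption that the quoted text does not state.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed cache-line size in bytes; not taken from the quoted manual. */
#define ASSUMED_CACHE_LINE 64u

/* True if `addr` is a multiple of the operand size.
   `size` is expected to be 1, 2, 4 or 8 (8-, 16-, 32-, 64-bit access). */
static bool naturally_aligned(uintptr_t addr, size_t size)
{
    return (addr & (size - 1)) == 0;
}

/* True if an access of `size` bytes starting at `addr` touches two
   cache lines, i.e. its first and last byte lie in different lines. */
static bool splits_cache_line(uintptr_t addr, size_t size)
{
    return (addr / ASSUMED_CACHE_LINE) !=
           ((addr + size - 1) / ASSUMED_CACHE_LINE);
}
```

A naturally aligned operand of 8 bytes or fewer can never split a line, which is one reason the recommendation is stated in terms of natural boundaries.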

Pantelleria answered 19/5, 2009 at 9:34 Comment(1)

Does this answer your question? alignment requirements for atomic x86 instructions – Lucretialucretius 27/11, 2019 at 7:21

Once upon a time, Microsoft supported Windows NT on processors other than x86, such as MIPS, PowerPC, and Alpha. These processors all require alignment for their interlocked instructions, so Microsoft put the requirement in the spec to ensure that these primitives would be portable across architectures.

Bridges answered 16/6, 2009 at 20:12 Comment(2)

Also x64 mode requires alignment on interlocked operations – Domel 11/9, 2009 at 8:35

@Rom: No it doesn't, x86-64 still "only" needs alignment for performance with lock-prefixed instructions. See alignment requirements for atomic x86 instructions for a quote from Intel's vol.3 manual. The split-lock performance penalty is very high, but it's not a correctness problem. – Lucretialucretius 27/11, 2019 at 7:30

Even though the lock prefix doesn't require memory to be aligned, and the cmpxchg operation that's probably used to implement InterlockedExchange() doesn't require alignment, if the OS has enabled alignment checking then the cmpxchg will raise an alignment check exception (AC) when executed with unaligned operands. Check the docs for cmpxchg and similar instructions, looking at the list of protected mode exceptions. I don't know for sure that Windows enables alignment checking, but it wouldn't surprise me.

Abduct answered 21/5, 2009 at 8:50 Comment(2)

Should that be? : "cmpxchg operation ... <strike>doesn't</strike> does require alignment" – Acidforming 11/8, 2011 at 8:36

You probably can't enable the AC flag without memcpy faulting, so it's not really a plausible situation / use-case. Modern compilers emit code that does potentially-unaligned loads, too, e.g. loading multiple char or short struct members even if they're not necessarily aligned. Windows certainly does not enable AC by default. – Lucretialucretius 27/11, 2019 at 7:31

Hey, I answered a few questions related to this. Also keep in mind:

There is NO byte-level InterlockedExchange; there IS, however, a 16-bit (short) InterlockedExchange.
The documentation discrepancy you refer to is probably just a documentation oversight.
If you want to do byte/bit-level atomic access, there ARE plenty of ways to do this with the existing intrinsics: Interlocked[And8|Or8|Xor8].
Any operation where you're doing high-performance locking (using the machine code you discuss) should not operate on unaligned data (a performance anti-pattern).
xchg (an instruction with an implicit LOCK prefix, optimized by its ability to use a cache lock and avoid a full bus lock to main memory) CAN do 8-bit interlocked operations.
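For comparison, byte-wide atomic operations can be sketched in portable C11. The helpers below are hypothetical analogues of the byte-level intrinsics mentioned above, not the Windows intrinsics themselves.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical helpers built on C11 atomics. Each returns the byte
   value that was stored before the operation. */

/* Atomically sets the bits given in `mask`. */
static uint8_t byte_set_bits(_Atomic uint8_t *flags, uint8_t mask)
{
    return atomic_fetch_or(flags, mask);
}

/* Atomically clears the bits given in `mask`. */
static uint8_t byte_clear_bits(_Atomic uint8_t *flags, uint8_t mask)
{
    return atomic_fetch_and(flags, (uint8_t)~mask);
}

/* Atomically replaces the whole byte. */
static uint8_t byte_exchange(_Atomic uint8_t *flags, uint8_t value)
{
    return atomic_exchange(flags, value);
}
```

A byte has no alignment constraint, so none of these helpers needs an address check.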

I nearly forgot: Intel's TBB has 8-bit load/store defined without the use of implicit or explicit locking (in some cases):

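.code 
    ALIGN 4
    PUBLIC c __TBB_machine_load8
__TBB_machine_Load8:
    ; If location is on stack, compiler may have failed to align it correctly, so we do dynamic check.
    mov ecx,4[esp]              ; ecx = address of the location (first argument)
    test ecx,7                  ; any of the low three bits set => not 8-byte aligned
    jne load_slow               ; misaligned path (load_slow is not shown in this excerpt)
    ; Load within a cache line
    sub esp,12                  ; reserve scratch space on the stack
    fild qword ptr [ecx]        ; one 64-bit load through the x87 unit
    fistp qword ptr [esp]       ; write the 64-bit value to the scratch space
    mov eax,[esp]               ; low dword of the result
    mov edx,4[esp]              ; high dword of the result (returned in edx:eax)
    add esp,12                  ; release the scratch space
    ret

EXTRN __TBB_machine_store8_slow:PROC
.code 
    ALIGN 4
    PUBLIC c __TBB_machine_store8
__TBB_machine_Store8:
    ; If location is on stack, compiler may have failed to align it correctly, so we do dynamic check.
    mov ecx,4[esp]              ; ecx = address of the location (first argument)
    test ecx,7                  ; any of the low three bits set => not 8-byte aligned
    jne __TBB_machine_store8_slow ;; tail call to tbb_misc.cpp
    fild qword ptr 8[esp]       ; load the 64-bit value argument from the stack
    fistp qword ptr [ecx]       ; one 64-bit store to the location
    ret
end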
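The routines above operate on qword operands and take the fast path only when the address passes `test ecx,7`. A comparable pair can be sketched in portable C11, leaving the instruction choice to the compiler. The names `load8` and `store8` are hypothetical, and relaxed ordering is an assumption made here to model a pure load and a pure store.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical 64-bit atomic load: reads the whole value in one
   indivisible access. Relaxed ordering is an assumption of this sketch. */
static uint64_t load8(_Atomic uint64_t *location)
{
    return atomic_load_explicit(location, memory_order_relaxed);
}

/* Hypothetical 64-bit atomic store: writes the whole value in one
   indivisible access, with the same relaxed-ordering assumption. */
static void store8(_Atomic uint64_t *location, uint64_t value)
{
    atomic_store_explicit(location, value, memory_order_relaxed);
}
```

Declaring the variable as `_Atomic uint64_t` lets the compiler choose suitable alignment, so no dynamic check like the one in the assembly is written out here.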

Anyhow, hope that clears at least some of this up for you.

Konyn answered 29/5, 2009 at 10:58 Comment(1)

Those are 8-byte load/store implementations that use x87 to do 64-bit (qword) aligned load/store. (see Why is integer assignment on a naturally aligned variable atomic on x86?). 8-bit pure load / pure store is always atomic. – Lucretialucretius 27/11, 2019 at 6:55

I don't understand where your Intel information is coming from.

To me, it's pretty clear that Intel cares A LOT about alignment and/or spanning cache lines.

For example, on a Core i7 processor, you STILL have to make sure your data does not span cache lines, or else the operation is NOT guaranteed to be atomic.

In Volume 3-I, System Programming, for x86/x64, Intel clearly states:

8.1.1 Guaranteed Atomic Operations

The Intel486 processor (and newer processors since) guarantees that the following basic memory operations will always be carried out atomically:

Reading or writing a byte

Reading or writing a word aligned on a 16-bit boundary

Reading or writing a doubleword aligned on a 32-bit boundary

The Pentium processor (and newer processors since) guarantees that the following additional memory operations will always be carried out atomically:

Reading or writing a quadword aligned on a 64-bit boundary

16-bit accesses to uncached memory locations that fit within a 32-bit data bus

The P6 family processors (and newer processors since) guarantee that the following additional memory operation will always be carried out atomically:

Unaligned 16-, 32-, and 64-bit accesses to cached memory that fit within a cache line

Accesses to cacheable memory that are split across cache lines and page boundaries are not guaranteed to be atomic by the Intel Core 2 Duo, Intel® Atom™, Intel Core Duo, Pentium M, Pentium 4, Intel Xeon, P6 family, Pentium, and Intel486 processors. The Intel Core 2 Duo, Intel Atom, Intel Core Duo, Pentium M, Pentium 4, Intel Xeon, and P6 family processors provide bus control signals that permit external memory subsystems to make split accesses atomic; however, nonaligned data accesses will seriously impact the performance of the processor and should be avoided.
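The quoted rules for plain (non-LOCKed) accesses can be condensed into one predicate. This is a sketch under stated assumptions: it models a P6-or-later processor accessing cached memory, the function name is hypothetical, and the 64-byte line size is an assumption not given in the quote.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed cache-line size in bytes; not taken from the quoted manual. */
#define ASSUMED_CACHE_LINE 64u

/* Hypothetical predicate: is a plain read or write of `size` bytes at
   `addr` covered by the quoted 8.1.1 guarantees (P6 or later, cached
   memory)? Locked instructions are outside the scope of this check. */
static bool plain_access_guaranteed_atomic(uintptr_t addr, size_t size)
{
    if (size == 1)
        return true;                    /* bytes are always atomic */
    if (size != 2 && size != 4 && size != 8)
        return false;                   /* not covered by the quoted list */
    if ((addr & (size - 1)) == 0)
        return true;                    /* naturally aligned */
    /* Unaligned: covered only if the access fits within one cache line. */
    return (addr / ASSUMED_CACHE_LINE) ==
           ((addr + size - 1) / ASSUMED_CACHE_LINE);
}
```

For example, a 4-byte access at address 0x1001 stays inside one line and is covered, while the same access at 0x103E straddles two lines and is not.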

Eventide answered 3/3, 2011 at 9:39 Comment(1)

The information presented in this answer seems to relate to "basic memory operations" whereas the question's context is locked operations. – Acidforming 11/8, 2011 at 8:34
