How do modern cpus handle crosspage unaligned access?

Asked 11/5, 2014 at 14:46 Answered 22/2, 2016 at 10:39

Solved linux arm x86-64 memory-alignment mmu

I'm trying to understand how unaligned memory access (UMA) works on modern processors (namely x86-64 and ARM architectures). I get that I might run into problems with UMA ranging from performance degradation till CPU fault. And I read about posix_memalign and cache lines.

What I cannot find is how the modern systems/hardware handle the situation when my request exceeds page boundaries?

Here is an example:

I malloc() an 8KB chunk of memory.
Let's say that malloc() doesn't have enough memory and sbrk()s 8KB for me.
The kernel gets two memory pages (4KB each) and maps them into my process's virtual address space (let's say that these two pages are not one after another in memory
movq (offset + $0xffc), %rax I request 8 bytes starting at the 4092th byte, meaning that I want 4 bytes from the end of the first page and 4 bytes from the beginning of the second page.

Physical memory:

---|---------------|---------------|-->
   |... 4b|        |        |4b ...|-->

I need 8 bytes that are split at the page boundaries.

How do MMU on x86-64 and ARM handle this? Are there any mechanisms in kernel MM to somehow prepare for this kind of request? Is there some kind of protection in malloc? What do processors do? Do they fetch two pages?

I mean to complete such request MMU has to translate one virtual address to two physical addresses. How does it handle such request?

Should I care about such things if I'm a software programmer and why?

I'm reading a lot of links from google, SO, drepper's cpumemory.pdf and gorman's Linux VMM book at the moment. But it's an ocean of information. It would be great if you at least provide me with some pointers or keywords that I could use.

Sihon answered 11/5, 2014 at 14:46 Comment(0)

I'm not overly familiar with the guts of the Intel architecture, but the ARM architecture sums this specific detail up in a single bullet point under "Unaligned data access restrictions":

An operation that performs an unaligned access can abort on any memory access that it makes, and can abort on more than one access. This means that an unaligned access that occurs across a page boundary can generate an abort on either side of the boundary.

So other than the potential to generate two page faults from a single operation, it's just another unaligned access. Of course, that still assumes all the caveats of "just another unaligned access" - namely it's only valid on normal (not device) memory, only for certain load/store instructions, has no guarantee of atomicity and may be slow - the microarchitecture will likely synthesise an unaligned access out of multiple aligned accesses¹, which means multiple MMU translations, potentially multiple cache misses if it crosses a line boundary, etc.

Looking at it the other way, if an unaligned access doesn't cross a page boundary, all that means is that if the aligned address for the first "sub-access" translates OK, the aligned addresses of any subsequent parts are sure to hit in the TLB. The MMU itself doesn't care - it just translates some addresses that the processor gives it. The kernel doesn't even come into the picture unless the MMU raises a page fault, and even then it's no different from any other page fault.

I've had a quick skim through the Intel manuals and their answer hasn't jumped out at me - however in the "Data Types" chapter they do state:

[...] the processor requires two memory accesses to make an unaligned access; aligned accesses require only one memory access.

so I'd be surprised if wasn't broadly the same (i.e. one translation per aligned access).

Now, this is something most application-level programmers shouldn't have to worry about, provided they behave themselves - outside of assembly language, it's actually quite hard to make unaligned accesses happen. The likely culprits are type-punning pointers and messing with structure packing, both things that 99% of the time one has no reason to go near, and for the other 1% are still almost certainly the wrong thing to do.

[1] The ARM architecture pseudocode actually specifies unaligned accesses as a series of individual byte accesses, but I'd expect implementations actually optimise this into larger aligned accesses where appropriate.

Daph answered 11/5, 2014 at 18:41 Comment(5)

Thank you. This is really a great answer, more than I expected. – Sihon 11/5, 2014 at 19:21

arm can/will turn unaligned writes into 2 or three separate transactions, but reads can be done in a single transaction since the length is in units of the bus width and the mask is for writes not reads. – Trunkfish 11/5, 2014 at 19:28

@dwelch True (as I alluded to), but a read across a page boundary would still need to be split since there's no guarantee the physical addresses are contiguous. – Daph 11/5, 2014 at 22:58

It is actually configurable on an ARM, but since the question is tagged with Linux, every thing is correct. For instance, an un-aligned access can just do a masked load with rotations and there are no table walks with the MMU off. Worse case is a boundary that straddles two MMU-L1 tables. You could have 6 cache misses; 2xMMU-L1 , 2xMMU-L2 and the two data accesses. For user space, this may mean two reads from disk/backing store. The instruction may also be emulated. – Tycho 12/5, 2014 at 14:34

Ah, review of the instruction emulation give some really horrid cases, like ldm and ldrd where multiple unaligned words are used. A compiler may do this with an unaligned structure pointer, etc. If you are smart, don't use un-aligned data. Usually, it is not hard to eliminate with a little thought. Typcially, this is used by serialization code; but there is lots of extra thought (swapping, efficient transfer, etc) that should be done. So wrappers should be used for serialization. – Tycho 12/5, 2014 at 17:16

So the architecture doesnt really matter other than x86 has traditionally not directly told you not to where mips and arm traditionally generate a data abort rather than trying to just make it work.

where it doesnt matter is that all processors have a fixed number of pins a fixed size (maximum) data bus a fixed size (max) address bus, "modern processors" tend to have data busses more than 8 bits wide but the units on addresses is still an 8 bit byte, so the opportunity for unaligned exists. Anything larger than one byte in a particular transfer has the opportunity of being unaligned if the architecture allows.

Transfers are typically in some units of bytes and/or bus widths. On an ARM amba/axi bus for example the length field is in units of bus widths, 32 or 64 bits, 4 or 8 bytes. And no it is not going to be in units of 4Kbytes....

(yes this is elementary I assume you understand all of this).

Whether it is 16 bits or 128 bits, the penalty for unaligned comes from the additional bus cycles which these days is an extra bus clock per. So for an ARM 16 bit unaligned transfer (which arm will support on its newer cores without faulting) that means you need to read 128 bits instead of 64, 64 bits to get 16 is not a penalty as 64 is the smallest size for a bus transfer. Each transfer whether it is a single width of the data bus or multiple has multiple clock cycles associated with it, lets say there are 6 clock cycles to do an aligned 16 bit read, then ideally it is 7 cycles to do an unaligned 16 bit. Seems small but it does add up.

caches help alot because the dram side of the cache will be setup to use multiples of the bus width and will always do aligned accesses for cache fetches and evictions. not-cached accesses will follow the same pain except the dram side is not handfuls of clocks but dozens to hundreds of clocks of overhead.

For random access a single 16 bit read that not only spans a bus width boundary but also happens to cross a cache line boundary will not just incur the one additional clock on the processor side but worst case it can incur an addition cache line fetch which is dozens to hundreds of additional clock cycles. if you were walking through an array of things that happen to not be aligned (structures/unions may be an example depending on the compiler and code) that additional cache line fetch would have happened anyway, if the array of things is a little over on one or both ends then you might still incur one or two more cache line fetches that you would have avoided had the array been aligned.

That is really the key to this on reads is before or after an aligned area you might have to incur a transfer for each one for each side you spill into.

Writes are both good and bad. random reads are slower because the transaction has to stall until the answer comes back. For a random write the memory controller has all the information it needs it has the address, data, byte mask, transfer type, etc. So it is fire and forget the processor has done its job and can call the transaction complete from its perspective and move on. Naturally gang too much of these up or do a read on something just written and then the processor stalls due to the completion of a prior write in addition to the current transaction.

An unaligned 16 bit write for example does not only incur the additional read cycle but assuming a 32 or 64 bit wide bus that would be one byte per location so you have to do a read-modify-write on whatever that closest memory is (cache or dram). so depending on how the processor and then memory controller implements it it can be two individual read-modify-write transactions (unlikely since that incurs twice the overhead), or the double width read, modify both parts, and a double width read. incurring two additional clocks over and above the overhead, the overhead is doubled as well. If it had been an aligned bus width write then no read-modify-write is required, you save the read. Now if this read-modify-write is in the cache then that is pretty fast but still noticeable up to a few clocks depending on what is queued up and you have to wait on.

I am also most familiar with ARM. Arm traditionally would punish an unaligned access with an abort, you could turn that off, and you would instead get a rotation of the bus rather than it spilling over which would make for some nice freebie endian swaps. the more modern arm cores will tolerate and implement an unaligned transfer. Understand for example a store multiple of say 4 or more registers against a non-64-bit-aligned address is not considered an unaligned access even though it is a 128 bit write to an address that is neither 64 nor 128 bit aligned. What the processor does in that case is brakes it into 3 writes, an aligned 32 bit write, an aligned 64 bit write and an aligned 32 bit write. the memory controller does not have to deal with the unaligned stuff. That is for legal things like store multiple. the core I am familiar with wont do a write length of more than 2 anyway, an 8 register store multiple, is not a single length of 4 write it is 2 separate length of two writes. But a load multiple of 8 registers, so long it is aligned on a 64 bit address is a single length of 4 transaction. I am pretty sure that since there is no masking on the bus side for a read, everything is in units of bus width, there is no reason to break say a 4 register load multiple on an address that is not 64 bits aligned into 3 transactions, simply do a length of 3 read. When the processor reads a single byte you cant tell that from the bus all you see is a 64 bit read AFAIK. The processor strips the byte lane out. If the processor/bus does care be it arm, x86, mips, etc, then sure you will hopefully see separate transfers.

Does everyone do this? no older processors (not thinking of an arm nor x86) would put more burden on the memory controller. I dont know what modern x86 and mips and such do.

Your malloc example. First off you are not going to see single bus transfers of 4Kbytes, that 4k will be broken up into digestible bits anyway. first off it has to do one to many bus cycles against the memory management unit to find the physical address and other properties anyway (those answers can get cached to make them faster, but sometimes they have to go all the way out to slow dram) so for that example the only transfer that matters is an aligned transfer that splits the 4k boundary, say a 16 bit transfer, for the mmu system to work at all the only way for that to be supported is that has to be turned into two separate 8 bit transfers that happen in those physical address spaces, and yes that literally doubles everything the mmu lookup cycles the cache/dram bus cycles, etc. Other than that boundary there is nothing special about your 8k being split. the bulk of your cycles will be within one of the two 4k pages, so it looks like any other random access, with of course repetitive/sequential accesses gaining the benefit of caching.

The short answer is that no matter what platform you are on either 1) the platform will abort an unaligned transfer, or 2) somewhere in the path there is an additional one or more (dozens/hundreds) as a result of the unaligned access compared to an aligned access.

Trunkfish answered 11/5, 2014 at 19:56 Comment(1)

+1 for the 10th paragraph (I am also most familiar with ARM...) and the last. – Tycho 12/5, 2014 at 17:22

It doesn't matter whether the physical pages are adjacent or not. Modern CPUs use caches. Data is transferred to/from DRAM a full cache-line at a time. Thus, DRAM will never see a multi-byte read or write that crosses a 64B boundary, let alone a page boundary.

Stores that cross a page boundary are still slow (on modern x86). I assume the hardware handles the page-split case by detecting it at some later pipeline stage, and triggering a re-do that does two TLB checks. IDK if Intel designs insert extra uops into the pipeline to handle it, or what. (i.e. impact on latency, throughput of page-splits, throughput of all memory accesses, throughput of other (e.g. non-memory) uops).

Normally there's no penalty at all for unaligned accesses within a cache-line (since about Nehalem), and a small penalty for cache-line splits that aren't page-splits. An even split is apparently cheaper than others. (e.g. a 16B load that takes 8B from one cache line and 8B from another).

Anyway, DRAM will never see an unaligned access directly. AFAIK, no sane modern design has only write-through caches, so DRAM only sees writes when a cache-line is flushed, at which point the fact that one unaligned access dirtied two cache lines is not available. Caches don't even record which bytes are dirty; they just burst-write the whole 64B to the next level down (or last-level to DRAM) when needed.

There are probably some CPU designs that don't work this way, but Intel and AMD's designs are also this way.

Caveat: loads/stores to uncachable memory regions might produce smaller stores, but probably still only within a single cache-line. (On x86, this prob. applies to MOVNT non-temporal stores that use write-combining store buffers but otherwise bypass the cache).

Uncacheable unaligned stores that cross a page boundary are probably still split into separate stores (because each part needs a separate TLB translation).

Caveat 2: I didn't fact-check this. I'm certain about the whole-cache-line aligned access to DRAM for "normal" loads/stores to "normal" memory regions, though.

Scorper answered 22/2, 2016 at 10:39 Comment(1)

Update: How can I accurately benchmark unaligned access speed on x86_64? has much more detailed info. (No extra load uop, but a later cycle is needed on the port.) – Scorper 7/12, 2023 at 15:0

Hot tags

Godot Unity Godot Help Programming Godot 4.X GUI GDScript 3D 2D Physics CSharp Godot 3.X VR XR Projects C++

Recommended topics

Hot tags