Why is data structure alignment important for performance?
Can someone give me a short and plausible explanation for why the compiler adds padding to data structures in order to align their members? I know that it's done so that the CPU can access the data more efficiently, but I don't understand why this is so.

And if this is only CPU-related, why is a double 4-byte aligned on Linux and 8-byte aligned on Windows?

Emulsify answered 5/1, 2010 at 13:18 Comment(3)
There are two separate but related issues: data alignment and data structure padding. – Dung
gcc aligns doubles on 8 bytes as well on x86 machines, the same as Microsoft's compiler. – Kathernkatheryn
Why are doubles 8-byte aligned if the CPU reads data in chunks of 4 bytes? It shouldn't matter then whether the double is 8- or 4-byte aligned, no? – Emulsify

Alignment helps the CPU fetch data from memory in an efficient manner: fewer cache misses/flushes, fewer bus transactions, etc.

Some memory types (e.g. RDRAM, DRAM) need to be accessed in a structured manner (aligned "words" and in "burst transactions", i.e. many words at a time) in order to yield efficient results. This is due to many factors, among which:

  1. setup time: time it takes for the memory devices to access the memory locations
  2. bus arbitration overhead, i.e. many devices might want to access the memory device

"Padding" is used to correct the alignment of data structures in order to optimize transfer efficiency.


In other words, accessing a "mis-aligned" structure will yield lower overall performance. A good example of such a pitfall: suppose a mis-aligned data structure requires the CPU/memory controller to perform 2 bus transactions (instead of 1) in order to fetch it; performance is consequently lower.

Wicked answered 5/1, 2010 at 13:23 Comment(5)
So what exactly happens if, let's say, a float is 1-byte aligned? – Emulsify
@Mat: then, depending on "where" the "float variable" ends up being allocated in memory, the efficiency of accessing this "float variable" will vary. – Wicked
But do I understand correctly that the performance of accessing a badly aligned float will not be worse than accessing a correctly aligned double? – Emulsify
@Mat: no: accessing a mis-aligned structure will result in lower overall efficiency, i.e. lower performance. – Wicked
Assume a scenario in which there are 2 chunks of data, one of which is 4 bytes and the other a single byte. Isn't the bus transaction overhead the same whether the second chunk is padded to 4 bytes or not? The CPU will first try to fetch the 4-byte chunk and then the single byte. If the second chunk were padded to 4 bytes, the CPU would again fetch 4 bytes, 2 times in a row. What does padding the second chunk to 4 bytes change in terms of performance in this case? Note: I assumed a word size of 32 bits. – Dare

The CPU fetches data from memory in groups of 4 bytes (it actually depends on the hardware: it's 8 or other values for some types of hardware, but let's stick with 4 to keep it simple). All is well if the data begins at an address divisible by 4: the CPU goes to the memory address and loads the data.

Now suppose the data begins at an address not divisible by 4, say, for the sake of simplicity, at address 1. The CPU must fetch the data starting at address 0 and then apply some operation to drop the byte at address 0 in order to get at the actual data at byte 1. This takes time and therefore lowers performance. So it is much more efficient to have all data addresses aligned.

Briquet answered 5/1, 2010 at 13:33 Comment(4)
Not necessarily in groups of 4 bytes: this is highly dependent on the CPU type. – Wicked
This is a bit simplified: it's fine to have a BYTE-sized value at a memory location not divisible by 4. It's also fine to have a WORD-sized value at a memory location divisible by 2. – Proclitic
@Alon: it can be much worse than what you depict ("dumping bytes"): suppose the data structure spans a boundary that requires an additional bus transaction; then performance goes out the window. – Wicked
@jldupont: I agree. I disregarded paging and cache levels and sizes to illustrate the main point more clearly. – Briquet

A cache line is a basic unit of caching. Typically it is 16-64 bytes or more.

Pentium IV: 64 bytes; Pentium Pro/II: 32 bytes; Pentium I: 32 bytes; 486: 16 bytes.

myrandomreader:           ; loop: read from pseudo-random addresses
  ; ...
  ; ten instructions to generate next pseudo-random
  ; address in ESI from previous address
  ; ...
  MOV EAX, DS:[ESI]   ; X - the load whose alignment matters
  LOOP myrandomreader

For a memory read straddling two cache lines:

  • (on an L1 cache miss) the processor must wait for the whole of cache line 1 to be read from L2 into L1 before it can request the second cache line, causing a short execution stall

  • (on an L2 cache miss) the processor must wait for two burst reads from the L3 cache (if present) or main memory to complete rather than one

Processor stalls

  • A random 4-byte read will straddle a cache-line boundary about 5% of the time for 64-byte cache lines, 10% for 32-byte ones and 20% for 16-byte ones.

  • There may be additional execution overheads for some instructions on misaligned data even if it is within a cache line. This is discussed on the Intel website for some SSE instructions.

  • If you are defining the structures yourself, it may make sense to group all the <32-bit data fields together in a struct so that padding overhead is reduced, or alternatively to review whether it is better to turn packing on or off for a particular structure.

  • On MIPS and many other platforms you don't get the choice and must align: kernel exception if you don't!

  • Alignment may also matter especially if you are doing I/O on the bus, if you are using atomic operations such as atomic increment/decrement, or if you wish to be able to port your code to non-Intel platforms.

  • In Intel-only (!) code, a common practice is to define one set of packed structures for network and disk, and another padded set for in-memory use, and to have routines to convert data between these formats (also consider "endianness" for the disk and network formats).

Sowers answered 5/1, 2010 at 16:46 Comment(0)

In addition to jldupont's answer, some architectures have load and store instructions (those used to read/write to and from memory) that only operate on word-aligned boundaries, so loading a non-aligned word from memory takes two load instructions, a shift instruction, and then a mask instruction: much less efficient!

Coth answered 5/1, 2010 at 13:34 Comment(2)
Does reading a type smaller than 4 bytes (bool, short, whatever) always involve a masking operation and, if it's not 4-byte aligned, also a shift instruction? – Emulsify
@Mat: not necessarily a "shift instruction": at the circuitry level, chip designers tend to refer to this type of operation as something along the lines of "byte swappers". – Wicked

© 2022 - 2024 — McMap. All rights reserved.