Is it worth bothering to align AVX-256 memory stores?
According to the Intel® 64 and IA-32 Architectures Optimization Reference Manual, section B.4 ("Performance Tuning Techniques for Intel® Microarchitecture Code Name Sandy Bridge"), subsection B.4.5.2 ("Assists"):

32-byte AVX store instructions that span two pages require an assist that costs roughly 150 cycles.

I'm using YMM registers in a heap manager to copy small fixed-size memory blocks of 32 to 128 bytes, and the blocks are aligned by 16 bytes. That heap manager previously used XMM registers with movdqa, and I would like to "upgrade" it to YMM without changing the alignment from 16 to 32 bytes. So I'm using vmovdqu ymm0, ymmword ptr [rcx], then vmovdqu ymmword ptr [rdx], ymm0, etc.
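For example, the fixed-size routine for a 64-byte block looks roughly like this (just a sketch of what I mean; source pointer in rcx, destination in rdx, as above):

// sketch: copy one 64-byte block, source in rcx, destination in rdx
vmovdqu ymm0, ymmword ptr [rcx]       //unaligned 32-byte load
vmovdqu ymm1, ymmword ptr [rcx+32]
vmovdqu ymmword ptr [rdx], ymm0       //unaligned 32-byte store
vmovdqu ymmword ptr [rdx+32], ymm1
ret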

If I understood Intel's document correctly, a 32-byte store that crosses a 4K page boundary costs roughly 150 extra cycles.

But since the blocks are already aligned by 16 bytes, only one of the 256 possible 16-byte-aligned positions within a page (the one starting 16 bytes before the page end) makes a 32-byte store cross the boundary, so the chance of hitting a page-crossing store is 16/4096 = 1/256. Extrapolating statistically, each 32-byte store costs on average 1/256 * 150 (= 0.5859375) cycles of penalty on Sandy Bridge.

This is not that much and is definitely cheaper than branching to check the alignment, or the memory wasted by increasing the alignment from 16 bytes to 32 bytes.
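For comparison, the branchy check I would rather avoid looks something like this (just a sketch, not code I actually use; the store address is assumed to be in rcx):

mov eax, ecx
and eax, 4095          //offset of the store within its 4K page
cmp eax, 4096-32       //a 32-byte store crosses the page iff offset > 4064
ja @crosses_page       //hypothetical slow path; taken for only 1 of 256 possible 16-byte-aligned offsets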

I have the following questions:

  1. Are my calculations correct?

  2. Is aligning AVX-256 memory stores worth bothering for small fixed-sized memory copy routines (32-128 bytes), given that chances to hit the penalty are so low?

  3. Are there processors that have higher unaligned 32-byte store penalties than Sandy Bridge—e.g., AMD or other Intel microarchitectures?

Nevels asked 16/6, 2017 at 9:59 Comment(20)
Note that this penalty is probably not equidistributed. It is likely that in certain use cases, you are going to see it every single time and in others never.Lilah
@CodyGray - I'm using the Delphi programming language. It has Win32 and Win64 targets. The Win32 one is very good - fast code, written by people who took part in a contest - fastcode.sourceforge.net - but that was many years ago. The 64-bit code is almost all written in high-level Pascal and is very slow. What I want is to improve that 64-bit code in the basic runtime libraries - the memory manager, the copy and fill functions used as a standard library by all code, etc. Agner Fog's library is GPL, so I can't use it. I can get some ideas from it, but I need to understand the basics.Nevels
@CodyGray - right now the 32-bit version of our app is even faster than the 64-bit one, and we would like it to be the other way around.Nevels
@CodyGray our software is mass-market software used by thousands of users, under various systems. But from what I understand, the majority of users (>50%) have relatively new CPUs, made in 2013-2017, so they have YMM registers for sure. On memory copy and fill alone we can make huge gains. And I want the 64-bit version to be faster, not just the same speed ;-)Nevels
Alignment in general, not just with AVX, is always better, but you are using Pascal, and your arguments are sound. I read the same thing you read. Not sure which column in the page size table relates to AVX, but the worst case is 4KB. If you are already aligned on 16 bytes then only one case out of 256 will cross the boundary and the other 255 won't, so your odds are correct assuming the data moves around; if it is a compile-time thing you could end up with 100%.Conlen
As for any other processor, past or future: that is too broad. Part of tuning for x86 is that when you tune for one processor you hurt your performance on others. Maybe someone here knows the gory details for all processors, but the odds of that are not good.Conlen
The penalty is 150 on top of whatever it would have taken normally, so 150 on 5 or 150 on 1000. So even if it does hit, how much does it really hurt?Conlen
@Conlen - you are right. Maybe I will change the code of the heap manager and increase the alignment from 16 bytes to 32 bytes, at least for blocks of 32 bytes and larger, when we run on AVX - it will allow me to always use vmovdqa ymm, not vmovdqu ymm, and it will bring benefits in other places - for example, if we later need to copy or fill that memory, it will already be 32-byte aligned.Nevels
If you're only copying small blocks on the order of 32 - 128 bytes, there's one more thing to worry about - forwarding stalls. If the user writes to the memory using smaller stores, then calls your optimized copy which uses SIMD, store-forwarding will fail and you'll take a fairly large penalty. For this reason, I actually do not recommend trying to manually SIMD-optimize small mem-copies unless you know you're touching cold data.Frausto
Faster copy/fill etc. routines for x64 have already been written under an MPL licence. You can find the ones I wrote at github.com/JBontes/FastCode/blob/master/fastsystem.pas, or get the same routines automatically patched into the RTL by downloading SynCommons.pas at github.com/synopse/mORMot/blob/master/SynCommons.pas. Note that 256-bit AVX moves don't make a lot of sense, because most CPUs split 256-bit AVX instructions into two 128-bit ones. In my testing on Skylake I could not tell the difference between AVX2 and SSE moves.Shores
More important are the non-temporal moves for large blocks, which are also incorporated in the above-mentioned code. If you are able to improve on said code I'd be impressed. It's faster than Agner's code.Shores
@CodyGray - The 64-bit Delphi Pascal compiler produces poor code to get, for example, a pointer from an array by index: it doesn't generate an instruction that scales the index register by the size of each array element, as the 32-bit compiler does; instead, it moves a constant with the size of each array element into one register and the index into another register, issues an explicit imul instruction, and then uses the result as an offset. What a shame!Nevels
@Shores - thank you for the information that 256-bit AVX moves are not faster. I will read the Intel manual some more and test on multiple processors; I have all generations of Intel processors from the last 10 years at my disposal except Knights Landing (no AMD, unfortunately), so I will do the tests and let you know.Nevels
@Frausto - thank you for pointing that out. These special copy routines are used exclusively in the memory manager (heap manager) of the entire application. When a realloc happens, it does a table lookup to get the corresponding fixed-size memory copy routine and calls it; for blocks larger than 128 bytes there is a generic routine, but it is also special - it doesn't need to worry about granularity and alignment, since these memory blocks are already granular and aligned. So, anyway, they may be more efficient than a one-size-fits-all memory copy routine that has lots of branching.Nevels
@Johan: Only Intel SnB/IvB (and all AMD CPUs) split AVX256 loads/stores into two 128-bit cache accesses. Intel Haswell and later do 256-bit transfers to/from L1D cache. So it should be faster to use AVX for copies, unless you're bottlenecked on L2 or L3 cache, or on main memory. (Or maybe if you only use 256-bit AVX infrequently for very short bursts, the upper-lane power-down effect might hurt you (agner.org/optimize/blog/read.php?i=415#415). IDK if it applies to simple loads/stores or just to ALU.)Compel
@Maxim: Note that on Skylake and later, the page-split penalty is much smaller. (Like 5 or 10 cycles instead of 150). But it's still high on Haswell/Broadwell and earlier, so it's worth avoiding when it's cheap. You say that your memory manager's smallest block size is 32 bytes. Why can't you just have them all 32-byte aligned without wasting much memory? (Maybe a silly question; I haven't looked seriously at memory allocator code)Compel
@PeterCordes - I had to increase the minimum block size (granularity) to accommodate the size of an AVX register (32 bytes). But for AVX-512 I would have to find a solution - either increase the granularity again (to 64 bytes), or use code to ensure aligned moves, or, as you have pointed out, just neglect the penalty as statistically insignificant - although I have no idea how big the penalty can be with AVX-512. I'm now trying to get a CPU with AVX-512 support; I didn't manage to get one together with a motherboard.Nevels
@MaximMasiutin: On SKL-AVX512 Google Cloud VMs, results so far are that 64B alignment for AVX512 is a lot more important than 32B alignment for AVX2, while looping over big arrays (like 1MB) and doing FP FMA and stuff. I'd suggest using a different copy function for small blocks, i.e. only use AVX for 32B or larger blocks, and only use AVX512 for 64B or larger blocks. You can still have 16B blocks, but copy them with 128-bit AVX instructions.Compel
@PeterCordes - what do you mean by "only use AVX for 32B or larger blocks, and only use AVX512 for 64B or larger blocks. You can still have 16B blocks, but copy them with 128-bit AVX instructions"? By AVX-512 I mean ZMM, which is 64 bytes, so I cannot use AVX-512 for blocks smaller than 64 bytes - I have no choice here, while you seem to be offering a choice ("only use AVX512 for 64B or larger blocks")?Nevels
You don't have to use the same instructions for all block sizes. Your code knows what size blocks it's copying, so for 16B blocks, copy with vmovdqa xmm. For 32B or larger blocks, copy with vmovdqa ymm. (Or if AVX512 is available, use vmovdqa64 zmm for 64B or larger blocks). If different block-sizes already take different code paths, then you don't need any extra branching. (Just maybe some code duplication if there was some common stuff.) If not, then maybe it's not worth it to branch too much to select different copy strategies for different block sizes.Compel

Is it worth bothering to align [...] ?

Yes, it is definitely worth it, and it's also very cheap.

You can do aligned writes to an unaligned block easily, without needing extra jumps to handle the alignment.
For example:

//assume rcx = length of block in bytes, length >= 8.
//assume rdx = pointer to block
xor rax,rax          //fill value (zero)
mov r9,rdx           //remember the original pointer
lea r8,[rdx+rcx-8]   //r8 = address of the unaligned tail write
mov [rdx],rax        //start with an unaligned head write
add rdx,8
and rdx,not(7)       //round rdx up to the next 8-byte boundary
add rcx,r9           //rcx = end address of the block
sub rcx,rdx          //rcx = bytes from the aligned pointer to the end
sub rcx,8            //leave the last 8 bytes for the tail write
jle @tail            //jle fuses with sub
@loop:
  mov [rdx],rax      //all writes in this loop are aligned.
  add rdx,8
  sub rcx,8
  jg @loop           //jg fuses with sub
@tail:
mov [r8],rax         //finish with an unaligned tail write

I'm sure you can extrapolate this non-unrolled example into an optimized AVX2 version.

Alignment is a simple matter of alignedstart = start and not(alignmentsize - 1).
You can then do misalignmentcount = start xor alignedstart to get the number of misaligned bytes.

None of the alignment handling requires extra jumps.
I'm sure you can translate this to AVX.
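As a rough sketch only (assuming ymm0 already holds the fill pattern, rdx points to the block, rcx is the length, and the length is at least 32 bytes), the same head/body/tail idea with 32-byte stores could look like this:

lea r8,[rdx+rcx-32]    //r8 = address of the unaligned tail store
mov r9,rdx             //remember the original pointer
vmovdqu [rdx],ymm0     //unaligned head store
add rdx,32
and rdx,not(31)        //round rdx up to the next 32-byte boundary
add rcx,r9             //rcx = end address of the block
sub rcx,rdx            //rcx = bytes from the aligned pointer to the end
sub rcx,32             //leave the last 32 bytes for the tail store
jle @tail32
@loop32:
  vmovdqa [rdx],ymm0   //all stores in this loop are 32-byte aligned
  add rdx,32
  sub rcx,32
  jg @loop32
@tail32:
vmovdqu [r8],ymm0      //unaligned tail store covers the last 32 bytes

The only unaligned stores left are the head and the tail, and those are the only ones that could ever split a page.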

The FillChar code below is about 3x faster than the standard libs.
Note that I've used jumps there; testing showed it was faster to do so.

{$ifdef CPUX64}
procedure FillChar(var Dest; Count: NativeInt; Value: Byte);
//rcx = dest
//rdx=count
//r8b=value
asm
              .noframe
              .align 16
              movzx r8,r8b           //There's no need to optimize for count <= 3
              mov rax,$0101010101010101
              mov r9d,edx
              imul rax,r8            //fill rax with value.
              cmp edx,59             //Use simple code for small blocks.
              jl  @Below32
@Above32:     mov r11,rcx
              rep mov r8b,7          //code shrink to help alignment.
              lea r9,[rcx+rdx]       //r9=end of array
              sub rdx,8
              rep mov [rcx],rax      //unaligned write to start of block
              add rcx,8              //progress 8 bytes 
              and r11,r8             //is count > 8? 
              jz @tail
@NotAligned:  xor rcx,r11            //align dest
              lea rdx,[rdx+r11]
@tail:        test r9,r8             //and 7 is tail aligned?
              jz @alignOK
@tailwrite:   mov [r9-8],rax         //no, we need to do a tail write
              and r9,r8              //and 7
              sub rdx,r9             //dec(count, tailcount)
@alignOK:     mov r10,rdx
              and edx,(32+16+8)      //count the partial iterations of the loop
              mov r8b,64             //code shrink to help alignment.
              mov r9,rdx
              jz @Initloop64
@partialloop: shr r9,1              //every instruction is 4 bytes
              lea r11,[rip + @partial +(4*7)] //start at the end of the loop
              sub r11,r9            //step back as needed
              add rcx,rdx            //add the partial loop count to dest
              cmp r10,r8             //do we need to do more loops?
              jmp r11                //do a partial loop
@Initloop64:  shr r10,6              //any work left?
              jz @done               //no, return
              mov rdx,r10
              shr r10,(19-6)         //use non-temporal move for > 512kb
              jnz @InitFillHuge
@Doloop64:    add rcx,r8
              dec edx
              mov [rcx-64+00H],rax
              mov [rcx-64+08H],rax
              mov [rcx-64+10H],rax
              mov [rcx-64+18H],rax
              mov [rcx-64+20H],rax
              mov [rcx-64+28H],rax
              mov [rcx-64+30H],rax
              mov [rcx-64+38H],rax
              jnz @DoLoop64
@done:        rep ret
              //db $66,$66,$0f,$1f,$44,$00,$00 //nop7
@partial:     mov [rcx-64+08H],rax
              mov [rcx-64+10H],rax
              mov [rcx-64+18H],rax
              mov [rcx-64+20H],rax
              mov [rcx-64+28H],rax
              mov [rcx-64+30H],rax
              mov [rcx-64+38H],rax
              jge @Initloop64        //are we done with all loops?
              rep ret
              db $0F,$1F,$40,$00
@InitFillHuge:
@FillHuge:    add rcx,r8
              dec rdx
              db $48,$0F,$C3,$41,$C0 // movnti  [rcx-64+00H],rax
              db $48,$0F,$C3,$41,$C8 // movnti  [rcx-64+08H],rax
              db $48,$0F,$C3,$41,$D0 // movnti  [rcx-64+10H],rax
              db $48,$0F,$C3,$41,$D8 // movnti  [rcx-64+18H],rax
              db $48,$0F,$C3,$41,$E0 // movnti  [rcx-64+20H],rax
              db $48,$0F,$C3,$41,$E8 // movnti  [rcx-64+28H],rax
              db $48,$0F,$C3,$41,$F0 // movnti  [rcx-64+30H],rax
              db $48,$0F,$C3,$41,$F8 // movnti  [rcx-64+38H],rax
              jnz @FillHuge
@donefillhuge:mfence
              rep ret
              db $0F,$1F,$44,$00,$00  //db $0F,$1F,$40,$00
@Below32:     and  r9d,not(3)
              jz @SizeIs3
@FillTail:    sub   edx,4
              lea   r10,[rip + @SmallFill + (15*4)]
              sub   r10,r9
              jmp   r10
@SmallFill:   rep mov [rcx+56], eax
              rep mov [rcx+52], eax
              rep mov [rcx+48], eax
              rep mov [rcx+44], eax
              rep mov [rcx+40], eax
              rep mov [rcx+36], eax
              rep mov [rcx+32], eax
              rep mov [rcx+28], eax
              rep mov [rcx+24], eax
              rep mov [rcx+20], eax
              rep mov [rcx+16], eax
              rep mov [rcx+12], eax
              rep mov [rcx+08], eax
              rep mov [rcx+04], eax
              mov [rcx],eax
@Fallthough:  mov [rcx+rdx],eax  //unaligned write to fix up tail
              rep ret

@SizeIs3:     shl edx,2           //r9 <= 3  r9*4
              lea r10,[rip + @do3 + (4*3)]
              sub r10,rdx
              jmp r10
@do3:         rep mov [rcx+2],al
@do2:         mov [rcx],ax
              ret
@do1:         mov [rcx],al
              rep ret
@do0:         rep ret
end;
{$endif}

This is not that much and is definitely cheaper than branching to check the alignment
I think the checks are quite cheap (see above). Note that you can have pathological cases where you incur the penalty every time, because the blocks happen to straddle a boundary over and over.

About mixing AVX and SSE code
On Intel there is a 300+ cycle penalty for mixing AVX and (legacy, i.e. non-VEX encoded) SSE instructions.
If you use AVX2 instructions to write to memory, you'll incur a penalty whenever SSE code runs in the rest of your application, and Delphi 64 uses SSE exclusively for floating point.
Using AVX2 code in this context would incur crippling delays. For this reason alone I suggest you don't consider AVX2.
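If you do go down the 256-bit route anyway, the transition penalty itself can be avoided by executing vzeroupper after the 256-bit code, before any legacy SSE instruction runs - roughly like this (a sketch, using the question's rcx/rdx register assignment):

vmovdqu ymm0,[rcx]       //256-bit load of a 32-byte block
vmovdqu [rdx],ymm0       //256-bit store
vzeroupper               //return to the clean state before legacy SSE code runs
ret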

There is no need for AVX2
You can saturate the memory bus using 64-bit general purpose registers doing just writes.
When doing combined reads and writes, 128-bit reads and writes will also easily saturate the bus.
This was true on older processors, and it certainly holds once you move beyond the L1 cache, but it is no longer true on the latest processors.
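For reference, the kind of 128-bit copy loop meant here is simply this (a sketch; assumes a 16-byte-aligned source in rcx, a 16-byte-aligned destination in rdx, and a byte count that is a multiple of 16 in r8):

@copy16:
  movdqa xmm0,[rcx]      //128-bit aligned load
  movdqa [rdx],xmm0      //128-bit aligned store
  add rcx,16
  add rdx,16
  sub r8,16
  jg @copy16             //sub/jg macro-fuse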

Why is there a penalty for mixing AVX and SSE (legacy) code?
Intel writes the following:

Initially the processor is in clean state (1), where Intel SSE and Intel AVX instructions are executed with no penalty. When a 256-bit Intel AVX instruction is executed, the processor marks that it is in the Dirty Upper state (2). While in this state, executing an Intel SSE instruction saves the upper 128 bits of all YMM registers and the state changes to Saved Dirty Upper state (3). Next time an Intel AVX instruction is executed the upper 128 bits of all YMM registers are restored and the processor is back at state (2). These save and restore operations have a high penalty. Frequent execution of these transitions causes significant performance loss.

There is also the issue of dark silicon. AVX2 code uses a lot of hardware; having all that silicon lit up draws a lot of power, which affects the thermal headroom. When executing AVX2 code the CPU throttles down, sometimes even below the normal non-turbo frequency. By powering down the circuitry for 256-bit AVX, the CPU can achieve higher turbo clocks thanks to the better thermal headroom. The off-switch for the 256-bit circuitry is not seeing any 256-bit code for a longish time (about 675 µs), and the on-switch is seeing 256-bit code again. Switching the circuitry on and off takes many cycles.

Shores answered 17/6, 2017 at 11:19 Comment(18)
Thank you! A month ago I ran memory copy tests on SkyLake with simple 64-bit registers (rax) and 256-bit (ymm) registers, and the ymm code was significantly faster - not just a few percent but 2-3 times faster; I don't remember the exact numbers right now. I will do the tests again and show the code and the conditions. I will also test your FillChar routine against the Delphi Tokyo FillChar routine implemented with SSE+ERMS and my optimized AVX+ERMS version. I don't think you will be able to beat ERMS with just rax moves on SkyLake / Kaby Lake.Nevels
I think I saw somewhere in the Intel optimization guide that even if the data path is 128 bits wide, using 256-bit loads and stores is faster; I will try to find the place and do the tests on Kaby Lake.Nevels
Thank you for your idea of alignment without branching. I will look at how I can use it everywhere, but the problem is that it still involves one unaligned store that may cross a page boundary and invoke the penalty. To avoid the penalty, Intel recommends using multiple smaller stores until we reach the boundary. And to use multiple smaller stores we would still need branching, wouldn't we?Nevels
As for the AVX-SSE transition penalty - YES - on older processors it is significant, but only if you interleave AVX and SSE instructions. If your code is granular enough (e.g. multiple AVX instructions and then multiple SSE instructions, without vzeroupper), the penalty's effect on overall application speed is negligible. But on SkyLake, even when I interleaved AVX and SSE instructions (which nobody does) - I only did this for a test case - it was slower by just 10%.Nevels
"And to use multiple smaller stores we will still need branching, isn't it?", no, just do: mov [rdx],rax; mov[rdx+8],rax; mov [rdx+16],rax; mov [rdx+24],rax do simulate an unaligned single AVX2 write. No jumps needed.Shores
Since we have just one store unit, an AVX2 store is just one cycle, even unaligned, while 4 rax stores are 4 cycles. The potential penalty, statistically, is 0.5 cycles, so the AVX2 store seems to be faster anyway. I see that on Kaby Lake there is no speed difference between VMOVDQU and VMOVDQA for a 256-bit (YMM) store.Nevels
@MaximMasiutin, test the code; you'll see it makes much less difference than you think it does. And there are still the various AVX2 penalties to contend with on older CPUs. Keep in mind that CPUs have a life span of 5 to 6 years these days.Shores
Thank you for your reply. My tests have shown that your assumptions are incorrect and that 256-bit AVX stores provide significant speed improvements on small blocks, at least according to my tests on Kaby Lake - see #43959455 --- on larger blocks, however, it gives the same speed as your code, probably because it reaches out of cache and the memory becomes the bottleneck. Please see the fastcode.sourceforge.net FillChar for the test case application.Nevels
You wrote about saturating the bus. I saw tests here on Stack Overflow (I don't remember the exact URLs) where people found that it is impossible to saturate the bus with just one core, whatever you do. So they ran a multithreaded memory copy and got better results than single-threaded. I should probably do my own tests on that matter.Nevels
@MaximMasiutin Johan's comments about being able to saturate everything with normal word sizes was true 5 years ago (on Sandy Bridge). Of course things have changed since then. So he isn't wrong, just out-of-date with respect to your Kaby Lake. Now, it's sometimes not possible to saturate the memory bus with a single core even if you go all-out SIMD. And with quad-channel memory, you'll need 2 or more cores - even with 256-bit AVX.Frausto
@Frausto - I have also published a test on a Xeon E5-2603 v2 (Ivy Bridge), release date Sep 10, 2013, and the results are the same as on Kaby Lake: on small sizes, 256-bit YMM stores are faster than 64-bit RAX stores. But maybe the difference is actually because of branching, not the store size. Intel wrote in the optimization guide that even if the data path is narrow, wide loads/stores are recommended, because the same effect is produced with fewer instructions, effectively unloading the CPU.Nevels
@MaximMasiutin If branching is an issue, then the benchmark isn't implemented correctly. Everything should be unrolled enough to where the branching doesn't matter.Frausto
@Frausto - I think that the benchmark is implemented correctly. By branching I mean branching inside the FillChar implementation, not branching in the test code. I read that Intel doesn't like indirect jumps through a table; it may be faster to do several cmp/je than a single table-based jump.Nevels
@MaximMasiutin A single jump by table to choose the right loop isn't going to matter if you're looping over GBs of data. Unless you're calling it many times on tiny blocks at a time. I'm mainly talking about a pure streaming benchmark where you pass over GBs of data in a single loop. You will find these inside optimized memcpy's. ICC's memcpy goes into a loop that unrolls 4 AVX vmovaps load+stores (unroll by 4). GCC will sometimes use non-temporal stores as well.Frausto
@Frausto - I agree that a single jump isn't going to matter if you're copying GBs of data, but if memcpy or FillChar is called to process, say, only 3 bytes, it can make a big difference. My tests on SkyLake have also shown that vmovdqa should be unrolled, albeit by 8; unrolling more doesn't improve speed. Inserting up to 5 scalar register operations in between the vmovdqas (e.g. increase source, increase destination, decrease count, compare count, jz) doesn't affect speed - these operations are effectively free.Nevels
Can you use jl or jb instead of js / jns? js / jns can't macro-fuse with sub, but jl and jb can on Intel SnB-family CPUs, so they decode and execute more efficiently.Compel
@PeterCordes, done. Thanks for that tip, wasn't aware that there were limitations on the fusing.Shores
The off-switch for the high 128 bits of the vector ALU isn't seeing a single SSE instruction, it's not seeing any 256-bit instructions for a certain time window. Source: Agner Fog's testing (agner.org/optimize/blog/read.php?i=415), which found the time interval is ~675us. That makes a lot more sense; you wouldn't want to switch frequently (and/or spend a lot of time in slow "warm-up" mode) in code that uses vzeroupper properly to call non-VEX library functions between 256-bit code, and you do want to switch off during a long purely-scalar loop.Compel
