SSE: unaligned load and store that crosses page boundary

I read somewhere that before performing an unaligned load or store near a page boundary (e.g. using the _mm_loadu_si128 / _mm_storeu_si128 intrinsics), code should first check whether the whole vector (in this case 16 bytes) lies within a single page, and switch to non-vector instructions if it does not. I understand that this is needed to prevent a crash (segfault) if the next page does not belong to the process.

But what if both pages belong to the process (e.g. they are part of one buffer, and I know the size of that buffer)? I wrote a small test program that performed an unaligned load and store crossing a page boundary, and it did not crash. Do I always have to check for a page boundary in this case, or is it enough to make sure I do not overflow the buffer?

Env: Linux, x86_64, gcc

Goodwill answered 9/6, 2016 at 21:27 Comment(0)

Page splits (accesses that cross a page boundary) are bad for performance, but they don't affect the correctness of unaligned accesses. When you know the length ahead of time, it is enough to make sure you don't read past the end of the buffer.
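As a minimal sketch of the known-length case (not code from the question): the helper name count_byte is hypothetical, and __builtin_popcount is a GCC/Clang builtin. The vector loop only runs while a full 16-byte load fits inside the buffer, so it never needs to know where the page boundaries are.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

/* Count the bytes equal to `value` in buf[0..n).
 * Hypothetical helper; only the bounds handling matters here. */
size_t count_byte(const uint8_t *buf, size_t n, uint8_t value)
{
    const __m128i needle = _mm_set1_epi8((char)value);
    size_t count = 0;
    size_t i = 0;

    /* Vector loop: runs only while a whole 16-byte load stays inside buf.
     * The load may cross a page boundary; that is slow on some CPUs but
     * cannot fault, because every byte belongs to the buffer. */
    for (; i + 16 <= n; i += 16) {
        __m128i v  = _mm_loadu_si128((const __m128i *)(buf + i));
        __m128i eq = _mm_cmpeq_epi8(v, needle);
        count += (size_t)__builtin_popcount((unsigned)_mm_movemask_epi8(eq));
    }

    /* Scalar tail for the last n % 16 bytes. */
    for (; i < n; i++)
        count += (buf[i] == value);

    return count;
}
```

Calling count_byte(p, len, 0) with p a few bytes before a page boundary is fine as long as p[0..len) is all inside one allocation.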


For correctness, you often need to worry about it when implementing something like strlen, where your loop stops when you find a sentinel value. That value could be at any position within your vector, so just doing 16B unaligned loads will read past the end of the array. If the terminating 0 is in the last byte of one page, and the next page is not readable, and your current-position pointer is unaligned, a load that includes the 0 byte will also include bytes from the unreadable page, so it will fault.
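The check the question refers to can be sketched as follows. PAGE_BYTES and load16_crosses_page are hypothetical names, and the 4 KiB page size is an assumption that holds for normal pages on x86-64 Linux.

```c
#include <stdint.h>

#define PAGE_BYTES 4096u  /* assumption: 4 KiB pages */

/* Returns nonzero if a 16-byte load starting at p would touch two pages.
 * The load covers page offsets o .. o+15, so it crosses the boundary
 * exactly when o > PAGE_BYTES - 16. */
static inline int load16_crosses_page(const void *p)
{
    return ((uintptr_t)p & (PAGE_BYTES - 1)) > PAGE_BYTES - 16;
}
```

A sentinel-terminated loop could call this before each unaligned load and fall back to byte-at-a-time code when it returns nonzero.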

One solution is to use scalar code until your pointer is aligned, then load aligned vectors. An aligned load always comes entirely from one page, and also from one cache line. So even though you will read some bytes past the end of the string, you are guaranteed not to fault. Valgrind might complain about it, but standard library strlen implementations use this technique.
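A minimal sketch of that approach, assuming SSE2 and GCC/Clang (__builtin_ctz). The name strlen_sse2_aligned is hypothetical, and this is not how any particular libc writes it. Note that reading past the terminator is outside what ISO C guarantees, so this relies on the hardware behaviour described above.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

size_t strlen_sse2_aligned(const char *s)
{
    const char *p = s;

    /* Scalar prologue: go byte by byte until p is 16-byte aligned. */
    while (((uintptr_t)p & 15) != 0) {
        if (*p == '\0')
            return (size_t)(p - s);
        p++;
    }

    /* Aligned loop: each load stays within one page (and one cache line),
     * so it cannot fault even if it reads bytes past the terminator. */
    const __m128i zero = _mm_setzero_si128();
    for (;;) {
        __m128i v = _mm_load_si128((const __m128i *)p);
        unsigned mask = (unsigned)_mm_movemask_epi8(_mm_cmpeq_epi8(v, zero));
        if (mask != 0)  /* lowest set bit = first zero byte in this vector */
            return (size_t)(p - s) + (size_t)__builtin_ctz(mask);
        p += 16;
    }
}
```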

Instead of scalar until an aligned pointer, you could do an unaligned vector from the start of the string (as long as that won't cross a page-line), and then do aligned loads. The first aligned load will overlap the first unaligned load, but that's totally fine for a function like strlen that doesn't care if it sees the same data twice.
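The overlapping variant might look like the sketch below. It makes the same assumptions as any such strlen (SSE2, GCC/Clang builtins, 4 KiB pages); the names PAGE_BYTES and strlen_sse2_overlap are hypothetical. When the first 16 bytes would cross a page, this sketch simply falls back to the scalar prologue.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

#define PAGE_BYTES 4096u  /* assumption: 4 KiB pages */

size_t strlen_sse2_overlap(const char *s)
{
    const __m128i zero = _mm_setzero_si128();
    const char *p = s;

    if (((uintptr_t)s & (PAGE_BYTES - 1)) <= PAGE_BYTES - 16) {
        /* The first 16 bytes lie in one page: one unaligned load is safe. */
        __m128i v = _mm_loadu_si128((const __m128i *)s);
        unsigned mask = (unsigned)_mm_movemask_epi8(_mm_cmpeq_epi8(v, zero));
        if (mask != 0)
            return (size_t)__builtin_ctz(mask);
        /* Round s + 16 down to a 16-byte boundary. Unless s was already
         * aligned, the next load re-reads a few bytes checked above. */
        p = (const char *)(((uintptr_t)s + 16) & ~(uintptr_t)15);
    } else {
        /* Too close to the end of the page: scalar until aligned. */
        while (((uintptr_t)p & 15) != 0) {
            if (*p == '\0')
                return (size_t)(p - s);
            p++;
        }
    }

    for (;;) {  /* aligned loads never cross a page */
        __m128i v = _mm_load_si128((const __m128i *)p);
        unsigned mask = (unsigned)_mm_movemask_epi8(_mm_cmpeq_epi8(v, zero));
        if (mask != 0)
            return (size_t)(p - s) + (size_t)__builtin_ctz(mask);
        p += 16;
    }
}
```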


It might be worth avoiding page splits for performance reasons. Even if you know your src pointer is misaligned, it's often faster to let the hardware handle cache-line splits. But before Skylake, page splits have an extra ~100 cycles of latency (down to 5 cycles on Skylake). If you have multiple pointers that can be aligned differently relative to each other, you can't always just use a prologue to align your src (e.g. c[i] = a[i] + b[i], where c is aligned but b isn't).

In that case, it might be worth using a branch to do aligned loads from before and after the page split, and combine them with palignr.
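One way this could be sketched with intrinsics is shown below; it is an illustration, not tuned code. The names PAGE_BYTES, ALIGNR_CASE and load16_avoid_page_split are hypothetical, the page size is assumed to be 4 KiB, and the target attribute is a GCC/Clang extension used so the SSSE3 intrinsic compiles without -mssse3. Because palignr takes its shift count as an immediate, the sketch dispatches on the misalignment with a switch; in a real loop the misalignment is usually constant, so you would pick one case outside the loop.

```c
#include <tmmintrin.h>  /* SSSE3: _mm_alignr_epi8 (palignr) */
#include <stdint.h>

#define PAGE_BYTES 4096u  /* assumption: 4 KiB pages */

/* palignr needs a compile-time constant shift, hence one case per offset. */
#define ALIGNR_CASE(n) case n: return _mm_alignr_epi8(hi, lo, n)

__attribute__((target("ssse3")))
__m128i load16_avoid_page_split(const void *p)
{
    uintptr_t addr = (uintptr_t)p;

    /* Common case: the load stays inside one page, let the hardware do it. */
    if ((addr & (PAGE_BYTES - 1)) <= PAGE_BYTES - 16)
        return _mm_loadu_si128((const __m128i *)p);

    /* Page-split case: two aligned loads, one on each side of the boundary.
     * Both contain at least one wanted byte, so both pages are valid. */
    const __m128i *base = (const __m128i *)(addr & ~(uintptr_t)15);
    __m128i lo = _mm_load_si128(base);
    __m128i hi = _mm_load_si128(base + 1);

    /* Shift the 32-byte pair hi:lo right by the misalignment in bytes. */
    switch (addr & 15) {
        ALIGNR_CASE(1);  ALIGNR_CASE(2);  ALIGNR_CASE(3);  ALIGNR_CASE(4);
        ALIGNR_CASE(5);  ALIGNR_CASE(6);  ALIGNR_CASE(7);  ALIGNR_CASE(8);
        ALIGNR_CASE(9);  ALIGNR_CASE(10); ALIGNR_CASE(11); ALIGNR_CASE(12);
        ALIGNR_CASE(13); ALIGNR_CASE(14); ALIGNR_CASE(15);
    }
    return lo;  /* not reached: a page-crossing address is never aligned */
}
```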

A branch mispredict (~15 cycles) is cheaper than the page-split latency, but it delays everything (not just the load). So it might also not be worth it, depending on the hardware and the ratio of computation to memory access.


If you're writing a function that is usually called with aligned pointers, it makes sense to just use unaligned load/store instructions. Any prologue to detect misalignment is just extra overhead for the already-aligned case, and on modern hardware (Nehalem and newer), unaligned load instructions used on addresses that turn out to be aligned at runtime perform identically to aligned load instructions. (But you need AVX for unaligned loads to fold into other instructions as memory operands, e.g. vpxor xmm0, xmm1, [rsi].)

By adding code to handle misaligned inputs, you're slowing down the common aligned case to speed up the uncommon misaligned case. Fast hardware support for unaligned loads/stores lets software just leave that to the hardware for the few cases where it does happen.
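For the c[i] = a[i] + b[i] example, leaving alignment to the hardware could look like this minimal sketch (the name add_i32 is hypothetical). There is no alignment prologue at all; only the array length is respected.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

/* c[i] = a[i] + b[i] for i in [0, n), with wrapping 32-bit addition.
 * No alignment handling: unaligned loads/stores cost nothing extra when
 * the pointers happen to be aligned at runtime (Nehalem and newer). */
void add_i32(int32_t *c, const int32_t *a, const int32_t *b, size_t n)
{
    size_t i = 0;

    for (; i + 4 <= n; i += 4) {
        __m128i va = _mm_loadu_si128((const __m128i *)(a + i));
        __m128i vb = _mm_loadu_si128((const __m128i *)(b + i));
        _mm_storeu_si128((__m128i *)(c + i), _mm_add_epi32(va, vb));
    }

    /* Scalar tail; unsigned arithmetic so it wraps like paddd. */
    for (; i < n; i++)
        c[i] = (int32_t)((uint32_t)a[i] + (uint32_t)b[i]);
}
```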

(If misaligned inputs are common, then it is worth it to use a prologue to align your input pointer, esp. if you're using AVX. Sequential 32B AVX loads will cache-line split every other load.)

See Agner Fog's Optimizing Assembly guide for more info, and other links in the tag wiki.

Brisco answered 10/6, 2016 at 1:49 Comment(4)
@ZheyuanLi: Yeah, I'm curious what design change enabled that. Skylake can also do two page-walks in parallel to resolve two TLB misses. Those two facts may be connected. - Brisco
Thanks! I also did not realize that cross-page access may have such a high cost. So this is definitely something to look out for. - Unfreeze
BTW, Valgrind has an option, --partial-loads-ok=yes, which can hide "Invalid read" issues caused by vector loads when the loaded data is past the end of the buffer. - Unfreeze
@DanielFrużyński - this option only works in limited cases. It doesn't simply allow all partial loads, but makes an effort to track the valid and invalid bytes in the loaded word, and then still issues an "invalid read" warning if an invalid byte is subsequently accessed. Unfortunately, many common idioms "access" the invalid bytes, but are still correct because they throw away any such invalid result, e.g. memchr may cap the result to the size of the region. Also, certain other instructions like bsf cause all bytes to be flagged as accessed, even though the behavior is correct. - Mogador
