Many methods found in high-performance algorithms could be (and are) simplified if they were allowed to read a small amount past the end of input buffers. Here, "small amount" generally means up to W - 1
bytes past the end, where W
is the word size in bytes of the algorithm (e.g., up to 7 bytes for an algorithm processing the input in 64-bit chunks).
It's clear that writing past the end of an input buffer is never safe, in general, since you may clobber data beyond the buffer1. It is also clear that reading past the end of a buffer into another page may trigger a segmentation fault/access violation, since the next page may not be readable.
In the special case of reading aligned values, however, a page fault seems impossible, at least on x86. On that platform, pages (and hence memory protection flags) have a 4K granularity (larger pages, e.g. 2MiB or 1GiB, are possible, but these are multiples of 4K) and so aligned reads will only access bytes in the same page as the valid part of the buffer.
Here's a canonical example of some loop that aligns its input and reads up to 7 bytes past the end of buffer:
int processBytes(uint8_t *input, size_t size) {
uint64_t *input64 = (uint64_t *)input, end64 = (uint64_t *)(input + size);
int res;
if (size < 8) {
// special case for short inputs that we aren't concerned with here
return shortMethod();
}
// check the first 8 bytes
if ((res = match(*input)) >= 0) {
return input + res;
}
// align pointer to the next 8-byte boundary
input64 = (ptrdiff_t)(input64 + 1) & ~0x7;
for (; input64 < end64; input64++) {
if ((res = match(*input64)) > 0) {
return input + res < input + size ? input + res : -1;
}
}
return -1;
}
The inner function int match(uint64_t bytes)
isn't shown, but it is something that looks for a byte matching a certain pattern, and returns the lowest such position (0-7) if found or -1 otherwise.
First, cases with size < 8 are pawned off to another function for simplicity of exposition. Then a single check is done for the first 8 (unaligned bytes). Then a loop is done for the remaining floor((size - 7) / 8)
chunks of 8 bytes2. This loop may read up to 7 bytes past the end of the buffer (the 7 byte case occurs when input & 0xF == 1
). However, return call has a check which excludes any spurious matches which occur beyond the end of the buffer.
Practically speaking, is such a function safe on x86 and x86-64?
These types of overreads are common in high performance code. Special tail code to avoid such overreads is also common. Sometimes you see the latter type replacing the former to silence tools like valgrind. Sometimes you see a proposal to do such a replacement, which is rejected on the grounds the idiom is safe and the tool is in error (or simply too conservative)3.
A note for language lawyers:
Reading from a pointer beyond its allocated size is definitely not allowed in the standard. I appreciate language lawyer answers, and even occasionally write them myself, and I'll even be happy when someone digs up the chapter and verse which shows the code above is undefined behavior and hence not safe in the strictest sense (and I'll copy the details here). Ultimately though, that's not what I'm after. As a practical matter, many common idioms involving pointer conversion, structure access though such pointers and so are technically undefined, but are widespread in high quality and high performance code. Often there is no alternative, or the alternative runs at half speed or less.
If you wish, consider a modified version of this question, which is:
After the above code has been compiled to x86/x86-64 assembly, and the user has verified that it is compiled in the expected way (i.e., the compiler hasn't used a provable partially out-of-bounds access to do something really clever, is executing the compiled program safe?
In that respect, this question is both a C question and a x86 assembly question. Most of the code using this trick that I've seen is written in C, and C is still the dominant language for high performance libraries, easily eclipsing lower level stuff like asm, and higher level stuff like <everything else>. At least outside of the hardcore numerical niche where FORTRAN still plays ball. So I'm interested in the C-compiler-and-below view of the question, which is why I didn't formulate it as a pure x86 assembly question.
All that said, while I am only moderately interested in a link to the standard showing this is UD, I am very interested in any details of actual implementations that can use this particular UD to produce unexpected code. Now I don't think this can happen without some deep pretty deep cross-procedure analysis, but the gcc overflow stuff surprised a lot of people too...
1 Even in apparently harmless cases, e.g., where the same value is written back, it can break concurrent code.
2 Note for this overlapping to work requires that this function and match()
function to behave in a specific idempotent way - in particular that the return value supports overlapping checks. So a "find first byte matching pattern" works since all the match()
calls are still in-order. A "count bytes matching pattern" method would not work, however, since some bytes could be double counted. As an aside: some functions such as "return the minimum byte" call would work even without the in-order restriction, but need to examine all bytes.
3 It's worth noting here that for valgrind's Memcheck there is a flag, --partial-loads-ok
which controls whether such reads are in fact reported as an error. The default is yes, means that in general such loads are not treated as immediate errors, but that an effort is made to track the subsequent use of loaded bytes, some of which are valid and some of which are not, with an error being flagged if the out-of-range bytes are used. In cases such as the example above, in which the entire word is accessed in match()
, such analysis will conclude the bytes are accessed, even though the results are ultimately discarded. Valgrind cannot in general determine whether invalid bytes from a partial load are actually used (and detection in general is probably very hard).
shortMethod()
for the last chunk? – Amatoryasm()
. :) – Amatory