Is it safe to read past the end of a buffer within the same page on x86 and x64?

Asked 13/6, 2016 at 23:32 Answered 14/6, 2016 at 2:3

Solved c performance assembly optimization x86

Many methods found in high-performance algorithms could be (and are) simplified if they were allowed to read a small amount past the end of input buffers. Here, "small amount" generally means up to W - 1 bytes past the end, where W is the word size in bytes of the algorithm (e.g., up to 7 bytes for an algorithm processing the input in 64-bit chunks).

It's clear that writing past the end of an input buffer is never safe, in general, since you may clobber data beyond the buffer¹. It is also clear that reading past the end of a buffer into another page may trigger a segmentation fault/access violation, since the next page may not be readable.

In the special case of reading aligned values, however, a page fault seems impossible, at least on x86. On that platform, pages (and hence memory protection flags) have a 4K granularity (larger pages, e.g. 2MiB or 1GiB, are possible, but these are multiples of 4K) and so aligned reads will only access bytes in the same page as the valid part of the buffer.
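The arithmetic behind that claim can be checked directly. Below is a minimal sketch, assuming 4 KiB pages and a power-of-two access width no larger than a page; the helper names `within_one_page` and `align_down` are made up for illustration and are not part of any library.

```c
#include <stdbool.h>
#include <stdint.h>

#define PAGE_SIZE 4096u  /* assumption: smallest x86 page size */

/* Returns true if a w-byte access starting at addr touches only one page.
   w must be non-zero and at most PAGE_SIZE. */
bool within_one_page(uintptr_t addr, unsigned w)
{
    uintptr_t first = addr / PAGE_SIZE;
    uintptr_t last  = (addr + w - 1) / PAGE_SIZE;
    return first == last;
}

/* Rounds addr down to a multiple of w (w must be a power of two). */
uintptr_t align_down(uintptr_t addr, unsigned w)
{
    return addr & ~(uintptr_t)(w - 1);
}
```

An 8-byte load at an 8-byte-aligned address always passes `within_one_page`, while an unaligned 8-byte load at offset 4093 of a page does not.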

Here's a canonical example of some loop that aligns its input and reads up to 7 bytes past the end of buffer:

#include <stddef.h>
#include <stdint.h>

int processBytes(const uint8_t *input, size_t size) {

    const uint64_t *input64, *end64 = (const uint64_t *)(input + size);
    int res;

    if (size < 8) {
        // special case for short inputs that we aren't concerned with here
        return shortMethod();
    }

    // check the first 8 (possibly unaligned) bytes
    if ((res = match(*(const uint64_t *)input)) >= 0) {
        return res;
    }

    // align pointer to the next 8-byte boundary
    input64 = (const uint64_t *)(((uintptr_t)input + 8) & ~(uintptr_t)0x7);

    for (; input64 < end64; input64++) {
        if ((res = match(*input64)) >= 0) {
            // convert to an offset from input; discard matches past the end
            size_t pos = (size_t)((const uint8_t *)input64 - input) + (size_t)res;
            return pos < size ? (int)pos : -1;
        }
    }

    return -1;
}

The inner function int match(uint64_t bytes) isn't shown, but it is something that looks for a byte matching a certain pattern, and returns the lowest such position (0-7) if found or -1 otherwise.
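For concreteness, here is one way such a function might look. This is only a sketch of the interface described above, not the asker's actual code: the `PATTERN` byte is a hypothetical choice, and the byte numbering assumes a little-endian target (as on x86), so position 0 is the lowest-addressed byte of the loaded word.

```c
#include <stdint.h>

#define PATTERN 0x2Cu  /* hypothetical pattern: the ',' character */

/* Returns the position (0-7) of the lowest-addressed byte equal to
   PATTERN, or -1 if no byte matches. Assumes little-endian byte order. */
int match(uint64_t bytes)
{
    for (int i = 0; i < 8; i++) {
        if (((bytes >> (8 * i)) & 0xFF) == PATTERN)
            return i;
    }
    return -1;
}
```

A real implementation would more likely use a branch-free bit trick or SIMD compare, but the return contract is the same.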

First, cases with size < 8 are pawned off to another function for simplicity of exposition. Then a single check is done for the first 8 (unaligned) bytes. Then a loop is done for the remaining floor((size - 7) / 8) chunks of 8 bytes². This loop may read up to 7 bytes past the end of the buffer (the 7-byte case occurs when (input + size) & 0x7 == 1). However, the return statement has a check which excludes any spurious matches which occur beyond the end of the buffer.

Practically speaking, is such a function safe on x86 and x86-64?

These types of overreads are common in high performance code. Special tail code to avoid such overreads is also common. Sometimes you see the latter type replacing the former to silence tools like valgrind. Sometimes you see a proposal to do such a replacement, which is rejected on the grounds the idiom is safe and the tool is in error (or simply too conservative)³.

A note for language lawyers:

Reading from a pointer beyond its allocated size is definitely not allowed in the standard. I appreciate language lawyer answers, and even occasionally write them myself, and I'll even be happy when someone digs up the chapter and verse which shows the code above is undefined behavior and hence not safe in the strictest sense (and I'll copy the details here). Ultimately though, that's not what I'm after. As a practical matter, many common idioms involving pointer conversion, structure access through such pointers and so on are technically undefined, but are widespread in high quality and high performance code. Often there is no alternative, or the alternative runs at half speed or less.

If you wish, consider a modified version of this question, which is:

After the above code has been compiled to x86/x86-64 assembly, and the user has verified that it is compiled in the expected way (i.e., the compiler hasn't used a provable partially out-of-bounds access to do something really clever), is executing the compiled program safe?

In that respect, this question is both a C question and a x86 assembly question. Most of the code using this trick that I've seen is written in C, and C is still the dominant language for high performance libraries, easily eclipsing lower level stuff like asm, and higher level stuff like <everything else>. At least outside of the hardcore numerical niche where FORTRAN still plays ball. So I'm interested in the C-compiler-and-below view of the question, which is why I didn't formulate it as a pure x86 assembly question.

All that said, while I am only moderately interested in a link to the standard showing this is UB, I am very interested in any details of actual implementations that can use this particular UB to produce unexpected code. Now I don't think this can happen without some pretty deep cross-procedure analysis, but the gcc overflow stuff surprised a lot of people too...

¹ Even in apparently harmless cases, e.g., where the same value is written back, it can break concurrent code.

² Note that for this overlapping to work, this function and the match() function must behave in a specific idempotent way - in particular, the return value must support overlapping checks. So a "find first byte matching pattern" works since all the match() calls are still in-order. A "count bytes matching pattern" method would not work, however, since some bytes could be double counted. As an aside: some functions such as a "return the minimum byte" call would work even without the in-order restriction, but need to examine all bytes.

³ It's worth noting here that for valgrind's Memcheck there is a flag, --partial-loads-ok, which controls whether such reads are in fact reported as an error. The default is yes, which means that in general such loads are not treated as immediate errors, but that an effort is made to track the subsequent use of loaded bytes, some of which are valid and some of which are not, with an error being flagged if the out-of-range bytes are used. In cases such as the example above, in which the entire word is accessed in match(), such analysis will conclude the bytes are accessed, even though the results are ultimately discarded. Valgrind cannot in general determine whether invalid bytes from a partial load are actually used (and detection in general is probably very hard).

Ribald answered 13/6, 2016 at 23:32 Comment(17)

Theoretically a C compiler could implement its own checks that are more restrictive than those of the underlying hardware. – Amatory 13/6, 2016 at 23:43

If your user has verified that it is compiled in "the expected way", where the expected way is that the access is safe, then it is safe. Unfortunately if your user is not reading the assembly intermediate code he/she is not going to have any such guarantees. Don't do it. (You can make it safe by implementing your own memory managment) – Bouley 13/6, 2016 at 23:44

This looks more like an answer than a question :) As for the special tail code, that's normally only done if the algorithm proceeds in chunks but doesn't align first. – Lepidopteran 13/6, 2016 at 23:44

Why don't you just process all the 8-byte chunks using the loop, and then call shortMethod() for the last chunk? – Amatory 13/6, 2016 at 23:47

@Lepidopteran - you have perhaps detected my bias in that I think it is safe. Still I'm looking for answers that have good counter-examples showing that it is not safe, or plausible reasons why it may not be safe in the future, or even stronger reasoning why it is safe. At the very least, it can be a good link to point people to since this question comes up all the time in implementation, review, and discussion of high-perf code, but solid info about the practice is widely spread and hard to find. – Ribald 13/6, 2016 at 23:49

@Lepidopteran which tail code are you referring to? I don't have any tail code to process the unaligned final part of the buffer, which is why this approach is fast (at least subject to caveats like the underlying hardware having fast unaligned access). – Ribald 13/6, 2016 at 23:55

@Amatory because in many real-world cases the shortMethod() code, which generally proceeds a byte at a time, may be 8 times slower, per byte, than the loop above. So if you have, on average, ~40 byte chunks, you may easily spend as much of your time processing the ~4 tail bytes compared to the other ~36 "main" bytes. Added to that, having 2 loops (main and tail) instead of one will often incur 2x the mispredicts - one for each loop, and sometimes worse (since the main loop effectively quantizes the loop counts into buckets of 8). – Ribald 13/6, 2016 at 23:57

@Amatory ... and in the case of SIMD code, it may be 16 or 32 or ... times worse, and Amdahl's law will only kick in harder over time as vector lengths get longer. – Ribald 13/6, 2016 at 23:58

Well, there's always asm(). :) – Amatory 14/6, 2016 at 0:0

@Bouley - indeed, I said "in the expected way", not "in the safe way" because there are two aspects to this question. One is the likelihood of it compiling in the expected way. So far, it seems that it does, but I'm highly interested in cases where this might not be true. Lots of code compiled in the expected way in gcc too, until the signed overflow optimizations were implemented. So I'm interested in reasonable ways that kind of thing could occur here. – Ribald 14/6, 2016 at 0:1

Let me put it this way: don't ever "expect" a compiler to do a thing a certain way because it has done it that way before under conditions that seem similar to the programmer. That is the path to madness and non-portable code (hey, what's VC+ do? A hacked-up LLVM? etc) You give a great example of why to not do stuff like this above... – Bouley 14/6, 2016 at 0:3

@BadZen: Secondly, I'm interested in ways that even the expected assembly may not be safe. For example, someone might say "cache-line granularity memory protection is coming to/already exists in x86". Or they might find another way in which it is not safe - see for example my "writes past the end of the buffer" example for an idiom which was once considered safe, but was rendered unsafe by multi-CPU architectures. – Ribald 14/6, 2016 at 0:3

@Amatory - see my "Secondly..." answer to BadZen above for why this applies even to code you've written in asm by hand. BadZen - don't worry, I don't expect that. In particular, I'm looking for good reasons why this pattern may fail due to compiler enhancements in the future. I'm reasonably well versed in compiler technology so don't hold back and be specific! – Ribald 14/6, 2016 at 0:5

With regard to your first question, C makes no guarantees that the memory model you are working with even corresponds to anything in the underlying hardware for that sort of 'edge case' (with a couple of exceptions for things like word size, and even then it struggles). So no-go on that front. The "language legalese" says 'undefined' for good reason. With regard to the second question, you'd need to post specific ASM for the question to be meaningful. – Bouley 14/6, 2016 at 0:11

Special tail code to avoid such overreads is also common - I was referring to that. – Lepidopteran 14/6, 2016 at 0:15

@Lepidopteran - I'm not following. If the algorithm proceeds in chunks, whether it aligns or not, tail code is normally necessary. If processes in W-byte chunks and it aligns on an W-byte boundary, tail code is needed any time the end of the buffer doesn't fall on a W-byte boundary. If it proceeds in W-byte chunks without aligning, tail code is necessary any time input size is not a multiple of W. So in the absence of overread, tail code is necessary in general, and also "usually" if sizes are uniformly distributed. – Ribald 14/6, 2016 at 0:28

@Amatory - right, but I'm talking about C on x86. I'm interested in any real-world examples of x86 C compilers which compile this in a way that make the idiom unsafe. I disagree that explicit asm needs to be posted - just assume the "obvious" ASM implied by the C code. The exact asm doesn't matter really, just assume it overreads the last byte exactly like the sample code. – Ribald 14/6, 2016 at 0:31

Yes, it's safe in x86 asm, and existing libc strlen(3) implementations take advantage of this in hand-written asm. And even glibc's fallback C, but it compiles without LTO so it can never inline. It's basically using C as a portable assembler to create machine code for one function, not as part of a larger C program with inlining. But that's mostly because it also has potential strict-aliasing UB, see my answer on the linked Q&A. You probably also want a GNU C __attribute__((may_alias)) typedef instead of plain unsigned long as your wider type, like __m128i etc. already use.

It's safe because an aligned load will never cross a higher alignment boundary, and memory protection happens with aligned pages, so at least 4k boundaries.¹ Any naturally-aligned load that touches at least 1 valid byte can't fault. It's also safe to just check if you're far enough from the next page boundary to do a 16-byte load, like if ((p & 4095) > (4096 - 16)) do_special_case_fallback. See the section below about that for more detail.

It's also generally safe in C compiled for x86, as far as I know. Reading outside an object is of course Undefined Behaviour in C, but works in C-targeting-x86. I don't think compilers explicitly / on purpose define the behaviour, but in practice it works that way.

I think it's not the kind of UB that aggressive compilers will assume can't happen while optimizing, but confirmation from a compiler-writer on this point would be good, especially for cases where it's easily provable at compile-time that an access goes past the end of an object. (See discussion in comments with @RossRidge: a previous version of this answer asserted that it was absolutely safe, but that LLVM blog post doesn't really read that way).

This is required in asm to go faster than 1 byte at a time processing an implicit-length string. In C in theory a compiler could know how to optimize such a loop, but in practice they don't so you have to do hacks like this. Until that changes, I suspect that the compilers people care about will generally avoid breaking code that contains this potential UB.

There's no danger when the overread isn't visible to code that knows how long an object is. A compiler has to make asm that works for the case where there are array elements as far as we actually read. The plausible danger I can see with possible future compilers is: after inlining, a compiler might see the UB and decide that this path of execution must never be taken. Or that the terminating condition must be found before the final not-full-vector and leave that out when fully unrolling.

The data you get is unpredictable garbage, but there won't be any other potential side-effects. As long as your program isn't affected by the garbage bytes, it's fine. (e.g. use bithacks to find if one of the bytes of a uint64_t is zero, then a byte loop to find the first zero byte, regardless of what garbage is beyond it.)
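The bithack-plus-byte-loop idea mentioned in the parenthetical can be sketched as follows. This uses the well-known SWAR zero-byte test; the function names are illustrative, and the byte indexing assumes little-endian order as on x86.

```c
#include <stdint.h>

/* Non-zero if and only if at least one byte of v is 0x00. */
uint64_t has_zero_byte(uint64_t v)
{
    return (v - 0x0101010101010101ULL) & ~v & 0x8080808080808080ULL;
}

/* Index (0-7) of the lowest-addressed zero byte, or -1 if there is none.
   Bytes above the first zero are never consulted for the result, so
   whatever garbage they hold cannot change the answer. */
int first_zero_byte(uint64_t v)
{
    if (!has_zero_byte(v))
        return -1;
    for (int i = 0; i < 8; i++) {
        if (((v >> (8 * i)) & 0xFF) == 0)
            return i;
    }
    return -1;
}
```

The fast test rejects most words in a few ALU operations, and the byte loop only runs on the one word that actually contains the terminator.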

Unusual situations where this wouldn't be safe in x86 asm

Hardware data breakpoints (watchpoints) that trigger on a load from a given address. If there's a variable you're monitoring right after an array, you could get a spurious hit. This might be a minor annoyance to someone debugging a normal program. If your function will be part of a program that uses x86 debug registers DR0-DR3 and the resulting exceptions for something that could affect correctness, then be careful with this.

Or similarly a code checker like valgrind could complain about reading outside an object.
Under a hypothetical 16 or 32-bit OS that uses segmentation: A segment limit can use 4k or 1-byte granularity so it's possible to create a segment where the first faulting offset is odd. (Having the base of the segment aligned to a cache line or page is irrelevant except for performance). All mainstream x86 OSes use flat memory models, and x86-64 removes support for segment limits for 64-bit mode.
Memory-mapped I/O registers right after the buffer you wanted to loop over with wide loads, especially in the same 64B cache-line. This is extremely unlikely even if you're calling functions like this from a device driver (or a user-space program like an X server that has mapped some MMIO space).

If you're processing a 60-byte buffer and need to avoid reading from a 4-byte MMIO register, you'll know about it and will be using a volatile T*. This sort of situation doesn't happen for normal code.

strlen is the canonical example of a loop that processes an implicit-length buffer and thus can't vectorize without reading past the end of a buffer. If you need to avoid reading past the terminating 0 byte, you can only read one byte at a time.

For example, glibc's implementation uses a prologue to handle data up to the first 64B alignment boundary. Then in the main loop (gitweb link to the asm source), it loads a whole 64B cache line using four SSE2 aligned loads. It merges them down to one vector with pminub (min of unsigned bytes), so the final vector will have a zero element only if any of the four vectors had a zero. After finding that the end of the string was somewhere in that cache line, it re-checks each of the four vectors separately to see where. (Using the typical pcmpeqb against a vector of all-zero, and pmovmskb / bsf to find the position within the vector.) glibc used to have a couple different strlen strategies to choose from, but the current one is good on all x86-64 CPUs.

Usually loops like this avoid touching any extra cache-lines they don't need to touch, not just pages, for performance reasons, like glibc's strlen.

Loading 64B at a time is of course only safe from a 64B-aligned pointer, since naturally-aligned accesses can't cross cache-line or page boundaries.

If you do know the length of a buffer ahead of time, you can avoid reading past the end by handling the bytes beyond the last full aligned vector using an unaligned load that ends at the last byte of the buffer.

(Again, this only works with idempotent algorithms, like memcpy, which don't care if they do overlapping stores into the destination. Modify-in-place algorithms often can't do this, except with something like converting a string to upper-case with SSE2, where it's ok to reprocess data that's already been upcased. Other than the store-forwarding stall if you do an unaligned load that overlaps with your last aligned store.)

So if you are vectorizing over a buffer of known length, it's often best to avoid overread anyway.
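As an illustration of the known-length approach, here is a sketch that scans a buffer in 8-byte chunks and clamps the final chunk so that it ends exactly at the last byte, overlapping bytes already checked instead of reading past the end. The function names are invented for this example, scalar `uint64_t` chunks stand in for SIMD vectors, and the byte indexing assumes little-endian order.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Loads 8 bytes from a possibly unaligned address; memcpy avoids both
   alignment and strict-aliasing problems. */
static uint64_t load64(const uint8_t *p)
{
    uint64_t v;
    memcpy(&v, p, sizeof v);
    return v;
}

/* Returns the index of the first byte equal to target, or -1.
   Requires size >= 8. Never reads outside [buf, buf + size). */
ptrdiff_t find_byte_no_overread(const uint8_t *buf, size_t size, uint8_t target)
{
    assert(size >= 8);
    size_t off = 0;
    for (;;) {
        /* Clamp the final chunk so it ends at the last byte of buf. */
        if (off > size - 8)
            off = size - 8;
        uint64_t v = load64(buf + off);
        for (int i = 0; i < 8; i++) {
            if (((v >> (8 * i)) & 0xFF) == target)  /* little-endian order */
                return (ptrdiff_t)(off + (size_t)i);
        }
        if (off == size - 8)
            return -1;
        off += 8;
    }
}
```

Because the search is "find first", re-examining a few bytes in the overlapping final chunk cannot change the result, which is the idempotence property the answer refers to.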

Non-faulting overread of an object is the kind of UB that definitely can't hurt if the compiler can't see it at compile time. The resulting asm will work as if the extra bytes were part of some object.

But even if it is visible at compile-time, it generally doesn't hurt with current compilers.

PS: a previous version of this answer claimed that unaligned deref of int * was also safe in C compiled for x86. That is not true. I was a bit too cavalier 3 years ago when writing that part. You need a typedef with GNU C __attribute__((aligned(1),may_alias)), or memcpy, to make that safe. The may_alias part isn't needed if you only access it via signed/unsigned int* and/or char*, i.e. in ways that wouldn't violate the normal C strict-aliasing rules.

The set of things ISO C leaves undefined but that Intel intrinsics require compilers to define does include creating unaligned pointers (at least with types like __m128i*), but not dereferencing them directly. See also: Is `reinterpret_cast`ing between hardware SIMD vector pointer and the corresponding type an undefined behavior?

Checking if a pointer is far enough from the end of a 4k page

This is useful for the first vector of strlen; after this you can p = (p+16) & -16 to go to the next aligned vector. This will partially overlap if p was not 16-byte aligned, but doing redundant work is sometimes the most compact way to set up for an efficient loop. Avoiding it might mean looping 1 byte at a time until an alignment boundary, and that's certainly worse.

e.g. check (((p + 15) ^ p) & 0xFFF...F000) == 0 (LEA / XOR / TEST) which tells you that the last byte of a 16-byte load has the same page-address bits as the first byte. Or p+15 <= (p|0xFFF) (LEA / OR / CMP with better ILP) checks that the last byte-address of the load is <= the last byte of the page containing the first byte.

Or more simply, (p & 4095) > (4096 - 16) (MOV / AND / CMP) detects a load that would cross into the next page; equivalently, (p & (pgsize-1)) <= (pgsize - vecwidth) checks that the offset-within-page is far enough from the end of a page.

You can use 32-bit operand-size to save code size (REX prefixes) for this or any of the other checks because the high bits don't matter. Some compilers don't notice this optimization, so you can cast to unsigned int instead of uintptr_t, although to silence warnings about code that isn't 64-bit clean you might need to cast (unsigned)(uintptr_t)p. Further code-size saving can be had with ((unsigned int)p << 20) > ((4096 - vectorlen) << 20) (MOV / SHL / CMP), because shl reg, 20 is 3 bytes, vs. and eax, imm32 being 5, or 6 for any other register. (Using EAX will also allow the no-modrm short form for cmp eax, 0xfff.)
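The checks in this section can be written out in C with explicit parentheses (note that in C, `&` and `|` bind less tightly than comparison operators). This is a sketch assuming 4 KiB pages and 16-byte loads; the function names are illustrative only.

```c
#include <stdbool.h>
#include <stdint.h>

#define PGSIZE   4096u  /* assumed page size */
#define VECWIDTH 16u    /* width of the load being tested */

/* True if a VECWIDTH-byte load at p stays within the page containing p. */
bool load_stays_in_page(uintptr_t p)
{
    return (p & (PGSIZE - 1)) <= (PGSIZE - VECWIDTH);
}

/* Same test, phrased as "first and last byte share page-address bits". */
bool load_stays_in_page_xor(uintptr_t p)
{
    return (((p + (VECWIDTH - 1)) ^ p) & ~(uintptr_t)(PGSIZE - 1)) == 0;
}

/* Next VECWIDTH-aligned address strictly above p, as in p = (p+16) & -16. */
uintptr_t next_aligned(uintptr_t p)
{
    return (p + VECWIDTH) & ~(uintptr_t)(VECWIDTH - 1);
}
```

A caller would do the first (possibly unaligned) load only when `load_stays_in_page` returns true, fall back to a slower path otherwise, and then continue from `next_aligned(p)` with aligned loads.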

If doing this in GNU C, you probably want typedef unsigned long aliasing_unaligned_ulong __attribute__((aligned(1),may_alias)); to make it safe to do unaligned accesses.
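A short sketch of how that typedef might be used, next to the portable memcpy alternative. The attribute syntax is a GNU C extension (GCC and Clang), so this part is not ISO C; the two function names are made up for the example.

```c
#include <string.h>

/* GNU C extension: a type that may alias anything and has alignment 1. */
typedef unsigned long aliasing_unaligned_ulong
    __attribute__((aligned(1), may_alias));

/* Unaligned, aliasing-safe load through the attributed typedef. */
unsigned long load_ulong_gnu(const void *p)
{
    return *(const aliasing_unaligned_ulong *)p;
}

/* Portable equivalent; compilers typically turn this into a single load. */
unsigned long load_ulong_memcpy(const void *p)
{
    unsigned long v;
    memcpy(&v, p, sizeof v);
    return v;
}
```

Both functions should return the same value for any readable address; the typedef form is mainly convenient when a pointer of that type is stepped through a loop.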

Mete answered 14/6, 2016 at 2:3 Comment(20)

Err, umm... strlen (sort of) takes advantage of this, not by actually reading beyond the end of the buffer, but in casting to unsigned (if I recall correctly) and then unrolling and checking each of the 4-bytes for a nul-byte (in order) and then bailing at the nul-byte prior to actually accessing the nul-byte + 1. I'm not saying it's a bad analogy, but it's not quite a 1:1 analogy either. – Upstairs 14/6, 2016 at 4:49

@DavidC.Rankin: Think about what it means to load a uint32_t from memory into a register, when the terminating 0 might be the first byte. And besides that, I linked and explained the actual asm source for glibc's strlen, which reads in 64-byte chunks. So it reads up to 63 bytes beyond the end of the string, using 16-byte vectors. – Mete 14/6, 2016 at 5:0

If I understand what you are saying, when the cast is made, even though there hasn't been an access, the uint32_t loaded into whatever register for examination is a read beyond the end of the buffer. In that case, I do agree it would be an example in that regard. I was considering the other side of the same coin where while the cast was made, there had been no dereference of the byte beyond the nul-byte. "dereference" is probably the wrong word, but no jump based on the value of nul-byte + 1. – Upstairs 14/6, 2016 at 5:7

@DavidC.Rankin: uint32_t foo = *(uint32_t*)aligned_pointer will compile to a 32bit load. It doesn't matter if you only test the bytes of foo one at a time. If the behaviour of your code depends on what's in the bytes after the terminating 0, that's a bug, but loading them at all is what might cause a problem. Access checks happen on loads/stores; no information about where data came from is tracked by registers. glibc's strlen implementation even feeds the whole 64B through the ALUs to combine it down to one thing it can branch on. – Mete 14/6, 2016 at 5:17

(fun fact: weakly ordered architectures other than Alpha have rules for propagating dependencies through ALU operations, to support what C++11 calls memory_order_consume. But that's not at all the same thing as delaying access checks or MMIO side-effects until an ALU operation looks at certain bytes). @David: This is an x86 question, not a question about the C abstract machine. Reading outside a C object is undefined behaviour in C. It's well-defined what happens when you do it in C targeting x86, though. This isn't the kind of UB that compilers might decide can't happen when optimizing – Mete 14/6, 2016 at 5:20

@DavidC.Rankin: your comment got me thinking about C undefined-behaviour rules. That's actually a good point, since this is a C question, not just an x86 asm question. Updated my answer accordingly. Does that address what you were trying to get at with your comment? – Mete 14/6, 2016 at 5:39

Yes, that hits the nail on the head. Your scope of thinking is much broader than mine. I was looking at it from a narrow perspective of C and when the UB would occur based upon those rules. That's why I knew what you were saying and knew we were talking about the same info in the same register from two different points of view. Good discussion, I always end up learning something from them. – Upstairs 14/6, 2016 at 7:43

Thanks @PeterCordes, that's a comprehensive answer. Noting that existing widely used implementations do this gives a lot of weight to the idea that it's OK in other code too (for the limited cases where it makes a measurable difference). – Ribald 16/6, 2016 at 23:23

I don't see where the LLVM blog entry says that it's safe in C code. Instead it seems to say the opposite. That "Dereferences of Wild Pointers and Out of Bounds Array Accesses" are treated as undefined behaviour by Clang and GCC, and so subject to optimization. – Courteous 18/6, 2016 at 19:17

@RossRidge: Hmm, I think you're right; there might actually be a problem with doing this in C if the compiler can prove something about the array bounds at compile-time (or link-time optimization). I think it's always safe in practice, but maybe only with vector loads, since __m128i and so on are defined in gcc/clang as may_alias. I'd love to hear from a compiler-internals expert about whether my potentially over-confident assertions are correct. – Mete 19/6, 2016 at 4:49

If you have an array of known length, I think it's usually best to handle the last elements with an unaligned load that stops at the end anyway. So in practice, I think it should only be done in cases where the iteration count isn't known at the start of the loop, so the compiler will not be able to prove anything anyway. – Mete 19/6, 2016 at 4:51

There are at least 3 cases (for 80x86) where reading a little more may not (depending on scenario and OS) be safe: a) One or more "hardware data breakpoints" have been setup in the extra area, b) it's a memory mapped device with read side-effects, c) segmentation is being used – Slovenia 29/8, 2019 at 14:40

@Brendan: Thanks, collected the existing mentions of the last 2 into a list, adding HW breakpoints. – Mete 29/8, 2019 at 15:42

re: p+15 <= p|0xFFF why just not p % 4096 > 4080? – Canteen 26/3, 2021 at 3:49

@Noah: Yeah, p & (pgsize-1) > (pagesize - vectorlen) is cheaper to evaluate, and just as easy to think about. Good suggestion. – Mete 26/3, 2021 at 3:59

@PeterCordes Slightly better from a code size perspective is: ((unsigned int)p << 20) > ((pgsize - vectorlen) << 20). Same number of uops. – Canteen 26/3, 2021 at 8:14

@PeterCordes do you know of any questions about fastest way to detect if a load will cause a page cross in multiple buffers? I.e. fastest way to check if VEC_SIZE load will cause a page cross in either buffer in something like memcmp. With some false positives orl r0, r1; sall 20, r1; cmpl (4096 - VEC_SIZE) << 20, r1; ja is fast but can't think of a non-expensive way to get accurate version for 2. – Canteen 22/4, 2021 at 16:59

Best I can come up with is lea VSIZE(r0), r1; lea VSIZE(r2), r3; xor r0, r1; xor r2, r3; or r1, r3; test PAGE_SIZE, r3 but that's 4c latency. – Canteen 22/4, 2021 at 17:7

re: "p = (p+16) & -16" generally either save code size or an ALU using p | (VEC_SIZE - 1) and have + 1 base offset for address mode. Any aligned load from p = (p + 16) & -16 that can use imm8 offset will also use imm8 with p | 15. for p & -16 though you will only be able to do 7 vecs with imm8 encoding whereas p | 15 you will be able to do 8 for same ALU cost (Exception is evex encoding where odd offsets appears to have VERY long encoding). If you need "next address" p | 15; p += 1 will save a byte because incl vs addl. – Canteen 22/4, 2021 at 17:36

@Noah: I was playing around with similar ideas for The right way to use function _mm_clflush to flush a large struct, for trying to hit all cache-lines of a struct. Might be something there. You could ask a separate SO question about micro-optimizing page-crossing detection for memcpy, would be a good place for me to put anything I come up with, and to get input from other people. – Mete 23/4, 2021 at 0:53

If you permit consideration of non-CPU devices, then one example of a potentially unsafe operation is accessing out-of-bounds regions of PCI-mapped memory pages. There's no guarantee that the target device is using the same page size or alignment as the main memory subsystem. Attempting to access, for example, address [cpu page base]+0x800 may trigger a device page fault if the device is in a 2KiB page mode. This will usually cause a system bugcheck.

Snowden answered 14/6, 2016 at 0:17 Comment(7)

Can user-space code access such memory? Does access past the end of a PCI page trigger a page fault on x86/x86-64 systems? – Ribald 14/6, 2016 at 0:24

@Ribald Generally only the OS and kernel-mode components are allowed to create this kind of mapping, but there are several paths in which a kernel-mode component will hand off the mapped region to user-mode. For example, CUDA does this, and for similar performance reasons to the CPU side, usually does not perform any bounds checking on accesses. Accessing off the end will trigger a device page fault, which is usually worse than a process page fault, and often leaves the OS unrecoverable. Not sure about CUDA specifically though. – Snowden 14/6, 2016 at 0:36

Interesting. So if some PCI device is mapped at 0x50-0x94, and then I do an 8-byte read at 0x90, the CPU will pass through something like {8 byte read at 0x90 - 0x50 = 0x40} and then the PCI device will barf because its mapped region only covers (94-50) = 0x44 bytes? Or where exactly does the redirection from a memory access to a PCI device access happen? Kernel level? Hardware (CPU/MMU) level? – Ribald 14/6, 2016 at 0:41

That seems like an OS bug if it hands off a mapping to user space in such a way that the user-mode process can perform an access that crashes the whole system. Regardless of what the C spec says about undefined behavior, operating systems are not supposed to allow user-mode code to cause unrecoverable system-level errors. Anything undefined should be confined to the process. – Amatory 14/6, 2016 at 0:49

@Barmar: It happens all the time that sufficiently privileged user-mode programs get direct access to hardware, which is certainly sufficient to crash the system. man 2 iopl on a Linux box if you'd like to play around. X servers would likely be unusably slow if they didn't do this. (Or for a more dignified way for a userspace program to crash the system, man 2 shutdown.) – Cisalpine 14/6, 2016 at 1:7

Yeah, after I posted that I realized that the operation to get direct access is presumably limited to privileged users or applications, and they're expected to be safe (since a privileged user can also do things like shut down the system). – Amatory 14/6, 2016 at 1:9

@NateEldredge: IIRC, iopl is only for using the in / out instructions. Most modern hardware uses memory-mapped I/O for most of its interface, and software gets access to that by memory-mapping /dev/mem on Linux. But yes, user-space software can and does access hardware directly. – Mete 14/6, 2016 at 1:10
