Why does this implementation of strlen() work?
Asked Answered
S

2

21

(Disclaimer: I've seen this question, and I am not re-asking it -- I am interested in why the code works, and not in how it works.)

So here's this implementation of Apple's (well, FreeBSD's) strlen(). It uses a well-known optimization trick, namely it checks 4 or 8 bytes at once, instead of doing a byte-by-byte comparison to 0:

size_t strlen(const char *str)
{
    const char *p;
    const unsigned long *lp;

    /* Skip the first few bytes until we have an aligned p */
    for (p = str; (uintptr_t)p & LONGPTR_MASK; p++)
        if (*p == '\0')
            return (p - str);

    /* Scan the rest of the string using word sized operation */
    for (lp = (const unsigned long *)p; ; lp++)
        if ((*lp - mask01) & mask80) {
        p = (const char *)(lp);
        testbyte(0);
        testbyte(1);
        testbyte(2);
        testbyte(3);
#if (LONG_BIT >= 64)
        testbyte(4);
        testbyte(5);
        testbyte(6);
        testbyte(7);
#endif
    }

    /* NOTREACHED */
    return (0);
}

Now my question is: maybe I'm missing the obvious, but can't this read past the end of a string? What if we have a string of which the length is not divisible by the word size? Imagine the following scenario:

|<---------------- all your memories are belong to us --------------->|<-- not our memory -->
+-------------+-------------+-------------+-------------+-------------+ - -
|     'A'     |     'B'     |     'C'     |     'D'     |      0      |
+-------------+-------------+-------------+-------------+-------------+ - -
^                                                      ^^
|                                                      ||
+------------------------------------------------------++-------------- - -
                       long word #1                      long word #2

When the second long word is read, the program accesses bytes that it shouldn't in fact be accessing... isn't this wrong? I'm pretty confident that Apple and the BSD folks know what they are doing, so could someone please explain why this is correct?

One thing I've noticed is that beerboy asserted this to be undefined behavior, and I also believe it indeed is, but he was told that it isn't, because "we align to word size with the initial for loop" (not shown here). However, I don't see at all why alignment would be any relevant if the array is not long enough and we are reading past its end.

Salinometer answered 1/12, 2013 at 12:58 Comment(3)
Note: Aside from the potential UB well answered, the method's "5.2 times as fast" by taking advantage that char is typically ASCII (codes 0 - 127) is itself noteworthy.Henning
As a general principle the code that appears in a standard library implementation doesn't exist in a standard C environment. Sometimes this is to the "benefit" of the code -- it can rely on implementation-specific guarantees that the standard doesn't offer, as in this case. Sometimes it's to the "detriment" of the code -- for example there might be an obvious implementation of X in terms of Y that you can't do because you already chose the obvious implementation of Y in terms of X.Rella
For x86, duplicate of Is it safe to read past the end of a buffer within the same page on x86 and x64?Casimiracasimire
K
23

Although this is technically undefined behavior, in practice no native architecture checks for out-of-bounds memory access at a finer granularity than the size of a word. So while garbage past the terminator may end up being read, the result will not be a crash.

Kevon answered 1/12, 2013 at 13:4 Comment(15)
I gues it might crash if it happens to be at the boundary of a guard page?Gradient
No, because every word which is read is guaranteed to be at least partially in the array, and hence not trapped.Kevon
There is a school of thinking that asserts that any operation merely touching random bytes leads into an undefined state. However, the mere act of reading data is not UB -- and the random data that gets accessed "after" the end of the string gets never processed.Stanleigh
@Stanleigh In fact, reading uninitialized variables, for example, is in itself undefined behavior, so is accessing an array out of bounds.Salinometer
@H2CO3: I don't think it is. Consider if (x != 0 && y/x) -- dividing by zero is bad but it is never done. Reading a memory-wise valid but undefined byte and never doing anything with it sounds okay to me. I agree this philosophically touches upon the sound of a falling tree. Maybe we should call it "quantum behavior".Stanleigh
@Stanleigh What "sounds OK" to you needn't agree with what's in the Standard. And the Standard says it's UB, so it's UB, period.Salinometer
@H2CO3: that's rather worrisome, I use strlen all the time! What should I use instead?Stanleigh
@Devolus: notice that the code starts reading byte by byte until it reaches a word-aligned address. Then, only words containing at least some part of the array are read. Though, it's easy to blunder and read one word too much: #18525008Brazee
@Stanleigh strlen(). I don't doubt it works on actual, specific implementations, despite being UB, but that is only the case because the writers of the standard library have intimate knowledge of the platform and/or the compiler they are using. In a sense, it can be OK for library/other implementation code to rely on certain aspects of UB, but it is never OK for user code to do so.Salinometer
@Jongware: the implementation has additional guarantess that the standard doesn't mandate, and abuses them in their internal implementation of strlen(). You don't care how strlen() is implemented internally, you use the guarantees the standard promises you abot strlen().Brazee
@H2CO3: can we agree, then, that in the case under discussion -- this implementation of strlen -- UB is not triggered because it relies on the fact that any bytes in the UB DangerZone are never tested?Stanleigh
@Stanleigh I think yes... or, more precisely, one can tell this if one considers the actual, low-level implementation details of the actual OS, compiler, etc., and it's still UB according to the standard, but here the undefined-ness, thanks to the implementation, results in correct working.Salinometer
@Jongware: it is UB: you cannot take that code and move it to an arbitrary architecture, most notably to a !MMU (non-paged based) architecture, where a string may end at the end of the data segment which may have a byte-granularity limit set at the MPU.Brazee
@ninjalj: I get your point, but this is only FreeBSD's version of strlen. I suppose any architectures or compilers where this problem might happen supply their own, adjusted, libraries. (And regularly have to explain "no, this one trick will not work on our system".)Stanleigh
That's why FreeBSD developers use very specific compilers to build FreeBSD: so that all "UB"-parts actually has (implementation-)defined behaviour they can rely on.Ergonomics
S
4

I don't see at all why alignment would be any relevant if the array is not long enough and we are reading past its end.

The routine starts with aligning to a word boundary for two reasons: first, reading words from an aligned address is faster on most architectures (and it's also mandatory on a few CPUs). The speed increase is enough to use the same trick in a host of similar operations: memcpy, strcpy, memmove, memchr, etc.

Second: if you continue reading words starting at a word boundary, you are assured the rest of the string resides in the same memory page. A string (including its terminating zero) cannot straddle a memory page boundary, and neither can reading a word. (1)

So this is always fastest and safest, even if the memory page granularity is sizeof(LONG_BIT) (which it isn't).

Picking up an entire word near the end of a string may pick up additional bytes after the final zero, but reading Undefined Bytes from valid memory is not UB -- only acting upon its contents is (2). If the word contains a zero terminator anywhere inside, the individual bytes are inspected with test_byte, and this, as is shown in the original source, will never act on bytes after the terminator.

(1) Obviously they can, but I meant "never into a locked page" or something similar.

(2) Under Discussion. See (sorry about that!) under Sneftel's answer.

Stanleigh answered 1/12, 2013 at 13:36 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.