(Disclaimer: I've seen this question, and I am not re-asking it -- I am interested in why the code works, and not in how it works.)
So here's this implementation of Apple's (well, FreeBSD's) strlen()
. It uses a well-known optimization trick, namely it checks 4 or 8 bytes at once, instead of doing a byte-by-byte comparison to 0:
size_t strlen(const char *str)
{
const char *p;
const unsigned long *lp;
/* Skip the first few bytes until we have an aligned p */
for (p = str; (uintptr_t)p & LONGPTR_MASK; p++)
if (*p == '\0')
return (p - str);
/* Scan the rest of the string using word sized operation */
for (lp = (const unsigned long *)p; ; lp++)
if ((*lp - mask01) & mask80) {
p = (const char *)(lp);
testbyte(0);
testbyte(1);
testbyte(2);
testbyte(3);
#if (LONG_BIT >= 64)
testbyte(4);
testbyte(5);
testbyte(6);
testbyte(7);
#endif
}
/* NOTREACHED */
return (0);
}
Now my question is: maybe I'm missing the obvious, but can't this read past the end of a string? What if we have a string of which the length is not divisible by the word size? Imagine the following scenario:
|<---------------- all your memories are belong to us --------------->|<-- not our memory -->
+-------------+-------------+-------------+-------------+-------------+ - -
| 'A' | 'B' | 'C' | 'D' | 0 |
+-------------+-------------+-------------+-------------+-------------+ - -
^ ^^
| ||
+------------------------------------------------------++-------------- - -
long word #1 long word #2
When the second long word is read, the program accesses bytes that it shouldn't in fact be accessing... isn't this wrong? I'm pretty confident that Apple and the BSD folks know what they are doing, so could someone please explain why this is correct?
One thing I've noticed is that beerboy asserted this to be undefined behavior, and I also believe it indeed is, but he was told that it isn't, because "we align to word size with the initial for loop" (not shown here). However, I don't see at all why alignment would be any relevant if the array is not long enough and we are reading past its end.
char
is typically ASCII (codes 0 - 127) is itself noteworthy. – Henning