Indexing an `unsigned long` variable and printing the result
C

2

32

Yesterday, someone showed me this code:

#include <stdio.h>

int main(void)
{
    unsigned long foo = 506097522914230528;
    for (int i = 0; i < sizeof(unsigned long); ++i)
        printf("%u ", *(((unsigned char *) &foo) + i));
    putchar('\n');

    return 0;
}

That results in:

0 1 2 3 4 5 6 7

I am very confused, mainly with the line in the for loop. From what I can tell, it seems like &foo is being cast to an unsigned char * and then having i added to it. I think *(((unsigned char *) &foo) + i) is a more verbose way of writing ((unsigned char *) &foo)[i], but this makes it seem like foo, an unsigned long, is being indexed. If so, why? The rest of the loop seems typical of printing all elements of an array, so everything seems to point to this being true. The cast to unsigned char * is further confusing me. I tried searching about casting integer types to char * specifically on Google, but my research got stuck after some unhelpful search results about casting int to char, itoa(), etc. 506097522914230528 specifically prints out 0 1 2 3 4 5 6 7, but other numbers appear to have their own unique 8 numbers shown in the output, and bigger numbers seem to fill in more zeroes.

Courson answered 16/2, 2021 at 14:21 Comment(24)
Convert 506097522914230528 to hexadecimal, it will make more sense.Thousand
And think little endian.Jeanettajeanette
@harold you're right, it is showing 706050403020100. Does that mean I'm treating this long like some sort of array by converting its address to a char * and dereferencing it?Courson
@Courson Bingo!Editorial
@Courson I like your characterization of this as "so, so messed up", but at the same time, what this exercise demonstrates is a pretty powerful and fundamental concept. Deep down, everything is just a blob (or, if you prefer, an array) of bytes. And in C a char * or an unsigned char * can access any byte in your address space that you're allowed to access.Lenient
This question is being discussed on meta.Spacious
I wouldn't even call it "messed up". Every object has an "object representation" that you can read via char*. That's how stuff like memcpy can copy any object (logically 1 char at a time, in practice with wider loads/stores.) And one way to write code that serializes data into a byte-stream (with native endianness.)Psittacine
It's also how stuff like SIMD intrinsics for accessing C objects work: like char* accesses, _mm_loadu_si128( pointer ) can safely access anything without violating strict-aliasing rules. (See: Is `reinterpret_cast`ing between hardware SIMD vector pointer and the corresponding type an undefined behavior?)Psittacine
@PeterCordes yes, that does seem to make things a bit more understandable. I was just so surprised when I realized what was going on. Being one byte in size, chars and char *s are definitely useful.Courson
Yeah, and even more importantly for this, char* is allowed to read any other type of object without triggering Undefined Behaviour (because of a special exception for it and unsigned char* in the strict aliasing part of the ISO C standard). Note that the reverse is not true; using unsigned long* to read through a char buf[] is still UB. (see Why does glibc's strlen need to be so complicated to run quickly? for a way to get around that with GNU C __attribute__((may_alias)) on a typedef, or using memcpy)Psittacine
@PeterCordes Yes, I had seen this post before, but I hadn't noticed that trick that was pulled. I guess I can understand why casting a char * to a long * is normally undefined behavior though; a char is one byte and a long is 8 (or sometimes 4), so unless a char[] size is a multiple of 8, you would end up getting some bytes that were not originally part of the char. And I guess a char * is safe from strict-aliasing because ultimately, everything is made out of bytes. I don't think you can store values in nybbles or anything smaller than a byte in modern systems.Courson
@mediocrevegetable1: There's no reason why C needs the strict-aliasing rule, other than making optimization easier sometimes (by type-based alias analysis). _Alignas(long) char buf[sizeof(long)]; is guaranteed to be exactly the same size as a long (and sufficiently aligned), but it's still not safe to point a long* at it and load from it. You can safely do the exact same type-punning in C99 using a union. It's just a quirk of C and C++ that pointer-casting type punning is automatically UB except for the special case of char / unsigned char; some other languages are different.Psittacine
@PeterCordes Ah, I see. I hadn't known about _Alignas before this.Courson
(Reading off the end of an array is UB for other reasons, strict aliasing isn't needed to forbid it.) But yes, sizeof(char) is 1 by definition. You could imagine a 4-bit CPU architecture where satisfying the C requirement for the value-range of unsigned char might require 2 separate 4-bit registers / memory locations to be grouped together as a char by an ISO C implementation... But that's not practical; 8-bit bytes are standard these days, and smaller was rare historically. Related: Can modern x86 hardware not store a single byte to mem?Psittacine
Also note that some C implementations don't enforce the strict-aliasing rule, e.g. MSVC always, or GCC with -fno-strict-aliasing. MS even recommends *(float*)&my_int32 as a way to type-pun an int holding a bit-pattern into a float. (Their compiler optimizes memcpy ok, I think, so writing non-portable crap like that just locks you in to continuing to use MSVC, with no benefit in the resulting asm. Although it is compact, only C++20 std::bit_cast is more readable.) Always remember that a specific C implementation can choose to define any behaviour that ISO C leaves undefined.Psittacine
@PeterCordes That's interesting to know, I've never used MSVC before (nor do I think I will, at least for a long time). It makes things very confusing, but I don't think I know enough about strict aliasing and pointers to judge if it is a good thing or not (Though a link in the comments of my answer mentions the pros of strict aliasing for a compiler #99150).Courson
Usually you only want to do stuff like this for wide loads from narrow data to write your own strlen or whatever in C using bithacks. OS kernels are often compiled with -fno-strict-aliasing because they tend to want to mess around with the same memory different ways, and often aren't careful to do it only using memcpy, char*, or GNU C __attribute__((may_alias)) typedefs. Strict aliasing can let a compiler optimize better sometimes, e.g. knowing that an int* store definitely won't change what's read from a float*.Psittacine
related: blog.llvm.org/2011/05/what-every-c-programmer-should-know.html discusses why UB gives compilers license to optimize.Psittacine
PS Here's my standard (re)search comment: Before considering posting please read the manual & google any error message & many clear, concise & precise phrasings of your question/problem/goal, with & without your particular names/strings/numbers & 'site:stackoverflow.com' & tags; read many answers. If you post a question, use one phrasing as title. Reflect your research. See How to Ask & the voting arrow mouseover texts. We cannot reason, communicate or search unless we make the effort to (re-re-re-)write clearly.Meilen
There are many other Q&A like the duplicate like What are the rules for casting pointers in C? including some more specific like What actually happens when a pointer to integer is cast to a pointer to char? but the duplicate has answers that mention the important technical terms implementation-defined behavior & undefined behavior.Meilen
@Meilen I approved the duplicate decision, mainly because due to the nature and phrasing of this question, it would be unlikely that people would stumble upon this question in the future. Nonetheless, the comment section of this post is useful and discusses many things in great detail while also providing helpful links. By approving the duplicate, I hope people in the future will stumble upon this post and (hopefully) learn something from the vast number of comments on this post.Courson
This is not an invalid cast, it is not a strict aliasing violation and not undefined behavior. The only thing that's implementation defined here is size of long and endianness. The duplicate is just plain wrong. I'll rollback and re-open.Mismate
The post is reasonably closed as duplicate of cast to unsigned char * semantics. (The first link in my last comment.) Probably the last duplicate used was a poor choice because although its answers answered this post its question was about undefined behaviour.Meilen
@PeterCordes Notably, the original rationale for strict aliasing was that if you had something like a function taking pointer to double, the compiler shouldn't need to worry if lvalue access to that double somehow made changes to the value of some external linkage int visible in the same translation unit. A very sound rationale. Then it all went haywire when people started to apply those same rules to integers of different size. And partially accessing an integer through a smaller type is a very common use-case, particularly in hardware-related programming. So these rules remain broken.Mismate
C
39

As a preface, this program will not necessarily run exactly like how it does in the question as it exhibits implementation-defined behavior. In addition to this, tweaking the program slightly can cause undefined behavior as well. More information on this at the end.

The first line of the main function defines an unsigned long foo as 506097522914230528. This seems confusing at first, but in hexadecimal it looks like this: 0x0706050403020100.

This number consists of the following bytes: 0x07, 0x06, 0x05, 0x04, 0x03, 0x02, 0x01, 0x00. By now, you can probably see its relation to the output. If you're still confused about how this translates into the output, take a look at the for loop.

for (int i = 0; i < sizeof(unsigned long); ++i)
        printf("%u ", *(((unsigned char *) &foo) + i));

Assuming a long is 8 bytes, this loop runs eight times (two hex digits are enough to represent every possible value of a byte, and the hex number has 16 digits, so it occupies 16 / 2 = 8 bytes). The really confusing part is the second line. Think about it this way: since two hex digits cover one byte, isolating the first two digits of this number (07) gives a byte value of seven, and isolating the last two digits (00) gives a byte value of zero. Now, assume the long is actually an array of bytes which, on a little-endian machine where the least significant byte is stored first, looks like this:

{00, 01, 02, 03, 04, 05, 06, 07}

We get the address of foo with &foo, cast it to an unsigned char * so that each access reads a single byte (two hex digits), then use pointer arithmetic to basically get foo[i], as if foo were an array of eight bytes. As I mentioned in my question, this probably looks less confusing as ((unsigned char *) &foo)[i].


A bit of a warning: This program exhibits implementation-defined behavior. This means that it will not necessarily work the same way or give the same output on all implementations of C. Not only is a long 32 bits in some implementations, but the order in which the bytes of 0x0706050403020100 are stored in the unsigned long (AKA endianness) is also implementation-defined. Credit to @philipxy for pointing out the implementation-defined behavior first. This type punning raises another issue, which @Ruslan pointed out: if the address of the long is cast to a pointer to some other type, such as float * or short *, rather than char * or unsigned char *, and the object is then accessed through that pointer, C's strict aliasing rule comes into play and you can get undefined behavior (credit for the link goes to @Ruslan as well). More detail on these two points in the comment section.

Courson answered 16/2, 2021 at 14:49 Comment(21)
And for extra credit, try changing the number to 2314886970912564552, and the printf format to %c. Or maybe 7308324466019755382.Lenient
@SteveSummit I tried both of them, nice one! I especially like the last one :)Courson
For this program to be meaningful (for example in the sense you describe) certain implementation-defined behaviour has to be defined by the implementation, but you don't discuss or identify it.Meilen
@Meilen Do you mean that this behavior has to be defined in the C standard? As far as I know, casting by pointer is generally undefined behavior, but I'll see if I can find any information on it. Thank you for clarifying the problem.Courson
@Meilen I found this quote from this link port70.net/%7Ensz/c/c11/n1570.html#6.3.2.3 "An integer may be converted to any pointer type. Except as previously specified, the result is implementation-defined, might not be correctly aligned, might not point to an entity of the referenced type, and might be a trap representation." Is this relevant?Courson
Again: The program only has meaning if certain "implementation defined behaviour" is defined in a certain way by the implementation. That's a C technical term, research it. It is relevant to your answer in that your answer claims without justification that the program does a certain thing & it would only be justified under certain implementation-defined circumstances. If you think the language is defined to act per your post, you are wrong. Of course the author of the code has such expectations & it affected their writing that code, whether that was appropriate for them to expect or not.Meilen
@Meilen I think I get what you mean now. I'll update my answer to note that what happens in it is not necessarily true for everyone/every implementation. Thanks.Courson
Note that, if you do type punning to other types than unsigned char or char, you may easily get undefined behavior, rather than merely implementation-defined one. This is due to strict aliasing rules of C and C++.Roshelle
@Roshelle interesting, I had never heard of strict-aliasing before. I'll add a brief note in the answer too.Courson
@Roshelle You can get undefined behavior when doing type punning for other reasons, too: "A pointer to an object type may be converted to a pointer to a different object type. If the resulting pointer is not correctly aligned for the referenced type, the behavior is undefined." Note that dereferencing the pointer is not necessary to invoke undefined behavior - the mere conversion is sufficient. So given char *p, this can be UB for reasons in addition to strict aliasing: int64_t *q = (int64_t *)(&p[n]);Animated
@AndrewHenle: Fortunately, _Alignof(char) is guaranteed to be 1, same as sizeof(char), so it's always safe to create and even deref an unsigned char* to an object. Also note that while ISO C doesn't define the behaviour of creating a misaligned pointer, some implementations do define it (e.g. because they'd have to go out of their way to break such code, and because it's required by some extensions, like for Intel's SIMD intrinsic). Of course deref of a misaligned int64_t is unsafe even on x86 because of UB.Psittacine
@PeterCordes That would be the last part of 6.3.2.3p7: "When a pointer to an object is converted to a pointer to a character type, the result points to the lowest addressed byte of the object. Successive increments of the result, up to the size of the object, yield pointers to the remaining bytes of the object." That wouldn't fit into my original comment. And thanks for another example of misaligned access failing on x86. Those are always useful for the "see-no-evil" delusionals who insist on doing misaligned accesses "because it works!"Animated
Not sure what all these comments about alignment and strict aliasing are for. Sure, those are issues with types other than unsigned char, but this example does use unsigned char, so it's fine in that respect. The example is indeed implementation-defined in that unsigned long might not be 64 bits wide, and might not be little-endian, so if you're going to criticize it, please do so on that basis.Lenient
Assuming a long is 8 bytes long, and a(n unsigned) char is 1 bytes char! (ducks to take cover)Calipee
The other main things to research are "the object model", "casting", "pointers" & "arrays". PS In the question you could at least give a researched take on what (unsigned char *) &foo means & requires (what circumstances are required for what meanings), or how you are stuck doing so, and so on until you are stuck. If you can't report what you think is a clear understanding, why post a whole program? Some might otherwise consider the question unfocused. And lack of research is downvotable. PS p+n is more fundamental than p[n]; an array is just a sequence of one or more contiguous objects.Meilen
@Meilen as in I should edit my question to clarify exactly what I understand by *(((unsigned char *) &foo) + i) and what effort I have made to further understand this? And I am aware p[n] is just syntactic sugar over *(p + n), I just wanted to clarify in my answer that this was the exact same as doing ((unsigned char *) &foo)[i] just to make it easier to look at and understand.Courson
Did you research authoritative documentation & google re fragments of the code (starting with char & unsigned char & char * & unsigned char * & casts of those) & char & unsigned char & casting to pointer to char etc? (Rhetorical.) You should very very quickly find out why casting to a char * is done & what it allows you to do, etc etc & also hit the technical terms I said you should now research. Googling 'site:stackoverflow.com "c" -"c++" indexing into an unsigned long using pointer to char' these immediately for me hit answers to extracting bytes from sequences representing integers.Meilen
In answer to your last comment, yes. Yes yes. How to Ask help center Meta Stack Overflow Meta Stack Exchange minimal reproducible example The idea from my preceding comment is, focus your question. Don't just effectively ask for yet another definition of & introduction to (the parts used of) the language with a bespoke tutorial.Meilen
@Meilen to be honest, I definitely did not go as deep into searching as you did (I hadn't even realized you could focus on one site specifically). Mainly, I focused on the "casting integer types to char *" part, which was not very helpful (I mostly got results that showed casting from integer to char or functions like itoa()). I couldn't really find much based on even the casting to char *, let alone words like implementation-defined behavior and strict aliasing that are being thrown around now.Courson
@Meilen I hadn't even found this, this will hopefully be useful. I'll look out for questions that are similar to mine, so thank you for providing this link. As for now, I'll try to improve upon my answer, show what I tried, etc. Thanks for all the help.Courson
@Meilen edited my question, so hopefully, it is more focused now.Courson
M
11

There's already an answer explaining what the code does, but since this post for some reason is getting a lot of strange attention and getting repeatedly closed for the wrong reasons, here are some more insights on what the code does, what C guarantees and what it does not guarantee:


  • unsigned long foo = 506097522914230528;. This integer constant is roughly 506 * 10^15. It may or may not fit inside an unsigned long, depending on whether long is 4 or 8 bytes large on your system (implementation-defined).

    In case of 4 byte long, this will get truncated to 0x03020100 1).

    In case of 8 byte long, it can handle numbers up to 18.44 * 10^18 so the value will fit.

  • ((unsigned char *) &foo) is a valid pointer conversion and well-defined behavior. C17 6.3.2.3/7 makes this guarantee:

    A pointer to an object type may be converted to a pointer to a different object type. If the resulting pointer is not correctly aligned for the referenced type, the behavior is undefined. Otherwise, when converted back again, the result shall compare equal to the original pointer.

    The concern about alignment does not apply since we have a pointer to character.

    If we keep reading 6.3.2.3/7:

    When a pointer to an object is converted to a pointer to a character type, the result points to the lowest addressed byte of the object. Successive increments of the result, up to the size of the object, yield pointers to the remaining bytes of the object.

    This is a special rule allowing us to inspect any type in C through a character type. Whether the successive increments are done by pointer++ or by pointer arithmetic pointer + i doesn't matter, as long as we keep pointing within the inspected object, which i < sizeof(unsigned long) ensures. This is well-defined behavior.

  • Another special rule "strict aliasing" that was mentioned contains a similar exception for characters. It is in sync with the 6.3.2.3/7 rule. Specifically, "strict aliasing" allows (C17 6.5/7):

    An object shall have its stored value accessed only by an lvalue expression that has one of the following types:
    ...

    • a character type.

    The "stored object" in this case is unsigned long and should normally only get accessed as such. However, when the unsigned char* is de-referenced with * we access it as a character type. This is allowed by the exception to the strict aliasing rule mentioned above.

    As a side note, the other way around, accessing an array of unsigned char arr[sizeof(long)] through an *(unsigned long*)arr lvalue access would have been a strict aliasing violation and undefined behavior. But this is not the case here.

  • Using %u to print a character is strictly speaking not correct since printf then expects an unsigned int. However, since printf is a variadic function, it comes with some oddball implicit promotion rules that make this code well-defined. The unsigned char value will get promoted by the default argument promotions 2) to type int. printf then internally re-interprets this int as unsigned int. It can't be a negative value because we started from unsigned char. The conversion 3) is well-defined and portable.

  • So we get the byte values one by one. The hex representation is 07 06 05 04 03 02 01 00 but how this is stored in an unsigned long is CPU specific/implementation-defined behavior. Which in turn is a very common FAQ, see What is CPU endianness? which contains a very similar example to this code.

    On little endian it will print 0 1 2..., on big endian it will print 7 6 5....


1) See the unsigned integer conversion rule C17 6.3.1.3/2.
2) C17 6.5.2.2/6.
3) C17 6.3.1.3/1 "When a value with integer type is converted to another integer type other than _Bool, if the value can be represented by the new type, it is unchanged."

Mismate answered 18/2, 2021 at 15:36 Comment(2)
Hello @Lundin, and thank you for providing an answer which addresses some of the comments with relevant sources. I hadn't realised that truncation occurs as opposed to overflow when you define a variable with a value too large. Thanks to the sizeof(unsigned long) in the for loop condition, that should mean that even on a system with a 32-bit long, this program should print 0 1 2 3 or 3 2 1 0 (depending on endianness, as you mentioned), right?Courson
@Courson Strictly speaking there's some mathematical modulus "Otherwise, if the new type is unsigned, the value is converted by repeatedly adding or subtracting one more than the maximum value that can be represented in the new type until the value is in the range of the new type." but in this case it is the same as truncating. Unsigned variables cannot overflow, only wrap-around. Had you used signed variables it would have been another story.Mismate
