A question about union in C - store as one type and read as another - is it implementation defined?
I was reading about unions in C in K&R. As far as I understand, a single union variable can hold any one of several types, and if something is stored as one type and extracted as another, the result is purely implementation-defined.

Now please check this code snippet:

#include<stdio.h>

int main(void)
{
  union a
  {
     int i;
     char ch[2];
  };

  union a u;
  u.ch[0] = 3;
  u.ch[1] = 2;

  printf("%d %d %d\n", u.ch[0], u.ch[1], u.i);

  return 0;
}

Output:

3 2 515

Here I am assigning values in the u.ch but retrieving from both u.ch and u.i. Is it implementation defined? Or am I doing something really silly?

I know this may seem very basic to most people, but I am unable to figure out the reason behind that output.

Thanks.

Foetus answered 28/11, 2009 at 11:54 Comment(4)
515 = 256x2 + 3. On Intel processors, the low byte is stored before the high byte, so ch[0] is the low byte of a 2-byte integer. Btw, you are assigning numeric values to char variables. I would at least expect a warning about that.Puett
@Workshop Alex You mean u.ch[0]=3;? Why should you have a warning about that? char is only the shortest of the integer types, so why should it be prevented from receiving values written in decimal? Nothing prevents using int x='c'; either. Which of "signed char" or "unsigned char" should be reserved for ASCII codes, in your interpretation, and what use would there be for the other?Norman
@Alex: this is C, C++ has stronger type checking. In C, it is perfectly valid to assign integers to char variables. In fact a literal char is an int. Try this, in both C and C++: printf("sizeof literal char: %d\n", (int)sizeof 'X');.Liatrice
@Liatrice It's perfectly valid to assign integers to char variables in C++ too. :-)Conchaconchie
B
33

This is undefined behaviour. u.i and u.ch are located at the same memory address. So, the result of writing into one and reading from the other depends on the compiler, platform, architecture, and sometimes even the compiler's optimization level. Therefore the output for u.i may not always be 515.

Example

For example gcc on my machine produces two different answers for -O0 and -O2.

  1. Because my machine has a 32-bit little-endian architecture, with -O0 I end up with the two least significant bytes initialized to 3 and 2, while the two most significant bytes are uninitialized. So the union's memory looks like this: {3, 2, garbage, garbage}

    Hence I get output similar to 3 2 -1216937469.

  2. With -O2, I get the output of 3 2 515 like you do, which makes union memory {3, 2, 0, 0}. What happens is that gcc optimizes the call to printf with actual values, so the assembly output looks like an equivalent of:

    #include <stdio.h>
    int main() {
        printf("%d %d %d\n", 3, 2, 515);
        return 0;
    }
    

    The value 515 can be obtained as explained in other answers to this question. In essence it means that when gcc optimized the call, it chose zeroes as the value of the would-be uninitialized bytes of the union.

Writing to one union member and reading from another usually does not make much sense, but sometimes it may be useful for programs compiled with strict aliasing.

Bitolj answered 28/11, 2009 at 12:0 Comment(11)
I am almost convinced that the behavior is implementation defined, but the source of the problem is making me think otherwise. Did you really try the code in a compiler?Foetus
Yes, in fact output for me is -1216937469Bitolj
OK, in -O2 case gcc passes constants 2, 3 and 515 on the stack to printf, which is what it thinks the union would contain (the union is optimized out). That's not the case with -O0, however!Bitolj
To be pedantic, actually it's undefined behavior, not an implementation-defined result. The difference is that "implementation-defined" in the standard means that there must be a result, and the implementation must document what that result is. Undefined means the implementation is allowed to just crash, do something random, or whatever. "whatever" permits "do something sensible, and document what that is". In practice, implementations always do something sensible in this case and document what it is, because it's such a widely-used trick. So it appears implementation-defined.Phidippides
There is an exception in the standard for unions of structs which have a common initial sequence of members, which provides defined behaviour. See 6.5.2.3/5. Also, since pretty much all implementations don't have trap representations for integer types, those are pretty safe. But it is legal for int to have padding bits, in which case assigning to the char array in that union could create a trap representation (or the unassigned bytes could include padding bits). The attempt to print it would then be undefined behaviour.Phidippides
@Alex: while a quick search of the Standard hasn't turned up any specific requirement for union alignment, structs and classes are guaranteed to put the first data element at the starting memory address, aligning the entire type based on its requirements, and every implementation I've ever heard of will do something similar for unions: align the union based on the strictest requirements of any member type. The first paragraph of your answer implies the result is different because of different alignment - that's not true - it's different because of the attempted read of uninitialised memory.Archeozoic
@Tony I'm not sure what I meant when I was writing this answer. I rephrased it better now.Bitolj
This answer is incorrect. In C 1999 and C 2011, reading a union member other than the last stored member is not per se undefined. The bytes are reinterpreted in the new type. The specific details are implementation-defined, not undefined. This may result in a trap representation, causing undefined behavior, but that is a consequence of the new value, not of the union member access and, depending on the specific types involved, may be fully defined by the standard.Othilia
C 1999 was specifically changed for this in Technical Corrigendum 3, per this defect report.Othilia
TC3 doesn't change the interpretation that was valid previously, it is just a clarification of the behavior that the committee wanted from the start.Latimore
On normal platforms this is unspecified behaviour, not undefined. See 6.2.6.1/7 (same in C99 and C11 I think) (It could be undefined on a platform where there is a possible trap representation for u.i)Loudmouth
B
20

The answer to this question depends on the historical context, since the specification of the language changed with time. And this matter happens to be the one affected by the changes.

You said that you were reading K&R. The latest edition of that book (as of now), describes the first standardized version of C language - C89/90. In that version of C language writing one member of union and reading another member is undefined behavior. Not implementation defined (which is a different thing), but undefined behavior. The relevant portion of the language standard in this case is 6.5/7.

Now, at some later point in evolution of C (C99 version of language specification with Technical Corrigendum 3 applied) it suddenly became legal to use union for type punning, i.e. to write one member of the union and then read another.

Note that attempting to do that can still lead to undefined behavior. If the value you read happens to be invalid (so called "trap representation") for the type you read it through, then the behavior is still undefined. Otherwise, the value you read is implementation defined.

Your specific example is relatively safe for type punning from int to char[2] array. It is always legal in C language to reinterpret the content of any object as a char array (again, 6.5/7).

However, the reverse is not true. Writing data into the char[2] array member of your union and then reading it as an int can potentially create a trap representation and lead to undefined behavior. The potential danger exists even if your char array has sufficient length to cover the entire int.

But in your specific case, if int happens to be larger than char[2], the int you read will cover uninitialized area beyond the end of the array, which again leads to undefined behavior.

Billen answered 28/11, 2009 at 16:24 Comment(7)
Are you sure this is correct? You can create a valid int by memcpy from another int, which is assembling it as unsigned char units (the representation). I believe it's equally valid to do this yourself as long as you have some way of ensuring that you create a valid representation. Note that the (very common) condition INT_MIN==-(2^(CHAR_BIT*sizeof(int)-1)) ensures that all representations are valid.Rube
@R..: I don't understand the question. I'm talking about the abstract general case. Under the specific conditions you mention (even though they are very widespread) it might be valid, but in general case it is not. In general case int might have invalid representations.Billen
"Writing data into the char array member of your union and then reading it as an int is, again, undefined behavior." not "is", "might be", depending on the data.Ginger
This keeps confusing me to no end. Apparently: the original C99 said in the main text and in Annex J that reading a different member than the last member stored to is unspecified behaviour (not undefined), TC3 changed the main text as per DR283 (open-std.org/jtc1/sc22/wg14/www/docs/dr_283.htm) to specify that the bytes corresponding to the last stored-to member are implementation-defined (but what happened to Annex J?), and C1x finally changed both the main text and Annex J. Anyone has access to TC3?Multicolored
@ninjalj: You are correct. TC3 note 82 says “If the member used to access the contents of a union object is not the same as the member last used to store a value in the object, the appropriate part of the object representation of the value is reinterpreted as an object representation in the new type as described in 6.2.6 (a process sometimes called "type punning"). This might be a trap representation.” This answer is wrong; it is not per se undefined behavior to read from a union member other than the last one stored.Othilia
@ninjalj: I was answering the question thinking about the language described in K&R book (since the OP mentioned K&R book), which is C89/90. In C89/90 doing this is undefined. In C99+TC3 it became implementation-defined. I updated the answer to reflect that distinction.Billen
@AnT: Under C89, the behavior was Implementation-Defined. The authors of DR#028 seem to have thought such action invoked Undefined behavior, but 3.3.2.3 of the C89 draft says "With one exception, if a member of a union object is accessed after a value has been stored in a different member of the object, the behavior is implementation-defined." I'm not sure how anyone could read that as saying the behavior is undefined, since it seems rather explicit that it isn't.Abbieabbot
M
9

The reason behind the output is that on your machine integers are stored in little-endian format: the least-significant bytes are stored first. Hence the byte sequence [3,2,0,0] represents the integer 3+2*256=515.

This result depends on the specific implementation and the platform.

Mane answered 28/11, 2009 at 12:5 Comment(3)
I really liked your answer.Thanks,Foetus
Technically undefined, not implementation-defined. The terms have different meanings in the standard.Phidippides
@SteveJessop Undefined even on platforms whose int lacks a trap representation? C99 TC3 allows type-punning.Concent
A
5

It is implementation dependent and results might vary on a different platform/compiler but it seems this is what is happening:

515 in binary is

1000000011

Padding zeros to make it two bytes (assuming 16 bit int):

0000001000000011

The two bytes are:

00000010 and 00000011

Which is 2 and 3

Hope someone explains why they are reversed - my guess is that chars are not reversed but the int is little endian.

The amount of memory allocated to a union is at least the memory required to store its largest member. In this case, you have an int and a char array of length 2. Assuming int is 16 bits and char is 8 bits, both require the same space, and hence the union is allocated two bytes.

When you assign three (00000011) and two (00000010) to the char array, the state of the union is 0000001100000010. When you read the int from this union, it converts the whole thing into an integer. Assuming little-endian representation, where the LSB is stored at the lowest address, the int read from the union would be 0000001000000011, which is the binary for 515.

NOTE: This holds true even if the int was 32 bit - Check Amnon's answer

Actinic answered 28/11, 2009 at 12:0 Comment(9)
My processor is little-endianFoetus
That was a mistake - I was referring to little endian though I typed big endian. This is what is happening - even if your int is 32 bit. See the update.Actinic
How will you explain ? int main(void) { union a{ int i; char ch[3]; }; union a u; u.ch[0] = 3; u.ch[1] = 2; u.ch[2] = 2; printf("%d %d %d\n",u.ch[0],u.ch[1],u.i); return 0; } Foetus
or 131842 - if you got one of these, i think i know whats happening - otherwise :(Actinic
No, in my machine the output is exactly the same as before.Foetus
What does printf("%d %d", (int)sizeof(int), (int)sizeof(char)) print?Actinic
my guess is that your int is 16 bit. I got 131587 as expected on my 32 bit int compiler (that printed 4 for sizeof int)Actinic
In the question paper it is given to take integer as 2 byte size.Foetus
Then it is clear. The union is allocated the size of char array (3 bytes) which is greater than the size of an int (2 bytes). When you read int from the union, it considers only the first two bytes (and inverts them thanks to little endian processor) and hence the result 515.Actinic
P
5

The output from such code will depend on your platform and C compiler implementation. Your output makes me think you're running this code on a little-endian system (probably x86). If you were to put 515 into i and look at it in a debugger, you would see that the lowest-order byte would be a 3 and the next byte in memory would be a 2, which maps exactly to what you put in ch.

If you did this on a big-endian system, you would have (probably) gotten 770 (assuming 16-bit ints) or 50462720 (assuming 32-bit ints).

Partain answered 28/11, 2009 at 12:4 Comment(1)
How will you explain this:#include <stdio.h> int main(void) { union a{ int i; char ch[3]; }; union a u; u.ch[0] = 3; u.ch[1] = 2; u.ch[2] = 2; printf("%d %d %d\n",u.ch[0],u.ch[1],u.i); return 0; } ??Foetus
S
4

If you're on a 32-bit system, then an int is 4 bytes but you initialise only 2 bytes. Accessing uninitialised data is undefined behaviour.

Assuming you're on a system with 16-bit ints, then what you are doing is still implementation defined. If your system is little endian, then u.ch[0] will correspond to the least significant byte of u.i and u.ch[1] will be the most significant byte. On a big endian system, it's the other way around. Also, the C standard does not force the implementation to use two's complement to represent signed integer values, though two's complement is the most common. Obviously, the size of an integer is also implementation defined.

Hint: it's easier to see what's happening if you use hexadecimal values. On a little endian system, the result in hex would be 0x0203.

Sibley answered 28/11, 2009 at 12:25 Comment(0)
