What is wrong with this C function to find the endianness of a machine at runtime?
A

8

8

This is what I offered at an interview today.

int is_little_endian(void)
{
    union {
        long l;
        char c;
    } u;

    u.l = 1;

    return u.c == 1;
}

My interviewer insisted that c and l are not guaranteed to begin at the same address and therefore, the union should be changed to say char c[sizeof(long)] and the return value should be changed to u.c[0] == 1.
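For reference, here is a sketch of the function with the interviewer's two changes applied. This is my reading of the suggestion, not code the interviewer actually wrote:

```c
/* Sketch of the interviewer's variant: the char member becomes an
 * array that spans the whole long, and the first element is tested. */
int is_little_endian(void)
{
    union {
        long l;
        char c[sizeof(long)];
    } u;

    u.l = 1;            /* only the least significant byte is nonzero */

    return u.c[0] == 1; /* true when the lowest-addressed byte holds the 1 */
}
```

On implementations without padding bits, both versions read the same byte, the lowest-addressed byte of the union.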

Is it correct that members of a union might not begin at the same address?

Atal answered 20/8, 2009 at 2:45 Comment(0)
F
6

You are correct: the members of a union are guaranteed to begin at the same address. The relevant part of the Standard is (6.7.2.1 para 13):

The size of a union is sufficient to contain the largest of its members. The value of at most one of the members can be stored in a union object at any time. A pointer to a union object, suitably converted, points to each of its members (or if a member is a bit-field, then to the unit in which it resides), and vice versa.

Basically, the start address of the union is guaranteed to be the same as the start address of each of its members. I believe (still looking for the reference) that a long is guaranteed to be larger than a char. If you assume this, then your solution should* be valid.

* I'm still a little uncertain due to some interesting wording around the representation of integer and, in particular, signed integer types. Take a close read of 6.2.6.2 clauses 1 & 2.
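A minimal sketch of how that guarantee could be checked on a given compiler. The names union probe and members_share_address are hypothetical, chosen only for illustration; a conforming implementation should always return 1 here:

```c
#include <stddef.h>

/* Same layout as the union in the question, given a tag so that
 * offsetof can be applied to it. */
union probe {
    long l;
    char c;
};

/* Returns 1 if both members start at the address of the union itself. */
int members_share_address(void)
{
    union probe u;
    void *base = &u;

    return (void *)&u.l == base
        && (void *)&u.c == base
        && offsetof(union probe, l) == 0
        && offsetof(union probe, c) == 0;
}
```

This only confirms where the members start. It says nothing about which byte of the long ends up holding the 1, which is where the representation wording in 6.2.6.2 comes in.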

Frolick answered 20/8, 2009 at 3:41 Comment(0)
W
8

I was unsure about the members of the union, but SO came to the rescue.

The check can be better written as:

int is_bigendian(void) {
    const int i = 1;
    return (*(unsigned char*)&i) == 0;
}

Incidentally, the C FAQ shows both methods: "How can I determine whether a machine's byte order is big-endian or little-endian?"
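If the pointer cast looks suspicious, the same check can be sketched with memcpy, which copies the object representation into an unsigned char buffer. The name is_bigendian_memcpy is made up for this example, and like the version above it assumes the machine is plainly big-endian or little-endian:

```c
#include <string.h>

/* Hypothetical variant of is_bigendian() that inspects the bytes of
 * the integer through a copy instead of through a cast pointer. */
int is_bigendian_memcpy(void)
{
    const unsigned int i = 1;
    unsigned char bytes[sizeof i];

    memcpy(bytes, &i, sizeof i);  /* copy the object representation */

    return bytes[0] == 0;         /* lowest-addressed byte is 0 unless the 1 was stored there */
}
```

Both versions look at the lowest-addressed byte, so they should agree on any given machine.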

Withdrew answered 20/8, 2009 at 2:52 Comment(8)
I believe the hairy pointer casting is technically undefined behavior, but I couldn't cite anything, and it should certainly work on most machines. - Celesta
I'd be surprised if it were undefined; otherwise how would memcpy and most serialization code work? - Schmitt
@Chris I believe you have it reversed. Converting from a char * to int * can cause undefined behavior. I have a copy of the WG14/N1124 draft and if things haven't changed since then: "When a pointer to an object is converted to a pointer to a character type, the result points to the lowest addressed byte of the object." (p.47, open-std.org/JTC1/SC22/wg14/www/docs/n1124.pdf) - Fractionate
Okay. I don't have a copy (I'll get around to it one day) but I remembered hearing that the same trick from float to int in the Quake inverse square root function was undefined. I suppose converting between chars and ints is much more predictable, and thus defined. - Celesta
@Chris clarification: Converting from a char * to int * would be undefined behavior if the two have different alignment requirements. But converting from any pointer type to char * is safe. - Fractionate
@Chris: char is actually a special case in the standard, as a way of accessing the underlying representation of the other types. - Knobloch
@Chris: "Hairy pointer casts", aka raw memory reinterpretation, are generally UB, except if you reinterpret it as an array of characters. The latter is explicitly allowed in C. However, when char is used (as opposed to unsigned char) the set of things you can do with reinterpreted memory is limited. The above code is generally UB, since it is UB to read the value through such a char * pointer - the value might be a trap representation. The proper code should have used a cast to unsigned char*. - Giselle
@caf: That would be unsigned char, not char. - Giselle
S
3

While your code would probably work in many compilers the interviewer is right -- how to align fields in a union or struct is entirely up to the compiler and in this case the char could be placed either at the "beginning" or the "end". The interviewer's code leaves no room for doubt and is guaranteed to work.

Standpoint answered 20/8, 2009 at 2:54 Comment(0)
C
1

The standard says the offsets for each item in a union are implementation defined.

When a value is stored in a member of an object of union type, the bytes of the object representation that do not correspond to that member but do correspond to other members take unspecified values. (ISO/IEC 9899:1999, 6.2.6.1 Representations of types, General, para 7)

Therefore it's up to the compiler to choose where to put the char relative to the long within the union; they are not guaranteed to have the same address.

Cyclist answered 20/8, 2009 at 2:52 Comment(4)
There is one exception here. A little further down (6.7.2.1 para 13): "The size of a union is sufficient to contain the largest of its members. The value of at most one of the members can be stored in a union object at any time. A pointer to a union object, suitably converted, points to each of its members (or if a member is a bit-field, then to the unit in which it resides), and vice versa." Basically, the start address of the union is guaranteed to be the same as the start address of each of its members. - Frolick
Good point, I'll cease meddling with fbrereton's question. I am confused now though, because if you're right, then the code in the question should work. - Gosh
The OP's code is fine: See https://mcmap.net/q/475948/-union-element-alignment - Fractionate
I'm pretty sure that it will work and is guaranteed to do so. See my answer... I was sorta surprised by this one. - Frolick
C
0

I have a question about this...

how is

u.c[0] == anything

valid given:

union {
    long l;
    char c;
} u;

How does [0] work on a char?

Seems to me, it would be equivalent to: *(u.c + 0) == anything, which would be, well, crap, considering the value of u.c, treated as a pointer, would be crap.

(Unless perhaps, as it occurs to me now, some html crap code ate an ampersand in the original question...)

Cartagena answered 20/8, 2009 at 3:6 Comment(5)
The interviewer said that char c; should be char c[sizeof(long)];, thus u.c[0] would be valid. - Celesta
Ah, ok, that makes sense. Jesus, interviews suck. - Cartagena
I would have done it: int x = 0x01020304; unsigned char *x = (char *) &x; return x[0] == 0x01; - Cartagena
And I would have been dinged for not using uint32_t, and the wrong cast. LOL. (Have had a beer or two since getting off work.) - Cartagena
Not to mention two variables called 'x'. Cripes. - Cartagena
G
0

While the interviewer is correct and this is not guaranteed to work by the spec, none of the other answers are guaranteed to work either, as dereferencing a pointer after casting it to another type yields undefined behavior.

In practice, this (and the other answers) will always work, as all compilers allow casting between pointer-to-union and pointer-to-member-of-union transparently -- much ancient code would fail to work if they did not.

Geometrize answered 20/8, 2009 at 3:9 Comment(1)
Neither clang nor gcc will reliably handle any accesses to non-character-type union members which involve taking the address and dereferencing them, unless the access takes the form of an array-element access using bracketed subscript notation. Even a statement like *(myUnion.intArray+i) = 23; will not be recognized as potentially affecting the value of *(myUnion.floatArray+j). - Elmiraelmo
C
0

Correct me if I am wrong, but local variables are not initialized to 0.

Is this not better:

union {
    long l;
    char c;
} u={0,};
Constitutionalism answered 21/10, 2009 at 16:45 Comment(0)
E
0

A point not yet mentioned is that the standard explicitly allows for the possibility that integer representations may contain padding bits. Personally I wish the standards committee would allow a nice easy way for a program to specify certain expected behaviors, and require that any compiler must either honor such specifications or refuse compilation; code which started with an "integers must not have padding bits" specification would then be entitled to assume that to be the case.

As it is, it would be perfectly legitimate (albeit odd) for an implementation to store 35-bit long values as four 9-bit characters in big-endian format, but use the LSB of the first byte as a parity bit. Under such an implementation, storing 1 into a long could cause the parity of the overall word to become odd, thus compelling the implementation to store a 1 into the parity bit.

To be sure, such behavior would be odd, but if architectures that use padding are sufficiently notable to justify explicit provisions in the standard, code which would break on such architectures can't really be considered truly "portable".

The code using union should work correctly on all architectures which can be simply described as "big-endian" or "little-endian" and do not use padding bits. It would be meaningless on some other architectures (and indeed the terms "big-endian" and "little-endian" could be meaningless too).
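As a rough sketch of how a program could detect the padding-bit case itself, one can count the value bits in ULONG_MAX and compare that with the storage width. The function names are hypothetical, and the check is for unsigned long, on the assumption that it is a reasonable proxy for long:

```c
#include <limits.h>

/* Number of value bits in unsigned long, counted from ULONG_MAX. */
int ulong_value_bits(void)
{
    unsigned long max = ULONG_MAX;
    int bits = 0;

    while (max != 0) {
        bits++;
        max >>= 1;
    }
    return bits;
}

/* Returns 1 if unsigned long occupies more bits of storage than it
 * has value bits, i.e. the representation contains padding bits. */
int ulong_has_padding(void)
{
    return ulong_value_bits() < (int)(sizeof(unsigned long) * CHAR_BIT);
}
```

A union-based endianness test is only meaningful when ulong_has_padding() returns 0. On the hypothetical 36-bit machine above it would return 1.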

Elmiraelmo answered 10/3, 2015 at 17:27 Comment(0)
