Union – useless anachronism or useful old school trick?

12

21

I recently came across a great data structures book, "Data Structures Using C" (c) 1991, at a local library book sale for only $2. As the title implies, the book covers data structures using the C programming language.

I got the book knowing it would be out-dated but would probably contain lots of advanced C topics that I wouldn't encounter elsewhere.

Sure enough within 5 minutes I found something I didn't know about C. I came across a section talking about the union keyword and I realized that I had never used it, nor ever seen any code that does. I was grateful for learning something interesting and quickly bought the book.

For those of you not knowledgeable about what a union is, the book uses a good metaphor to explain:

To fully understand the concept of a union, it is necessary to examine its implementation. A Structure may be regarded as a road map to an area of memory. It defines how the memory is to be interpreted. A union provides several different road maps for the same area of memory, and it is the responsibility of the programmer to determine which road map is in current use. In practice, the compiler allocates sufficient storage to contain the largest member of the union. It is the road map, however, that determines how that storage is to be interpreted.
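To make the road-map idea concrete, here is a minimal sketch in C. The names `union value`, `largest_member_size` and `members_share_address` are invented for illustration and do not come from the book. It shows two properties the quote describes: the union is at least as large as its largest member, and every member starts at the same address.

```c
#include <stddef.h>

/* Three "road maps" for the same area of memory. */
union value {
    char   text[16];
    int    number;
    double real;
};

/* Size of the largest member, for comparison with sizeof(union value). */
size_t largest_member_size(void)
{
    size_t n = sizeof(char[16]);
    if (sizeof(int) > n)    n = sizeof(int);
    if (sizeof(double) > n) n = sizeof(double);
    return n;
}

/* Returns 1 if all members begin at the same address, which C guarantees. */
int members_share_address(void)
{
    union value v;
    return (void *)&v.text == (void *)&v.number
        && (void *)&v.number == (void *)&v.real;
}
```

`sizeof(union value)` may be larger than the largest member because of alignment padding, so the comparison to make is "greater than or equal", not "equal".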

I could easily come up with contrived situations or hacks where I would use a union, but I am not interested in those.

Have you used or seen an implementation where using a union solved the problem **more elegantly** than not using one?

Added bonus if you include a quick explanation of why using a union was better or easier than not using one.

Landgraviate answered 13/5, 2009 at 13:47 Comment(5)
I feel like C++ has many tools available that would make unions obsolete...Landgraviate
@Trevor And if you're writing C, those tools aren't available to you.Velours
@Adam, 'if you're writing C' I agree you won't have those tools. But technically Unions are valid in C++ and that is why the question is tagged with C++. Even though in C++ the extra tools/language-features available would IMO make Unions obsolete.Landgraviate
But you also tagged it with C. So you need to look at what you're doing, C or C++. If you're using C, they aren't obsolete since you don't have those features. If you're using C++, they may be.Velours
@Adam - they're not obsolete, but they do need to be used with a great deal of care. The MISRA-C 2004 standard for C used in safety-critical systems requires that unions are not used (due to the compiler-dependent implementation).Tinkle
26

Unions implement a sort of polymorphism in a non-OOP world. Usually you have a common part and, depending on its value, you use one of the union's remaining members. So where you do not have an OOP language and want to avoid excessive pointer arithmetic, unions can be the more elegant choice.
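A minimal sketch of that pattern, with every name (`event_type`, `key_event`, `mouse_event`, `describe_event`) made up for illustration: each struct in the union begins with the same `type` member, and the code inspects that common part to decide which of the other members to read.

```c
#include <stddef.h>
#include <stdio.h>

enum event_type { EVENT_KEY, EVENT_MOUSE };

/* Every event struct starts with the same member: the common part. */
struct common_event { enum event_type type; };
struct key_event    { enum event_type type; int key_code; };
struct mouse_event  { enum event_type type; int x; int y; };

union event {
    struct common_event common;
    struct key_event    key;
    struct mouse_event  mouse;
};

/* Formats the event into out; returns the snprintf result, or -1 if unknown. */
int describe_event(const union event *ev, char *out, size_t out_size)
{
    switch (ev->common.type) {
    case EVENT_KEY:
        return snprintf(out, out_size, "key %d", ev->key.key_code);
    case EVENT_MOUSE:
        return snprintf(out, out_size, "mouse %d,%d", ev->mouse.x, ev->mouse.y);
    }
    return -1;
}
```

Reading `common.type` after writing `mouse.type` relies on C's rule that structs in a union may be inspected through their common initial sequence, which is why every struct repeats the `type` member first.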

Starobin answered 13/5, 2009 at 13:51 Comment(2)
@Bastien Leonard, +1 for giving a link to a really good example code ( that also has a good explanation )!Landgraviate
@HappyYellowFace This should be the new URL: wiki.libsdl.org/…Servant
19

It's useful for setting bits in, say, registers instead of shift/mask operations:

typedef union {
    unsigned int as_int; // Assume this is 32-bits
    struct {
        unsigned int unused1 : 4;
        unsigned int foo : 4;
        unsigned int bar : 6;
        unsigned int unused2 : 2;
        unsigned int baz : 3;
        unsigned int unused3 : 1;
        unsigned int quux : 12;
    } field;
} some_reg;

Note: Which way the packing happens is machine-dependent.

some_reg reg;
reg.field.foo = 0xA;
reg.field.baz = 0x5;
write_some_register(some_address, reg.as_int);

I might have blown some syntax somewhere in there, my C is rusty :)
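For comparison, the shift/mask version that the union replaces might look roughly like this. It is only a sketch: the helper names and the `FOO_*`/`BAZ_*` macros are hypothetical, and the bit positions (foo in bits 4 to 7, baz in bits 16 to 18) match the bit-field struct only if the compiler allocates bit-fields starting from the least significant bit, which is implementation-defined.

```c
#include <stdint.h>

/* Assumed positions: foo = bits 4..7, baz = bits 16..18 (LSB-first layout). */
#define FOO_SHIFT 4u
#define FOO_MASK  (0xFu << FOO_SHIFT)
#define BAZ_SHIFT 16u
#define BAZ_MASK  (0x7u << BAZ_SHIFT)

/* Replace the field selected by mask/shift with value; other bits are kept. */
uint32_t set_field(uint32_t reg, uint32_t mask, unsigned shift, uint32_t value)
{
    return (reg & ~mask) | ((value << shift) & mask);
}

/* Extract the field selected by mask/shift. */
uint32_t get_field(uint32_t reg, uint32_t mask, unsigned shift)
{
    return (reg & mask) >> shift;
}
```

Calling `set_field(reg, FOO_MASK, FOO_SHIFT, 0xA)` does the job of `reg.field.foo = 0xA`. It is wordier, but the bit positions are spelled out in the code instead of being left to the compiler.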

EDIT:

Incidentally, this works the opposite way also:

reg.as_int = read_some_register(some_address);
if(reg.field.bar == BAR_ERROR1) { ...
Alliber answered 13/5, 2009 at 14:4 Comment(6)
@scottfrazer, If your code is valid... that is a REALLY useful method for adjusting bit fields in a network packet! Not only does it define the packet structure 100% and make the code more clear... but it makes the implementation code easier to read because you are just setting the fields rather than doing lots of ( mask n bits, shift M bits, or n bits back in with the uint32_t... etc etc ).Landgraviate
Isn't it undefined behavior to set one field in a union and read from another?Meza
@Kristo: no, because by definition the UNION maps the different data layout to the same memory location. Therefore, the UNION allows you to access the same memory location in different ways.Starobin
@Kristo, no, it's implementation-defined behavior, meaning it isn't portable between implementations. But hardware registers aren't usually very portable either. With network packets it is a useful trick, but carefully create unit tests to document the assumptions you are making about the implementation so you don't get burned.Squarerigger
@trevor: I'd argue that it doesn't define the packet structure 100%, because it's up to the compiler which order the bits are stored in, and how they're packed.Disorient
Actually, Kristo is right. It is undefined to read any member of a union but the last one written. And if there has been no write yet, it's UB to read any of them. It's neither unspecified ("undocumented") nor implementation-defined ("documented") behavior; crashes are definitely allowed. This might be especially apparent when the CPU has different registers for int and float variables and a union contains both. E.g. "u.f = 1.0+1.2; printf("%d", u.i);" might not cause a writeback of the float register to memory before the printf.Strobe
10

Indeed, it's a great tool when you write things like device drivers (a struct that you want to send to device that can have several similar but different formats) and you require precise memory arrangement...

Faint answered 13/5, 2009 at 13:50 Comment(0)
8

You should be aware that in C++ (at least before C++11 relaxed the rules) they are not such a great solution, as only POD (plain old data) types can be placed in a union. If your class has a constructor or destructor, or contains classes that have constructors and/or destructors (and about a million other gotchas), it cannot be a member of a union.

Kaitlynkaitlynn answered 13/5, 2009 at 14:0 Comment(4)
@Neil, "C++ they are not such a great solution, as only POD (plain old data)" ... yup you are correct. And that is exactly why I was VERY VERY tempted to tag this as a C only question... but I didn't because technically... unions are valid in C++. But C++ has more tools available which IMO would make using Unions in C++ obsolete.Landgraviate
Don't forget that there are features of C++ that are there at least in part to support interoperability with C. Unions are likely in that category.Squarerigger
Not to mention that in C++ there are things like boost::variant that handle similar functionality to a union, but with type safety.Becky
As mentioned elsewhere, unions are really good for hardware access, so that's one more reason to keep them in C++.Leialeibman
6

Union is the simplest way to implement VARIANT-like data types in C/C++, I suppose.
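A rough sketch of such a VARIANT-like type in C follows. All names here (`variant`, `variant_kind`, `variant_from_int`, `variant_format`) are invented for the example and are not taken from any real VARIANT API. A tag records which union member is currently valid.

```c
#include <stddef.h>
#include <stdio.h>

enum variant_kind { VAR_INT, VAR_DOUBLE, VAR_STRING };

struct variant {
    enum variant_kind kind;   /* says which member of `as` is valid */
    union {
        int         i;
        double      d;
        const char *s;        /* borrowed pointer, not owned by the variant */
    } as;
};

struct variant variant_from_int(int i)
{
    struct variant v;
    v.kind = VAR_INT;
    v.as.i = i;
    return v;
}

struct variant variant_from_double(double d)
{
    struct variant v;
    v.kind = VAR_DOUBLE;
    v.as.d = d;
    return v;
}

/* Formats the held value into out; returns the snprintf result, or -1. */
int variant_format(const struct variant *v, char *out, size_t out_size)
{
    switch (v->kind) {
    case VAR_INT:    return snprintf(out, out_size, "%d", v->as.i);
    case VAR_DOUBLE: return snprintf(out, out_size, "%g", v->as.d);
    case VAR_STRING: return snprintf(out, out_size, "%s", v->as.s);
    }
    return -1;
}
```

Nothing stops a caller from reading `as.d` while `kind` says `VAR_INT`. Going through constructor and accessor functions like these is one way to keep the tag and the payload in step.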

Zedoary answered 13/5, 2009 at 13:51 Comment(1)
Uhhh...I've seen that done. It's...unfortunate. In C++ there are better ways. Frankly, I think that in C you're better off with an anonymous blob type (void *) and functions to query it and convert to whatever you need, than trying to use a union with lots of types.Leialeibman
5

It's often used in the specification of data transmission protocols, where you'd want to avoid wasting space in your data structures. It allows memory space to be saved by using the same space for multiple mutually exclusive options.

For example:

enum PacketType {Connect, Disconnect};
struct ConnectPacket {};
struct DisconnectPacket {};
struct Packet
{
    // ...
    // various common data
    // ...
    enum PacketType type;
    union
    {
        struct ConnectPacket connect;
        struct DisconnectPacket disconnect;
    } payload;
};

The ConnectPacket and DisconnectPacket structures occupy the same space, but that's ok because a packet can't be both types at the same time. The enum value is used to determine which part of the union is in use. Using the union has allowed us to avoid duplicating the common parts of the Packet structure.

Speaks answered 13/5, 2009 at 14:10 Comment(1)
Yes it's handy for bit twiddling too, but there are more potential benefits. See the example I just added.Speaks
4

Consider the case of accessing individual bytes within a large variable:

UInt32 x;
x = 0x12345678;
int byte_3 = x & 0x000000FF;          // 0x78
int byte_2 = (x & 0x0000FF00) >> 8;   // 0x56
int byte_1 = (x & 0x00FF0000) >> 16;  // 0x34
int byte_0 = (x & 0xFF000000) >> 24;  // 0x12

This can be far more elegant with a union:

typedef union
{
    UInt32 value;  // 32 bits
    Byte byte[4];  // 4 * 8 bits
}
UInt32_Bytes;

UInt32_Bytes x;
x.value = 0x12345678;
// Values shown are for a big-endian platform; on little-endian the order is reversed.
int byte_3 = x.byte[3];  // 0x78
int byte_2 = x.byte[2];  // 0x56
int byte_1 = x.byte[1];  // 0x34
int byte_0 = x.byte[0];  // 0x12

The use of a union means you no longer have to use bit masks and shift operators in order to access the individual bytes. It also makes the byte access explicit.
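Because the index that holds each byte depends on the platform's byte order, the same union can be used to check that order at run time. This is a sketch only: `uint32_t` and `unsigned char` stand in for `UInt32` and `Byte`, the function names are made up, and it assumes 8-bit bytes and a platform that is either plain big-endian or plain little-endian.

```c
#include <stdint.h>

typedef union {
    uint32_t      value;
    unsigned char byte[4];
} u32_bytes;

/* Returns 1 if the least significant byte is stored first (little-endian). */
int is_little_endian(void)
{
    u32_bytes x;
    x.value = 0x12345678u;
    return x.byte[0] == 0x78;
}

/* Most significant byte of v, whichever way the platform stores it. */
unsigned char most_significant_byte(uint32_t v)
{
    u32_bytes x;
    x.value = v;
    return is_little_endian() ? x.byte[3] : x.byte[0];
}
```

With this, `most_significant_byte(0x12345678)` yields 0x12 on both kinds of platform, which is what the shift/mask version gives without any check.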

Vittle answered 13/5, 2009 at 13:59 Comment(8)
But it introduces endianness into the code. So your two pieces of code are not equivalent.My
It also results in platform-dependent behaviour: in your first code snippet the value of byte_0 is always 0x12. In the second snippet, the value of byte_0 is 0x12 if the platform is big-endian, and 0x78 if it is little-endian.Abruption
Actually, scratch that, because byte[] is declared of type int. If you meant char, then what I said. If you meant int, then the values of byte_n are all unspecified for n > 0. Assuming 32bit ints, of course. And 8-bit bytes.Abruption
@n0rd: I'm pretty sure that the first snippet also introduces endianness because of the shift operators. I see your point about equivalence, though. I've only ever used this on embedded hardware, where I was 100% sure of the target memory architecture.Vittle
@onebyone.livejournal.com: Good point. I'll switch it to 'Byte' in the hopes of making it clearer. The endianness is still a valid issue.Vittle
@eJames: endianness does not affect the bit order within a variable -- the shifts will work equally well on either big or little endian. You only run into problems when you're accessing the memory at a granularity less than the size of the data type (i.e. access bytes of an int).Willow
@eJames, "I've only ever used this on embedded hardware, where I was 100% sure of the target memory architecture." Ya I have used that sort of mask/shift/or/etc trick a ton in my embedded coding... we even have a macro for it to toggle a single bit in an array of chars... lol.Landgraviate
Paul R's answer (#2753489) for another question is perhaps a better example, one that isn't affected by endianness.Marzipan
4

It's quite a good way to get the IEEE bit values of a float (assuming of course that floats are IEEE on your system). Anything which involves casting float* to int* risks tripping over the strict aliasing rules. This isn't just theoretical - high levels of optimisation actually will break your code.

Technically, union does not deal with the problem. In practice, all known compilers will (a) allow you to write one member of a union and read back another, and (b) perform the read after performing the write. GCC at least is capable of rolling the union into a register, turning the whole thing into a no-op (assuming floats are stored in registers to begin with).
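A small sketch of the trick, assuming a 32-bit IEEE 754 `float`. The function names are made up, and the `memcpy` variant is included only as the usual alternative that sidesteps the aliasing question entirely. As far as I know, C99 (since TC3) and C11 describe reading a different union member as reinterpreting the stored bytes, while C++ gives no such guarantee.

```c
#include <stdint.h>
#include <string.h>

/* C11 compile-time check of the size assumption stated above. */
_Static_assert(sizeof(float) == sizeof(uint32_t),
               "this sketch assumes a 32-bit float");

/* Union version: write the float, read the same bytes back as an integer. */
uint32_t float_bits_union(float f)
{
    union { float f; uint32_t u; } pun;
    pun.f = f;
    return pun.u;
}

/* memcpy version: copies the object representation without any aliasing. */
uint32_t float_bits_memcpy(float f)
{
    uint32_t u;
    memcpy(&u, &f, sizeof u);
    return u;
}
```

On an IEEE 754 system, `float_bits_union(1.0f)` gives 0x3F800000: sign 0, biased exponent 127, mantissa 0.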

Abruption answered 13/5, 2009 at 14:23 Comment(0)
2

We've used unions in lots of code for network packet parsing.

Union allocates the size of the biggest element. You would create a union with a buffer element of maximum message size, then you can easily access the values in the packet.

Imagine that the data "c123456" arrived over the network and you need to parse and access the values:

  #include <cstring>   // memcpy
  #include <iostream>
  #include <string>
  using namespace std;

  struct msg
  {
     char header;
     union
     {
       char a[3];
       char b[2];
       char c[5];
       char d[6];
       char buf[10];
     } data;
  } msg;

  int main()
  {
    struct msg m;
    memcpy(&m, "c123456", sizeof("c123456"));

    cout << "m.header: " << m.header << endl;
    cout << "m.data.d: " << string(m.data.d,sizeof(m.data.d)) << endl;
    cout << "m.data.b: " << string(m.data.b,sizeof(m.data.b)) << endl;

    switch (m.header)
    {
     case 'a': cout << "a: " << string(m.data.a, sizeof(m.data.a)) << endl; break;
     case 'b': cout << "b: " << string(m.data.b, sizeof(m.data.b)) << endl; break;
     case 'c': cout << "c: " << string(m.data.c, sizeof(m.data.c)) << endl; break;
     default: break;
    }
  }

The output would look like this:

m.header: c
m.data.d: 123456
m.data.b: 12
c: 12345
Foreland answered 13/5, 2009 at 15:16 Comment(1)
I would point out that one also needs to look at the structure alignment issues. For the technique to work, you need to be sure there is no padding between fields, to make them align on native boundaries.Berman
2

I know this has been repeated, but I will just post a code sample to show how unions add elegance and efficiency when reading network traffic:

#include <cstdint>
#include <unistd.h>   // POSIX read()

#pragma pack(push, 1)  // no padding between fields
struct header_t {
   uint16_t msg_id;
   uint16_t size;
};
struct command_t {
   uint8_t cmd;
};
struct position_t {
   uint32_t x;
   uint32_t y;
   uint32_t z;
};
// ... Rest of the messages in an IDS
struct message {
   header_t header;
   union {
      command_t command;
      position_t position;
   } body;
};
#pragma pack(pop)
message read_message( int socket ) {
   message data;
   ssize_t bytes_read = read( socket, &data, sizeof(header_t) );
   // error checks... fewer bytes read than the header size,
   // data.header.size larger than sizeof(data.body), and such
   bytes_read = read( socket, &(data.body), data.header.size );
   // error checks...
   return data;
}

In the snippet above you can perform the message read in place, and you do not need to care about the concrete type of object received. If you did not use the union, you would be left with reading the header, extracting both the size and the type, instantiating an object of the appropriate type (either in a hierarchy or inside a variant type such as boost::any/boost::variant), and performing the second read into the newly created space.

We use this solution extensively to control simulators (some companies do not appreciate 'new' technologies like DDS or HLA and still depend on raw UDP/TCP data for their simulators). In the network layer we use unions that are transformed into internal data structures (network-to-host conversion, data scaling...) before feeding it into the application layers. As it was mentioned before, you must be careful with the padding at all times.
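One way to stay careful about padding is to let the compiler check the layout. The sketch below uses hypothetical `wire_*` names that mirror the structures above, and it assumes a C11 compiler (for `_Static_assert`) that also understands `#pragma pack(push, 1)`, as GCC, Clang and MSVC do.

```c
#include <stddef.h>
#include <stdint.h>

#pragma pack(push, 1)
struct wire_header {
    uint16_t msg_id;
    uint16_t size;
};
struct wire_position {
    uint32_t x;
    uint32_t y;
    uint32_t z;
};
struct wire_message {
    struct wire_header header;
    union {
        uint8_t              command;
        struct wire_position position;
    } body;
};
#pragma pack(pop)

/* Compilation fails if the layout differs from the wire format. */
_Static_assert(sizeof(struct wire_header) == 4, "header must be 4 bytes");
_Static_assert(offsetof(struct wire_message, body) == 4,
               "body must follow the header with no padding");
_Static_assert(sizeof(struct wire_message) == 16,
               "unexpected padding in message");

/* Offset of the union inside the message, for run-time logging or tests. */
size_t wire_body_offset(void)
{
    return offsetof(struct wire_message, body);
}
```

These checks cover size and padding only. Byte order still has to be handled separately, for example with the network-to-host conversion mentioned above.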

Cynar answered 2/9, 2009 at 19:11 Comment(0)
1

I used it once for a rough kind of data polymorphism in a similar way to markh44's answer. I had several different kinds of data that I wanted potentially to use. I created a union of all of those types and a struct that contained the union and a code defining which type was to be used.


typedef union
{
    data_type_1 value_1;
    data_type_2 value_2;
    data_type_3 value_3;
} data_union;

typedef struct _TAG_DATA_WRAPPED_
{
    data_union data;
    int data_type; //better an enum
} WRAPPED_DATA;

WRAPPED_DATA loads_of_data[1024];


To answer your question about why this is advantageous:

What this allows you to do is easily allocate lists or arrays of different sorts of data and programmatically manage their type. The big issue is of course storage space: every element is as large as the largest type, so if the types have very different sizes you can waste a lot of space.

Ingeringersoll answered 13/5, 2009 at 16:37 Comment(0)
0

I think this one is a good example:

        struct fieldsv4{
            unsigned int ip4 : 8;
            unsigned int ip3 : 8;
            unsigned int ip2 : 8;
            unsigned int ip1 : 8;
        };
        typedef union {
            unsigned int ip32; // Assume this is 32-bits
            struct fieldsv4 part;
        } ipv4;

        ipv4 dir1;
        struct fieldsv4 f1 = {1, 1, 168, 192}; // order assumes a little-endian bit-field layout; do not invert for big-endian
        dir1.part = f1;
        ipv4 dir2= dir1;
        dir2.part.ip4 = 2;
        printf("%d.%d.%d.%d\n", dir2.part.ip1, dir2.part.ip2, dir2.part.ip3, dir2.part.ip4);
        printf("%d.%d.%d.%d\n", dir1.part.ip1, dir1.part.ip2, dir1.part.ip3, dir1.part.ip4);
        printf("%X\n", dir1.ip32 ^ dir2.ip32);
Googly answered 14/2, 2020 at 17:3 Comment(5)
Won't the output of that code be different on different platforms? If so, how is it useful?Rumsey
With your IP as a simple unsigned char [4] it would be more readable – you wouldn't have that .part everywhere. An IP would never be used as one big integer anyway (it adds endianness as a potential issue).Schurman
It is true, the big/little-endian problem is a serious portability issue.Resoluble
It is an honor that David Schwartz responded to this. I made one million dollars thanks to an investment in your enterprise (5 years ago). Your enterprise is successful thanks to you and others like you. I am very thankful for your good work.Resoluble
With char[4] you cannot use it as a single variable: arrays are not assignable, so you cannot do this: struct fieldsv4 f1 = {1, 1, 168, 192}; dir1.part = f1;Resoluble
