How can mixed data types (int, float, char, etc) be stored in an array?
Asked Answered
B

6

154

I want to store mixed data types in an array. How could one do that?

Bidding answered 2/9, 2013 at 16:26 Comment(1)
It's possible and there are use cases, but this is likely a flawed design. That's not what arrays are for.Beset
T
255

You can make the array elements a discriminated union, aka tagged union.

struct {
    enum { is_int, is_float, is_char } type;
    union {
        int ival;
        float fval;
        char cval;
    } val;
} my_array[10];

The type member is used to hold the choice of which member of the union is should be used for each array element. So if you want to store an int in the first element, you would do:

my_array[0].type = is_int;
my_array[0].val.ival = 3;

When you want to access an element of the array, you must first check the type, then use the corresponding member of the union. A switch statement is useful:

switch (my_array[n].type) {
case is_int:
    // Do stuff for integer, using my_array[n].ival
    break;
case is_float:
    // Do stuff for float, using my_array[n].fval
    break;
case is_char:
    // Do stuff for char, using my_array[n].cvar
    break;
default:
    // Report an error, this shouldn't happen
}

It's left up to the programmer to ensure that the type member always corresponds to the last value stored in the union.

Thallophyte answered 2/9, 2013 at 16:31 Comment(7)
+1 This is the implentation of many interpreting languages written in CDareece
@texasbruce also called a "tagged union". I'm using this technique too in my own language. ;)Semivitreous
Wikipedia uses a disambiguation page for "discriminated union" - "disjoint union" in set theory and, as @H2CO3 mentioned, "tagged union" in computer science.Jem
And the first line of the Wikipedia Tagged union page says: In computer science, a tagged union, also called a variant, variant record, discriminated union, disjoint union, or sum type, ... It's been reinvented so many times it has many names (kind of like dictionaries, hashes, associative arrays, etc.).Thallophyte
@Thallophyte I've re-written it as "tagged union" but then read your comment. Rolled back the edit, I didn't mean to vandalize your answer.Semivitreous
+1 When I first found this question, I became excited because I had used tagged unions before to store "objects" of multiple data types in an array. Then I saw your answer... The only thing I would add is that having a switch on the enum type is useful inside an overloaded operator<< if you want to print out an object of union type, and you don't necessarily want main() to know what data type is inside the union. Like when you want to do a "dumb" loop that goes through all the elements in an array and prints them.Incidence
@Incidence This is C, not C++. Modern C++ has better solutions like std::variant.Thallophyte
S
33

Use a union:

union {
    int ival;
    float fval;
    void *pval;
} array[10];

You will have to keep track of the type of each element, though.

Semivitreous answered 2/9, 2013 at 16:28 Comment(0)
B
21

Array elements need to have the same size, that is why it's not possible. You could work around it by creating a variant type:

#include <stdio.h>
#define SIZE 3

typedef enum __VarType {
  V_INT,
  V_CHAR,
  V_FLOAT,
} VarType;

typedef struct __Var {
  VarType type;
  union {
    int i;
    char c;
    float f;
  };
} Var;

void var_init_int(Var *v, int i) {
  v->type = V_INT;
  v->i = i;
}

void var_init_char(Var *v, char c) {
  v->type = V_CHAR;
  v->c = c;
}

void var_init_float(Var *v, float f) {
  v->type = V_FLOAT;
  v->f = f;
}

int main(int argc, char **argv) {

  Var v[SIZE];
  int i;

  var_init_int(&v[0], 10);
  var_init_char(&v[1], 'C');
  var_init_float(&v[2], 3.14);

  for( i = 0 ; i < SIZE ; i++ ) {
    switch( v[i].type ) {
      case V_INT  : printf("INT   %d\n", v[i].i); break;
      case V_CHAR : printf("CHAR  %c\n", v[i].c); break;
      case V_FLOAT: printf("FLOAT %f\n", v[i].f); break;
    }
  }

  return 0;
}

The size of the element of the union is the size of the largest element, 4.

Beefcake answered 2/9, 2013 at 16:47 Comment(0)
T
8

There's a different style of defining the tag-union (by whatever name) that IMO make it much nicer to use, by removing the internal union. This is the style used in the X Window System for things like Events.

The example in Barmar's answer gives the name val to the internal union. The example in Sp.'s answer uses an anonymous union to avoid having to specify the .val. every time you access the variant record. Unfortunately "anonymous" internal structs and unions is not available in C89 or C99. It's a compiler extension, and therefore inherently non-portable.

A better way IMO is to invert the whole definition. Make each data type its own struct, and put the tag (type specifier) into each struct.

typedef struct {
    int tag;
    int val;
} integer;

typedef struct {
    int tag;
    float val;
} real;

Then you wrap these in a top-level union.

typedef union {
    int tag;
    integer int_;
    real real_;
} record;

enum types { INVALID, INT, REAL };

Now it may appear that we're repeating ourselves, and we are. But consider that this definition is likely to be isolated to a single file. But we've eliminated the noise of specifiying the intermediate .val. before you get to the data.

record i;
i.tag = INT;
i.int_.val = 12;

record r;
r.tag = REAL;
r.real_.val = 57.0;

Instead, it goes at the end, where it's less obnoxious. :D

Another thing this allows is a form of inheritance. Edit: this part is not standard C, but uses a GNU extension.

if (r.tag == INT) {
    integer x = r;
    x.val = 36;
} else if (r.tag == REAL) {
    real x = r;
    x.val = 25.0;
}

integer g = { INT, 100 };
record rg = g;

Up-casting and down-casting.


Edit: One gotcha to be aware of is if you're constructing one of these with C99 designated initializers. All member initializers should be through the same union member.

record problem = { .tag = INT, .int_.val = 3 };

problem.tag; // may not be initialized

The .tag initializer can be ignored by an optimizing compiler, because the .int_ initializer that follows aliases the same data area. Even though we know the layout (!), and it should be ok. No, it ain't. Use the "internal" tag instead (it overlays the outer tag, just like we want, but doesn't confuse the compiler).

record not_a_problem = { .int_.tag = INT, .int_.val = 3 };

not_a_problem.tag; // == INT
Tortoni answered 3/9, 2013 at 6:24 Comment(1)
.int_.val doesn't alias the same area though because the compiler knows that .val is at a greater offset than .tag . Have you got a link to further discussion about this alleged problem?Mosera
S
5

You can do a void * array, with a separated array of size_t. But you lose the information type.
If you need to keep information type in some way keep a third array of int (where the int is an enumerated value) Then code the function that casts depending on the enum value.

Saucedo answered 2/9, 2013 at 16:28 Comment(1)
you can also store the type information in the pointer itselfKries
K
4

Union is the standard way to go. But you have other solutions as well. One of those is tagged pointer, which involves storing more information in the "free" bits of a pointer.

Depending on architectures you can use the low or high bits, but the safest and most portable way is using the unused low bits by taking the advantage of aligned memory. For example in 32-bit and 64-bit systems, pointers to int must be multiples of 4 (assuming int is a 32-bit type) and the 2 least significant bits must be 0, hence you can use them to store the type of your values. Of course you need to clear the tag bits before dereferencing the pointer. For example if your data type is limited to 4 different types then you can use it like below

void* tp; // tagged pointer
enum { is_int, is_double, is_char_p, is_char } type;
// ...
uintptr_t addr = (uintptr_t)tp & ~0x03; // clear the 2 low bits in the pointer
switch ((uintptr_t)tp & 0x03)           // check the tag (2 low bits) for the type
{
case is_int:    // data is int
    printf("%d\n", *((int*)addr));
    break;
case is_double: // data is double
    printf("%f\n", *((double*)addr));
    break;
case is_char_p: // data is char*
    printf("%s\n", (char*)addr);
    break;
case is_char:   // data is char
    printf("%c\n", *((char*)addr));
    break;
}

If you can make sure that the data is 8-byte aligned (like for pointers in 64-bit systems, or long long and uint64_t...), you'll have one more bit for the tag.

This has one disadvantage that you'll need more memory if the data have not been stored in a variable elsewhere. Therefore in case the type and range of your data is limited, you can store the values directly in the pointer. This technique has been used in the 32-bit version of Chrome's V8 engine, where it checks the least significant bit of the address to see if that's a pointer to another object (like double, big integers, string or some object) or a 31-bit signed value (called smi - small integer). If it's an int, Chrome simply does an arithmetic right shift 1 bit to get the value, otherwise the pointer is dereferenced.


On most current 64-bit systems the virtual address space is still much narrower than 64 bits, hence the high most significant bits can also be used as tags. Depending on the architecture you have different ways to use those as tags. ARM, 68k and many others can be configured to ignore the top bits, allowing you to use them freely without worrying about segfault or anything. From the linked Wikipedia article above:

A significant example of the use of tagged pointers is the Objective-C runtime on iOS 7 on ARM64, notably used on the iPhone 5S. In iOS 7, virtual addresses are 33 bits (byte-aligned), so word-aligned addresses only use 30 bits (3 least significant bits are 0), leaving 34 bits for tags. Objective-C class pointers are word-aligned, and the tag fields are used for many purposes, such as storing a reference count and whether the object has a destructor.

Early versions of MacOS used tagged addresses called Handles to store references to data objects. The high bits of the address indicated whether the data object was locked, purgeable, and/or originated from a resource file, respectively. This caused compatibility problems when MacOS addressing advanced from 24 bits to 32 bits in System 7.

https://en.wikipedia.org/wiki/Tagged_pointer#Examples

On x86_64 you can still use the high bits as tags with care. Of course you don't need to use all those 16 bits and can leave out some bits for future proof

In prior versions of Mozilla Firefox they also use small integer optimizations like V8, with the 3 low bits used to store the type (int, string, object... etc.). But since JägerMonkey they took another path (Mozilla’s New JavaScript Value Representation, backup link). The value is now always stored in a 64-bit double precision variable. When the double is a normalized one, it can be used directly in calculations. However if the high 16 bits of it are all 1s, which denote an NaN, the low 32-bits will store the address (in a 32-bit computer) to the value or the value directly, the remaining 16-bits will be used to store the type. This technique is called NaN-boxing or nun-boxing. It's also used in 64-bit WebKit's JavaScriptCore and Mozilla's SpiderMonkey with the pointer being stored in the low 48 bits. If your main data type is floating-point, this is the best solution and delivers very good performance.

Read more about the above techniques: https://wingolog.org/archives/2011/05/18/value-representation-in-javascript-implementations

Kries answered 19/7, 2015 at 18:15 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.