Does C99 guarantee that arrays are contiguous?
Asked Answered
N

3

21

Following an hot comment thread in another question, I came to debate of what is and what is not defined in C99 standard about C arrays.

Basically when I define a 2D array like int a[5][5], does the standard C99 garantee or not that it will be a contiguous block of ints, can I cast it to (int *)a and be sure I will have a valid 1D array of 25 ints.

As I understand the standard the above property is implicit in the sizeof definition and in pointer arithmetic, but others seems to disagree and says casting to (int*) the above structure give an undefined behavior (even if they agree that all existing implementations actually allocate contiguous values).

More specifically, if we think an implementation that would instrument arrays to check array boundaries for all dimensions and return some kind of error when accessing 1D array, or does not give correct access to elements above 1st row. Could such implementation be standard compilant ? And in this case what parts of the C99 standard are relevant.

Novelty answered 14/5, 2010 at 9:13 Comment(0)
L
19

We should begin with inspecting what int a[5][5] really is. The types involved are:

  • int
  • array[5] of ints
  • array[5] of arrays

There is no array[25] of ints involved.

It is correct that the sizeof semantics imply that the array as a whole is contiguous. The array[5] of ints must have 5*sizeof(int), and recursively applied, a[5][5] must have 5*5*sizeof(int). There is no room for additional padding.

Additionally, the array as a whole must be working when given to memset, memmove or memcpy with the sizeof. It must also be possible to iterate over the whole array with a (char *). So a valid iteration is:

int  a[5][5], i, *pi;
char *pc;

pc = (char *)(&a[0][0]);
for (i = 0; i < 25; i++)
{
    pi = (int *)pc;
    DoSomething(pi);
    pc += sizeof(int);
}

Doing the same with an (int *) would be undefined behaviour, because, as said, there is no array[25] of int involved. Using a union as in Christoph's answer should be valid, too. But there is another point complicating this further, the equality operator:

6.5.9.6 Two pointers compare equal if and only if both are null pointers, both are pointers to the same object (including a pointer to an object and a subobject at its beginning) or function, both are pointers to one past the last element of the same array object, or one is a pointer to one past the end of one array object and the other is a pointer to the start of a different array object that happens to immediately follow the first array object in the address space. 91)

91) Two objects may be adjacent in memory because they are adjacent elements of a larger array or adjacent members of a structure with no padding between them, or because the implementation chose to place them so, even though they are unrelated. If prior invalid pointer operations (such as accesses outside array bounds) produced undefined behavior, subsequent comparisons also produce undefined behavior.

This means for this:

int a[5][5], *i1, *i2;

i1 = &a[0][0] + 5;
i2 = &a[1][0];

i1 compares as equal to i2. But when iterating over the array with an (int *), it is still undefined behaviour, because it is originally derived from the first subarray. It doesn't magically convert to a pointer into the second subarray.

Even when doing this

char *c = (char *)(&a[0][0]) + 5*sizeof(int);
int  *i3 = (int *)c;

won't help. It compares equal to i1 and i2, but it isn't derived from any of the subarrays; it is a pointer to a single int or an array[1] of int at best.

I don't consider this a bug in the standard. It is the other way around: Allowing this would introduce a special case that violates either the type system for arrays or the rules for pointer arithmetic or both. It may be considered a missing definition, but not a bug.

So even if the memory layout for a[5][5] is identical to the layout of a[25], and the very same loop using a (char *) can be used to iterate over both, an implementation is allowed to blow up if one is used as the other. I don't know why it should or know any implementation that would, and maybe there is a single fact in the Standard not mentioned till now that makes it well defined behaviour. Until then, I would consider it to be undefined and stay on the safe side.

Livingstone answered 14/5, 2010 at 13:2 Comment(14)
@Secure: I believe the rationale behind this definition is related to cellperformance.beyond3d.com/articles/2006/06/…. After reading this I believe that the standard chose a larger than necessary undefined behavior and that stating that concurrent accesses both through original pointer and casted one has undefined behavior would be enough, but OK they are on the safe side.Novelty
@Secure: so would you agree that, had the original integer type used in the array been char (or unsigned char?) instead of int, things like a[0][6] would be valid and well-defined?Standoff
@R..: No, this is explicitely listed as undefined behaviour. J.2: "An array subscript is out of range, even if an object is apparently accessible with the given subscript (as in the lvalue expression a[1][7] given the declaration int a[4][5])(6.5.6)."Livingstone
@Secure: But any object can be accessed as an overlaid array of unsigned char of size equal to the size of the object. J.2 uses the example of int, and to me it seems clear from this other requirement that J.2 would not be an example of UB if the type were unsigned char, but I thought I'd ask you since you seem to have put a lot of thought in these issues too.Standoff
@R..: But it is not an overlaid array of chars, you still access it as an array[5][5]. This is a different issue. The array subscript out of range UB doesn't make an exception for any type, like this from J.2: "A trap representation is read by an lvalue expression that does not have character type (6.2.6.1)." Thus it is always undefined behaviour.Livingstone
Well &array[0][0] and *(unsigned char (*)[25])&array and (unsigned char *)array and array[0] all evaluate to identical pointers to unsigned char. As far as I know, they're required to be equal (compare equal with ==). How is it valid to access the overlaid array of type unsigned char [25] with some but not others - and which ones is it valid to use? J.2 is informative, and presumably correct in the example it gives, but that doesn't mean it extends to other examples that seem similar on the surface.Standoff
@R..: As far as I see it you're right here, but the subscript out of range is a completely unrelated issue. Your original question was for a[0][6], with the declaration char a[5][5]. This is UB, no matter what. It is valid to use char *p = &a[3][4]; and access p[0] to p[5]. Taking the address &p[6] is still valid, but accessing p[6] is outside of the object, thus UB. Accessing a[0][6] is outside of the object a[0], which has type array[5] of chars. The type of the result is irrelevant, it is important how you reach it.Livingstone
@Secure: From what I understand of the C standard, it would be legal for every pointer to be implemented as a structure holding the current value along with maximum and minimum legitimate values. In such a case, given "char a[5][5];", assuming 'a' ends up at address 1000 (decimal) "(char*)a" would be a pointer whose current and minimum values were 1000 and whose maximum value was 1025; "(char*)(a[1])" would be a pointer whose current and minimum value were 1005, and whose maximum value was 1010. Such an implementation could trap any time a pointer was indexed illegitimately.Meingolda
@supercat: The min/max trapping for character pointer types is not valid unless you have a way of detecting the "entire object" the pointer is pointing into and expanding the min/max to encompass the whole thing. That's because (char *)&a[0][0] is a pointer both to the first byte of the int at a[0][0] and a pointer to the first byte of the array, these two pointers must compare equal, and it's valid to use it to alias the whole object (the 2D array).Standoff
@R.: If a[0][6] is forbidden on an array declared as [5][5], would that not imply that the pointer could not be legitimately used for the whole object?Meingolda
@R.: Or are you suggesting that a pointer to the int at a[0][0] could be typecast back to a pointer to the 2d array. I guess that would be permissible, so the compiler would have to have some means of keeping track of what was going on. If each "pointer" held a reference to a descriptor of the calloc'ed object, along with an offset from the base of that object, the compiler could enforce the sizes of sub-arrays for things that were statically positioned within the calloc'ed memory. What does the standard say about aliasing structures onto something other than...Meingolda
@R:...the start of a calloc'ed area? I know I've seen plenty of code that does so, but is the behavior of such code guaranteed? For example, given two structs struct1 and struct2, can one calloc sizeof(struct1)+sizeof(struct2) bytes and then place struct 2 at the start of the area + sizeof(struct1) bytes, or is that ub? I would be inclined to guess UB, since there's no guarantee that sizeof(struct1) would place struct2 at a place meeting its alignment requirements.Meingolda
@supercat: The alignment issue is easy to fix. Just round up the offset to a multiple of sizeof(struct2). Alignment always divides the size of an object.Standoff
@Livingstone Why is it valid to iterate over the 2D array with a char pointer?Orchitis
P
13

I've added some more comments to our original discussion.

sizeof semantics imply that int a[5][5] is contiguous, but visiting all 25 integers via incrementing a pointer like int *p = *a is undefined behaviour: pointer arithmetics is only defined as long as all pointers invoved lie within (or one element past the last element of) the same array, as eg &a[2][1] and &a[3][1] do not (see C99 section 6.5.6).

In principle, you can work around this by casting &a - which has type int (*)[5][5] - to int (*)[25]. This is legal according to 6.3.2.3 §7, as it doesn't violate any alignment requirements. The problem is that accessing the integers through this new pointer is illegal as it violates the aliasing rules in 6.5 §7. You can work around this by using a union for type punning (see footnote 82 in TC3):

int *p = ((union { int multi[5][5]; int flat[25]; } *)&a)->flat;

This is, as far as I can tell, standards compliant C99.

Pompeii answered 14/5, 2010 at 9:28 Comment(5)
He could pass the int(*)[25] to another function legally, right? (as long as he doesn't dereference it within the same scope as the original array).Asiaasian
@Daniel: that would indeed be the typical use (and would be coherent with right to call memset or memcpy). But from reading C99, I do not really succeed making my mind on the subject. For now I will probably accept @Livingstone answer, because I understand the contiguous part exactly as he explained it.Novelty
Use of a union for this is undefined behavior. With unions, you can only read from the most-recently-written member.Standoff
@R.. It will have unspecified value only if the the one you're writing to is covers more bytes than the one recently written. Otherwise, C99-wise, it's okay. On the other hand, is the order of the second dimension guaranteed? i.e. &multi[1][4] == &flat[9] ?Raceway
@syockit: Both gcc and clang are too primitive or obtuse (I don't know which) to reliably recognize that the actions of taking the address of a union member, using that pointer, and abandoning it, all without having accessed the union in any other way, should collectively behave as an access to the union object. While the Standard doesn't explicitly require such recognition even in trivially easy cases, I think it implausible that the reason for such omission was a desire to avoid stating the obvious, rather than a desire to invite compilers to be willfully blind to such possibilities.Meingolda
A
2

If the array is static, like your int a[5][5] array, it's guaranteed to be contiguous.

Abba answered 14/5, 2010 at 9:19 Comment(1)
Might be a good idea to investigate the meaning of the word "static" in C.Harvester

© 2022 - 2024 — McMap. All rights reserved.