Is unsigned char a[4][5]; a[1][7]; undefined behavior?

One of the examples of undefined behavior from the C standard reads (J.2):

— An array subscript is out of range, even if an object is apparently accessible with the given subscript (as in the lvalue expression a[1][7] given the declaration int a[4][5]) (6.5.6)

If the declaration is changed from int a[4][5] to unsigned char a[4][5], does accessing a[1][7] still result in undefined behavior? My opinion is that it does not, but I have heard from others who disagree, and I'd like to see what some other would-be experts on SO think.

My reasoning:

  • By the usual interpretation of 6.2.6.1 paragraph 4, and 6.5 paragraph 7, the representation of the object a is sizeof (unsigned char [4][5])*CHAR_BIT bits and can be accessed as an array of type unsigned char [20] overlapped with the object.

  • a[1] has type unsigned char [5] as an lvalue, but used in an expression (as an operand to the [] operator, or equivalently as an operand to the + operator in *(a[1]+7)), it decays to a pointer of type unsigned char *.

  • The value of a[1] is also a pointer to a byte of the "representation" of a in the form unsigned char [20]. Interpreted in this way, adding 7 to a[1] is valid (see the sketch below).
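
For concreteness, here is a minimal sketch of the two accesses my reasoning treats as touching the same byte (the flat unsigned char * view is the one that is clearly permitted):

unsigned char a[4][5];

int main(void)
{
    unsigned char *bytes = (unsigned char *)a; /* the object viewed as its representation, unsigned char [20] */
    bytes[1*5 + 7] = 42;                       /* byte 12 of a: clearly well-defined */
    return a[1][7];                            /* reads the same byte on every implementation I know of; this is the disputed access */
}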

Segura answered 22/9, 2010 at 3:15 Comment(5)
Maybe I'm missing something really obvious (someone yell at me if I am), but what part of your reasoning using the unsigned char example doesn't also apply to the int example described in the standard?Thirtyone
@eldarerathis: It's not spelled out this way in the standard, but you're missing what people normally refer to as the strict aliasing rules, which do not apply to character types.Segura
More directly, note that accessing an arbitrary object as an array of int or some other type that overlaps it results in undefined behavior. The only types which can be used this way are the character types (char, signed char, and unsigned char).Segura
@R: Okay, I think I see what you're getting at now. Thanks.Thirtyone
Surely your example is just a case of "apparently accessible"? The simplest and most obvious compiler implementation will work as you suggest, but it need not. This is the case with any undefined behaviour: even if it appears to work, that does not make it defined.Szeged

I would read this "informative example" in J.2 as a hint of what the standards body wanted: don't rely on an array index calculation that happens, by accident, to land inside the bounds of the "representation array". The intent is that every individual array subscript stays within its own declared range.

In particular, this allows an implementation to do aggressive bounds checking, and to bark at you either at compile time or at run time if you use a[1][7].

This reasoning has nothing to do with the underlying type.
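
As an illustration only (no particular compiler is claimed to work this way), a bounds-checking implementation could behave roughly like this hypothetical model, in which each subscript is checked against its own declared bound:

#include <assert.h>

unsigned char a[4][5];

/* hypothetical model of a checked access: every subscript is tested
   against the bound from its declaration, not against the size of the
   whole object */
static unsigned char checked_read(unsigned i, unsigned j)
{
    assert(i < 4);
    assert(j < 5);
    return a[i][j];
}

int main(void)
{
    checked_read(1, 2);   /* passes both checks */
    checked_read(1, 7);   /* aborts here, even though byte 1*5+7 lies inside the object */
    return 0;
}

Under such an implementation a[1][7] is diagnosed no matter whether the element type is int or unsigned char.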

Pyxie answered 22/9, 2010 at 6:41 Comment(0)

A compiler vendor who wants to write a conforming compiler is bound by what the Standard says, not by your reasoning. The Standard says that an out-of-range array subscript is undefined behaviour, without any exception, so the compiler is allowed to blow up.

To cite my comment from our last discussion (Does C99 guarantee that arrays are contiguous?)

"Your original question was for a[0][6], with the declaration char a[5][5]. This is UB, no matter what. It is valid to use char *p = &a[3][4]; and access p[0] to p[5]. Taking the address &p[6] is still valid, but accessing p[6] is outside of the object, thus UB. Accessing a[0][6] is outside of the object a[0], which has type array[5] of chars. The type of the result is irrelevant, it is important how you reach it."

EDIT:

There are plenty of cases of undefined behaviour where you have to scan the whole Standard, collect the facts, and combine them before you finally reach the conclusion of undefined behaviour. This one is explicit; you even cite the relevant sentence from the Standard in your question. It leaves no room for workarounds.

I'm just wondering how much more explicit the reasoning has to be before you are convinced that it really is UB.

EDIT 2:

After digging through the Standard and collecting information, here is another relevant citation:

6.3.2.1 - 3: Except when it is the operand of the sizeof operator or the unary & operator, or is a string literal used to initialize an array, an expression that has type ‘‘array of type’’ is converted to an expression with type ‘‘pointer to type’’ that points to the initial element of the array object and is not an lvalue. If the array object has register storage class, the behavior is undefined.

So I think this is valid:

unsigned char *p = a[1]; 
unsigned char c = p[7]; // Strict aliasing not applied for char types

This is UB:

unsigned char c = a[1][7];

Because a[1] is no longer an lvalue at this point; it is evaluated further, and indexing it with 7 violates J.2 by using an array subscript that is out of range. What really happens depends on how the compiler actually implements indexing into multidimensional arrays, so you may be right that it makes no difference on any known implementation. But that is still a valid case of undefined behaviour. ;)

Embry answered 22/9, 2010 at 11:26 Comment(9)
If this is UB, it is purely a sort of theoretical UB which could never cause a problem in practice, because the usage is indistinguishable by a compiler from usage with perfectly well-defined behavior. Would you still claim it was UB if I serialized the pointer expression a[1] to a string with sprintf and read it back to a pointer variable with sscanf, then added 7 to that pointer? Because the value would be exactly the same as if I cast a to unsigned char * and added 12, and doing that is certainly well-defined.Segura
Regarding your last edit, how about these simplifications of your "valid version": (a[1]+0)[7], or ((unsigned char *)a[1])[7]?Segura
@R..: Yes, these may or may not be valid behaviour, after all. It depends on the evaluation process of an expression, in regard to lvalues, side-effects and sequence points, and maybe something more. It may even be different from compiler to compiler. It should be a nice academic exercise to dig deeper through the Standard and find the valid and invalid expressions for a[1][7].Embry
Fact is: once the program hits UB, the behaviour of the whole program is undefined. That isn't re-defined by the fact that this specific UB happens to work on all known compilers. Relying on it will bite you later on some exotic implementation. My rule of thumb is: if the construct is questionable, don't use it. There are a gazillion other ways to do what you want, in a well-defined way. You never need to show in your code that you're "clever". Better to show that you know good software engineering principles. ;)Embry
Suppose the standard said the program has undefined behavior if it meets some condition "X" that's provably impossible to test. Then the standard is simply including meaningless language that you can ignore. Unless someone can show the exact point at which this supposedly-out-of-bounds array access becomes out-of-bounds (i.e. minimal-difference expressions where one is valid and the other is not), my view is that this is meaningless language in the standard - that it's not only unlikely but impossible to write a compiler that meets the other conditions of the standard but rejects a[1][7].Segura
@R..: It does not matter. You do not write code exploiting undefined behaviour. Even if it works on "all" platforms for "all" time, it's still confusing at best for the next person to look at your source. At least I would be confused as hell if a[3][4] is declared and a[1][7] gets accessed. When I get confused, I barge into your office, wait until you explained why you did it, then fix the code and commit with "R.. screwed up" in the commit comment. (Just kidding, but only partially.)Palladino
"the usage is indistinguishable by a compiler from usage with perfectly well-defined behavior." "(...) Then the standard is simply including meaningless language that you can ignore." The general argument is sound, but in this case you are wrong: what if the compiler uses this standard clause to help alias analysis? Think of array types as a constraint.Quarrel
@curiousguy: Most of authors of C89 would never have even imagined that compilers would be capable of the kinds of optimization they do today, and saw no need to describe the precise limits of what optimizations a compiler can and can't do. A key feature of the design of C from the start was that given a pointer to the first element of an int[5][5];, one could easily write code to access the whole array. The Standard defined a number of ways of writing such code as equivalent, and then stated that one of them causes UB. If the authors of the Standard didn't want to weaken...Salutary
...the language, they should have specified that some of the ways of writing the code may be treated as equivalent to others at a compiler's convenience, but may allow a compiler to make optimizing assumptions that would not be allowed for the others. Best IMHO would be to say that given an int(*a)[5], the expressions a[n] and &(a[n]) would allow a compiler to assume that n was 0..4, but a compiler could not assume likewise for *(a+n) or (a+n). Unfortunately, I don't think anything in the Standard says that.Salutary

From 6.5.6/8

If both the pointer operand and the result point to elements of the same array object, or one past the last element of the array object, the evaluation shall not produce an overflow; otherwise, the behavior is undefined.

In your example, a[1] + 7 points neither to an element of the same array object a[1] nor to one past its last element, so evaluating a[1][7] is undefined behavior.
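
Spelled out in terms of the pointer arithmetic that 6.5.6 governs (a sketch using the question's declaration):

unsigned char a[4][5];

int main(void)
{
    unsigned char *row = a[1];     /* a[1] decays to a pointer to a[1][0] */
    unsigned char *ok  = row + 5;  /* one past the last element of a[1]: may be formed, but not dereferenced */
    unsigned char *bad = row + 7;  /* neither an element of a[1] nor one past its end, so 6.5.6/8 makes this undefined */
    (void)ok;
    (void)bad;
    return 0;
}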

Acanthoid answered 22/9, 2010 at 3:47 Comment(4)
My claim is that there is another array in play, what I'll call (for lack of a better word) the "representation array" of the object in the form unsigned char [sizeof object]. Since a[1] decays to a pointer that's also a pointer into this representation array, and a[1]+7 is also within the bounds of that array, I claim a[1][7] is well-defined.Segura
@R..: Of course the construct a[1][7] makes sense on the machine-code, memory-cell level. That doesn't change the fact that "7" is an out-of-range array subscript on the C language level.Palladino
@R: Although a[1] decays to a pointer which represents the same address as ((unsigned char*)a)+5, the pointer is not the same type, and nothing would forbid a compiler from storing each pointer as a base and limit, and checking the limit on all indexing and dereferencing operations.Salutary
This "representation array" is simply something you made up. It is not required by the standard, so any reasoning that requires such a thing is also not required by the standard. Your compiler might just happen to work that way, but mine might not. Something that relies on how your compiler happens to work is the definition of undefined (or unspecified) behavior.Ruffled

Under the hood, in the actual machine language, there is no difference between a[1][7] and a[2][2] for the definition int a[4][5]. As R.. said, this is because both accesses translate to the same element offset: 1*5 + 7 = 12 and 2*5 + 2 = 12 (multiplied by sizeof(int), of course). The machine language doesn't know anything about arrays, matrices or indexes; all it knows about is addresses. The C compiler above that can do anything it pleases, including naive bounds checking based on the index - a[1][7] would then be out of bounds because the array a[1] doesn't have 8 elements. In this respect there is no difference between int and char or unsigned char.
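
A quick sketch of the two subscripts collapsing to the same offset (it deliberately performs the disputed address computation, and the printed addresses are of course implementation-specific):

#include <stdio.h>

int a[4][5];

int main(void)
{
    /* on typical implementations all three expressions name the same location,
       namely the element 12 positions past &a[0][0] */
    printf("%p %p %p\n",
           (void *)&a[1][7],
           (void *)&a[2][2],
           (void *)(&a[0][0] + 1*5 + 7));
    return 0;
}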

My guess is that the difference lies in the strict aliasing rules for int versus char - even though the programmer doesn't actually do anything wrong, the compiler would be forced into a "logical" reinterpretation of the array's type which it shouldn't have to do. As Jens Gustedt said, it looks more like a way to enable strict bounds checking than a real difference between int and char.

I've done some fiddling with the VC++ compiler and it seems to behave as you'd expect. Can anyone test this with gcc? In my experience gcc is much stricter about this sort of thing.

Brat answered 22/9, 2010 at 8:16 Comment(1)
Note that given the declaration char arr[5][5];, gcc will sometimes treat code that accesses arr[0][x] as an invitation to assume that the program will never receive input that would cause x to be greater than 4, and behave in arbitrarily nonsensical fashion if it does.Salutary

When an action is simultaneously defined and characterized as undefined, the Standard allows implementations to choose freely from among the following ways of treating such cases:

  1. Give priority to the parts of the Standard and other documentation that would define the behavior.

  2. Give absolute and total priority to anything that would characterize the action as invoking Undefined Behavior, or could conceivably be interpreted in that fashion, even if the otherwise-defined behavior would have been useful.

  3. Give priority to the defined behavior, except when there is an obvious or documented reason for doing otherwise.

Implementations that respect the Spirit of C described in every version of the Committee's published Rationale to date will adopt stance #1 or #3. In the scenario described, gcc can be demonstrated to adopt stance #2. Clang adopts stance #2 in other scenarios, but I'm unaware of it doing so for this particular combination of defined and undefined behavior.

The quirky behavior of gcc in this scenario can be demonstrated below:

int arr[5][3];
int arr2[4];
static int sum_rows(int n)
{
  int count = n*3;
  int sum = 0;
  for (int i = 0; i < count; i++)
    sum += arr[0][i];   /* arr[0] has only 3 elements, so this reads out of range once n > 1 */
  return sum;
}
void test(int n)
{
    int result = sum_rows(n);   /* the call is what matters for the demonstration */
    (void)result;
    if (n < 3)
        arr2[n] = 2;
}

The machine code gcc generates for test at -O2 or -O3 is equivalent to:

void test(int n)
{
   arr2[n] = 2;
}

That is, it performs the store unconditionally, without regard for the value of n.
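
A hypothetical caller (mine, not part of the demonstration above), compiled together with the snippet above, shows what losing the check means:

/* assumes arr, arr2, sum_rows and test from the snippet above are in the same file */
int main(void)
{
    test(100);   /* with the n < 3 check optimized away, this stores to arr2[100] */
    return 0;
}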

Salutary answered 14/5 at 22:5 Comment(0)

I believe that the reason the cited (J.2) sample is undefined behavior is that the linker is not required to put the sub-arrays a[1], a[2], etc. next to each other in memory. They could be scattered across memory or they could be adjacent but not in the expected order. Switching the base type from int to unsigned char changes none of this.

Glycoside answered 22/9, 2010 at 3:23 Comment(2)
No, they certainly have to be adjacent in memory. Think about how sizeof and memcpy work. As far as I can tell, the reason is aliasing-related. You want the compiler to be able to assume a[0][i] and a[1][j] never point to the same location. But all bets are off with character types. A pointer to a character type can always alias other pointers (to any type) unless the restrict keyword is used to tell the compiler it won't.Segura
The linker does not decide object layout, ever.Quarrel
