Why does the first element outside of a defined array default to zero?
H

5

92

I'm studying for the final exam for my introduction to C++ class. Our professor gave us this problem for practice:

Explain why the code produces the following output: 120 200 16 0

#include <iostream>
using namespace std;
int main()
{
  int x[] = {120, 200, 16};
  for (int i = 0; i < 4; i++)
    cout << x[i] << " ";
}

The sample answer for the problem was:

The cout statement is simply cycling through the array elements whose subscript is being defined by the increment of the for loop. The element size is not defined by the array initialization. The for loop defines the size of the array, which happens to exceed the number of initialized elements, thereby defaulting to zero for the last element. The first for loop prints element 0 (120), the second prints element 1 (200), the third loop prints element 2 (16) and the fourth loop prints the default array value of zero since nothing is initialized for element 3. At this point i now exceeds the condition and the for loop is terminated.

I'm a bit confused as to why that last element outside of the array always "defaults" to zero. Just to experiment, I pasted the code from the problem into my IDE, but changed the for loop to for (int i = 0; i < 8; i++). The output then changed to 120 200 16 0 4196320 0 547306487 32655. Why is there not an error when trying to access elements from an array that is outside of the defined size? Does the program just output whatever "leftover" data was there from the last time a value was saved to that memory address?

Hightest answered 13/12, 2021 at 20:46 Comment(22)
The behavior is undefined. Everything else doesn't matter.Runkel
That's so-called "undefined behaviour" which you can't rely on.Atween
It does not default to zero. The sample answer is wrong. Undefined behaviour is undefined.Rainfall
"The for loop defines the size of the array" --> No and "thereby defaulting to zero for the last element." --> No. Ask for tuition refund.Haematoblast
"Why is there not an error when trying to access elements from an array that are outside of the defined size" Because C++ won't hold your hand, at least not by default. I get an error if I compile with -fsanitize=address (on Clang, different compilers need different flags).Runkel
"The element size is not defined by the array initialization. The for loop defines the size of the array, ..." Both these statements are wrong.Disillusion
I can't believe the professor is saying that the size of the array will be 4 because the loop tries to access 4 elements. The size of the array is 3, because that is how many elements the array is initialized with.Oeflein
Would make sense if int x[4] = {120, 200, 16};Haematoblast
Related: https://en.cppreference.com/w/cpp/language/ub Jaban
Printing the last value does NOT print 0. I just tried it at godbolt.org/z/YdanEn4bd and it printed 120 200 16 3 but in reality you are unlucky that the program didn't crash because at least then you would know you aren't allowed to do that. And array bounds checking is possible: #4779052 you can use -O -fbounds-check which will sometimes catch it. Here is your program with the correct error message: godbolt.org/z/GG5Ts15E5 Stinkpot
@SamVarshavchik second that. It's unbelievable how many people in higher education haven't even understood the simple basics. Worse, then they teach others.Martinmas
@JerryJeremiah "you are unlucky that the program didn't crash" in my experience, when working within an operating system, the OS does not do any sort of overflow checking at the byte or word boundary, but at the page boundary so overflown by one byte is rarely caught until much later in the control flow if at all. On an embedded platform with minimal to no OS services behavior is even more fun because "crashes" are vastly different.Winson
The professor may be confusing this with how the argv array works, or assuming that it applies to all arrays. argv is required to have an extra NULL element.Kushner
Oh, this example crashes when compiled with C++20 and -O2; it seems that the loop end condition is optimised away: godbolt.org/z/hs6zeWbhj It managed to print quite a few values.Civet
Is this really the question or has it been transformed somehow? Compare with boost unbounded_array. You also need to tell us the compiler and the flags, despite "undefined" there is "predictable". at least this example is reading and not clobbering.Commercialism
@Winson I pretty much only did embedded stuff - although nowadays I have more than 256 bytes of RAM. When you only have 256 bytes of RAM any overflow writes over something important on the stack and the processor resets - without any debugging info. The answer is "don't write code with bugs"Stinkpot
And I was wondering the other day how we still get buffer overflows these days....Oncoming
What I strongly suspect is that the professor didn't write down the code himself, but rather one of his assistant students, and they forgot the [4] in int x[]. Either that or the prof himself just made a typo. I can't believe a friggin PROFESSOR would make such a mistake, truly believing it to be correct ...Sklar
@SamVarshavchik: Either incompetent or merely careless and forgot the 4 in int x[4], as other comments have pointed out. I certainly thought careless at first, until I noticed chux's comment that it would be correct if the declared size of the array was big enough and larger than the initializer list. So let's be careful about being rude about it; at best this was an innocent but highly confusing mistake, written about code that existed in the prof's head, not what the students got. At worst, they don't have a clue how it works, or were assuming that fresh stack memory in main is 0.Impersonal
Memory is often initialised to 0 so while the behaviour is undefined, the fact it happens to be 0 is unsurprising, however it could be anything, or cause an error.Biolysis
It could be that they're getting mixed up with C-style strings, which are "null-terminated" arrays of characters meaning they have no fixed size and their end is determined by the first instance of a char containing 0.Excitement
Does this answer your question? Initialization of all elements of an array to one default value in C++?Lysenko
F
97

I'm a bit confused as to why that last element outside of the array always "defaults" to zero.

In this declaration

int x[] = {120, 200, 16};

the array x has exactly three elements. So accessing memory outside the bounds of the array invokes undefined behavior.

That is, this loop

for (int i = 0; i < 4; i++)
  cout << x[i] << " ";

invokes undefined behavior. The memory after the last element of the array can contain anything.

On the other hand, if the array were declared as

int x[4] = {120, 200, 16};

that is, with four elements, then the last element of the array, which has no explicit initializer, will indeed be initialized to zero.
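
As a minimal sketch of that well-defined variant, assuming the array really was meant to have an explicit size of 4: the function name format_padded_array is made up for illustration and is not part of the original problem. It collects the output in a string instead of writing to cout directly, so the result is easy to inspect.

```cpp
#include <sstream>
#include <string>

// Hypothetical helper (not from the original problem): builds the output
// line in a string so the result can be inspected as well as printed.
std::string format_padded_array()
{
    // Four elements are requested but only three initializers are given,
    // so the remaining element x[3] is initialized to zero.
    int x[4] = {120, 200, 16};

    std::ostringstream out;
    // A range-based for loop visits exactly the four elements of x
    // and can never step outside the array.
    for (int value : x)
        out << value << " ";
    return out.str();
}
```

Writing std::cout << format_padded_array(); (with <iostream> included) prints 120 200 16 0, and here the trailing 0 is guaranteed by the language rather than being an accident.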

Fusiform answered 13/12, 2021 at 20:53 Comment(7)
So the answer is 'by sheer luck'Assured
@Assured In a sense, but more specifically it is likely "implementation defined behavior, dependent on compiler flags". If the result is consistently zero, something must set it to zero.Willawillabella
@Willawillabella I do not see evidence for that claim.Assured
@Willawillabella Please note that implementation-defined behaviour has a very specific meaning in the context of the C and C++ standards, and this is not it. Undefined behaviour is a much stronger claim with more far-reaching consequences. See this overview.Electropositive
@Electropositive it is a standard behavior not implementation defined behavior.Fusiform
@kdb: We don't use the term "implementation-defined" to describe what actually happened in cases of UB. It's obviously not actually going to be nasal demons; instead it depends on the details of the asm the compiler happened to produce, and what was in memory previously. "implementation-defined" would imply that the actual compiler actually took care to make sure you'd get zero, rather than happening to let you read some stack memory that was still zeroed by the kernel (like all fresh pages are to avoid leaking kernel data). That would explain an unoptimised build always printing 0.Impersonal
More strongly, the whole program has undefined behaviour. It doesn't have to print 4 numbers, it can print 3, or 5, or format your hard drive.Thusly
R
51

It does not default to zero. The sample answer is wrong. Undefined behaviour is undefined; the value may be 0, it may be 100. Accessing it may cause a seg fault, or cause your computer to be formatted.

As to why it's not an error, it's because C++ is not required to do bounds checking on arrays. You could use a vector and use the at function, which throws exceptions if you go outside the bounds, but arrays do not.
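
A small sketch of the vector alternative, assuming the same three values as in the question. The function name describe_element is invented for this example; only std::vector::at and std::out_of_range come from the standard library.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper: reads element i with bounds checking and reports
// the outcome as text instead of invoking undefined behaviour.
std::string describe_element(const std::vector<int>& values, std::size_t i)
{
    try {
        // at() checks the index and throws std::out_of_range
        // when i >= values.size().
        return std::to_string(values.at(i));
    } catch (const std::out_of_range&) {
        return "out of range";
    }
}
```

With std::vector<int> x = {120, 200, 16};, describe_element(x, 2) yields "16" while describe_element(x, 3) yields "out of range". Note that values[i] on a vector is unchecked, just like a raw array; only at() performs the check.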

Rainfall answered 13/12, 2021 at 20:52 Comment(14)
To not scare OP, while it could theoretically generate code that formats your computer, what usually happens is you get a "random" number, which is usually what the memory contains at that location. Compilers nowadays protect programmers from themselves.Tarsal
@Tarsal When it comes to undefined behaviour, no compiler can protect all programmers from themselves all the time. Compilers have a finite capacity to detect badly behaving code, a finite capacity to isolate the effects of badly behaving code, and therefore a finite capacity to protect the programmer from consequences of the programmer's mistakes. Whereas, the capacity of programmers to make mistakes is infinite.Loewe
I really dislike scare examples like "or cause your computer to be formatted". While it's true that compilers assuming that undefined behaviour doesn't happen can lead into really surprising results, it's still rather difficult to see how the code for destroying the computer would magically appear. Unless the program already contains such code, but then it's a question of just program flow jumping around due to the UB, which is quite less far-fetched.Exhilarate
@Exhilarate Nasal demons. (Google that term.) Undefined behavior gives the implementation carte blanche to do anything and still be deemed compliant with the standard. The easy thing to do from an implementer's point of view in cases such as this is to ignore the UB. If an implementation detects UB, it would be nice if the implementation reported that as a problem. But an implementation that detects UB is also free to issue code that reformats your hard drive or that magically makes demons jump out of the programmer's nose. Invoking UB is supposed to be scary.Estey
@DavidHammen, yes, and if the implementation ignores the UB, or just does something with the assumption that UB can't happen (like in the famous Linux bug where they dereferenced a pointer before checking if it was NULL), then it does something, probably something wrong, but an implementation that inserts code to be damaging just "because the standard allows it to" is actively malicious, and the problem isn't with the buggy code any longer.Exhilarate
My point is that scary stories with fantastical results like that, repeated as memes, are not too productive. Focusing on realistic or real issues, ones that stem from logic that's by itself innocent and even sensible would be more useful. (Though of course in that case with Linux, the opinions vary on if the compiler logic was "sensible".)Exhilarate
@Exhilarate It cannot do something wrong, as anything goes, but unintended and harmful are certainly expected. And the point of the fantastic and catastrophic results (I'm partial to the classics, so "nasal demons") is that anything goes, even if due to lots of safeguards and the difficulty in providing enough monkeys to reliably trigger the infinite monkey syndrome something while puzzling being still more innocuous is more likely.Messene
@Deduplicator, For a perfect programmer, there is of course no "wrong" in case of a programmer error. But regular users aren't perfect, and "wrong" is more usefully defined against the user's reasonable expectations. Anyway, the reality is that on a modern OS, UB can't cause catastrophic results like that, since a regular user program can't access hardware or anyone's nose. Of course on an unprotected OS, UB can plausibly result, through some logical steps, in random I/O accesses leading to bad things, but that's a far cry from accepting a compiler to explicitly create such results.Exhilarate
@Exhilarate Nobody said the compiler (unless for the DS9K) deliberately steering UB into full-blown catastrophe would be acceptable. The point was more that if all the wrong circumstances apply, it might happen, however unlikely it is.Messene
@Exhilarate You are imagining that the computer has an MMU. If you have memory mapped IO and no memory protection then any overflow that writes over the return address could jump anywhere and do anything. Writing into a memory mapped IO location that controls the disk is a definite possibility - I had a bug once that caused intermittent interrupts that wrote a single random character to a random place on the disk so every so often one character in one file would change for no reason.Stinkpot
@Tarsal it's a bit of a stretch to say that modern compilers protect programmers. In some ways, modern compilers are worse than older compilers, since there has been a tendency in recent years for compilers to treat UB as an opportunity for aggressive optimisations, making the resulting UB more surprising than it would otherwise have been.Harod
@Offtkp: Compilers don't do anything remotely like "protecting programmers from themselves". Multitasking OSes with memory protection do that, meaning you pretty much can't accidentally do I/O in a user-space program. Compilers on the other hand aggressively assume no UB, and will treat paths of execution that lead to compile-time-visible UB as if they had __builtin_unreachable(), and will just stop emitting instructions for them, so you get functions with no ret or w/e. And Does the C++ standard allow for an uninitialized bool to crash a program?Impersonal
@Offtkp: Unless you mean compilers protect programmers by warning them. But here, even with compile-time loop unrolling by clang, we don't get a warning: godbolt.org/z/r1KdsMKh8. You'd have to use -fsanitize=undefined - godbolt.org/z/GWYeE457j Impersonal
@PeterCordes That's an interesting post. I do concede that it was a stretch to claim that compilers protect programmers from themselves.Tarsal
C
31

This is undefined behaviour; that is the only valid answer. The compiler expects your array x to contain exactly three elements. What you see in the output when reading a fourth integer is unpredictable, and on some systems or processors the read may trigger a hardware interrupt because the address is not mapped (the system does not know how to access physical memory at that address). The compiler might reserve memory for x on the stack, or might use registers (as it's very small). The fact that you get 0 is accidental. With the address sanitizer in Clang (the -fsanitize=address option) you can see this:

https://coliru.stacked-crooked.com/a/993d45532bdd4fc2

the short output is:

==9469==ERROR: AddressSanitizer: stack-buffer-overflow

You can investigate it even further, on compiler explorer, with un-optimized GCC: https://godbolt.org/z/8T74cr83z (includes asm and program output)
In that version, the output is 120 200 16 3 because GCC put i on the stack after the array.

You will see that GCC generates the following assembly for your array:

    mov     DWORD PTR [rbp-16], 120    # array initializer
    mov     DWORD PTR [rbp-12], 200
    mov     DWORD PTR [rbp-8], 16
    mov     DWORD PTR [rbp-4], 0       # i initializer

so, indeed - there is a fourth element with 0 value. But it's actually the i initializer, and has a different value by the time it's read in the loop. Compilers don't invent extra array elements; at best there will just be unused stack space after them.

Note the optimization level of this example: it's -O0, so minimal optimizations for consistent debugging; that's why i is kept in memory instead of a call-preserved register. Start adding optimizations, let's say -O1, and you will get:

    mov     DWORD PTR [rsp+4], 120
    mov     DWORD PTR [rsp+8], 200
    mov     DWORD PTR [rsp+12], 16

More optimizations may optimize your array entirely, for example unrolling and just using immediate operands to set up calls to cout.operator<<. At that point the undefined-behaviour would be fully visible to the compiler and it would have to come up with something to do. (Registers for the array elements would be plausible in other cases, if the array values were only ever accessed by a constant (after optimization) index.)

Cavorelievo answered 13/12, 2021 at 21:8 Comment(6)
"memory on stack" I don't believe the standard says a declaration like this must be on the stack, most if not all compilers will put it on the stack but the standard is ambivalent.Winson
@sam I agree, the compiler might put such an array into registers, like I showed with compiler explorer. I will clarify my first sentence.Cavorelievo
@Sam: Indeed, a few C and C++ implementations don't use an asm "stack" at all, instead using dynamic allocation of automatic storage (notably IBM zSeries: Does C need a stack and a heap in order to run?). The standard says every object has an address (except register vars), but putting objects in registers is allowed per the as-if rule. Of course none of this implies anything about any behaviour required by the standard for this case; there is none for the whole program before or after the bad access; that's the whole point of UB.Impersonal
But yes, compilers will compile it into some concrete behaviour for a given build; if they don't fully unroll the loop then there will definitely be an array in memory somewhere to index (since you can't variably index regs). If they don't spot the UB at compile time, you might even predict some of the possible things that could happen. If they do notice the UB, your compiler might just stop generating code for this path of execution, e.g. letting execution fall into whatever function is linked next after main. Or emit an illegal instruction like x86 ud2.Impersonal
The fourth element with value 0 under -O0 is actually the initial value for variable i.Makkah
@ralphmerridew: Well spotted; I didn't really read this answer carefully when I looked at it previously! The version linked on Godbolt actually prints 3, not 0. I edited this answer to correct the section talking about the version on Godbolt, because it was going in the wrong direction from wrong facts. Compilers don't invent array elements; at best there happens to be padding after them which they don't initialize. (Sometimes to their detriment, e.g. it's cheaper to init a 16-byte array than 15, and the extra byte is just padding in this case: godbolt.org/z/Er5GWqhY4)Impersonal
S
12

Correcting the answer

No, it doesn't default to 0. It's undefined behaviour. It just happened to be 0 under these conditions, with this optimization level and this compiler. Trying to access uninitialized or unallocated memory is undefined behaviour.

Because it's literally "undefined" and the standard has nothing else to say about this, your assembly output is not going to be consistent. The compiler might store the array in an SIMD register, who knows what the output will be?

Quote from the sample answer:

and the fourth loop prints the default array value of zero since nothing is initialized for element 3

That's the most wrong statement ever. I guess there's a typo in the code and they wanted to make it

int x[4] = {120, 200, 16};

and mistakenly turned x[4] into just x[]. If not, and it was intentional, I don't know what to say. They're wrong.

Why isn't it an error?

It's not an error because that's how the stack works. Your application doesn't need to allocate memory in the stack to use it, it's already yours. You may do whatever with your stack as you wish. When you declare a variable like this:

int a;

all you're doing is telling the compiler, "I want 4 bytes of my stack to be for a, please don't use that memory for anything else." at compile time. Look at this code:

#include <stdio.h>

int main() {
    int a;
}

Assembly:

    .file   "temp.c"
    .text
    .globl  main
    .type   main, @function
main:
.LFB0:
    .cfi_startproc
    endbr64
    pushq   %rbp
    .cfi_def_cfa_offset 16
    .cfi_offset 6, -16
    movq    %rsp, %rbp
    .cfi_def_cfa_register 6 /* Init stack and stuff */
    movl    $0, %eax
    popq    %rbp
    .cfi_def_cfa 7, 8
    ret /* Pop the stack and return? Yes. It generated literally no code.
           All this just makes a stack, pops it and returns. Nothing. */
    .cfi_endproc /* Stuff after this is system info, and other stuff
                 we're not interested. */
.LFE0:
    .size   main, .-main
    .ident  "GCC: (Ubuntu 11.1.0-1ubuntu1~20.04) 11.1.0"
    .section    .note.GNU-stack,"",@progbits
    .section    .note.gnu.property,"a"
    .align 8
    .long   1f - 0f
    .long   4f - 1f
    .long   5
0:
    .string "GNU"
1:
    .align 8
    .long   0xc0000002
    .long   3f - 2f
2:
    .long   0x3
3:
    .align 8
4:

Read the comments in the code for explanation.

So, you can see int a; does nothing. And if I turn on optimisations, the compiler won't even bother setting up a stack frame and doing all that stuff, and will instead return directly. int a; is just a compile-time instruction to the compiler that says:

a is a variable that is a signed int. It needs 4 bytes; please continue the declarations after skipping these 4 bytes (and alignment).

Variables in high-level languages (those on the stack) only exist to make the "distribution" of the stack more systematic and readable. The declaration of a variable is not a run-time process. It just teaches the compiler how to distribute the stack among the variables and prepare the program accordingly. When executing, the program allocates a stack (that's a run-time process), but which variable gets which part of the stack is already hardcoded. For example, variable a might get the 4 bytes starting at -4(%rbp) while b gets the 4 bytes starting at -8(%rbp). These offsets are determined at compile time. Names of variables also don't exist at run time; they're just a way to teach the compiler how to prepare the program to use its stack.

You, as the user, can physically use the stack as freely as you like, but you shouldn't. You should always declare the variable or the array to let the compiler know.

Bounds checking

In languages like Go, even though your stack is yours, the compiler will insert extra checks to make sure you're not using undeclared memory by accident. C and C++ skip these checks for performance reasons, which is why an out-of-bounds access there ends in the dreaded undefined behaviour or a segmentation fault instead of a clean error.
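
To make the idea of an inserted check concrete, here is a rough sketch of what such a check amounts to when written by hand for a raw C++ array. The name checked_at is made up for this example; it is not a standard function, and real bounds-checked languages generate something equivalent behind the scenes rather than exposing a helper like this.

```cpp
#include <cstddef>
#include <stdexcept>

// Hypothetical helper illustrating the kind of check a bounds-checked
// language performs on every index; C++ leaves this to the programmer.
// The array size N is deduced from the argument's type at compile time.
template <std::size_t N>
int checked_at(const int (&array)[N], std::size_t index)
{
    // One comparison and branch per access: this is the cost that
    // C and C++ avoid by not checking.
    if (index >= N)
        throw std::out_of_range("index past the end of the array");
    return array[index];
}
```

Given int x[] = {120, 200, 16};, the call checked_at(x, 2) returns 16, whereas checked_at(x, 3) throws std::out_of_range instead of reading whatever happens to sit after the array.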

Heap and data section

The heap is where large data gets stored. No named variables live here, only data; one or more of your variables will hold pointers to that data. If you use memory that you haven't allocated (allocation is done at run time), you may get a segmentation fault, though that is not guaranteed.

The data section is another place where things can be stored. Variables can be stored here. It's stored alongside your code, so exceeding an allocation is quite dangerous, as you may accidentally modify the program's code. As it's stored with your code, it's also allocated at compile time. I don't actually know much about memory safety in the data section. Apparently you can exceed it without the OS complaining, but I know no more, as I'm no system hacker and have no reason to exploit this. Hope someone will comment (or answer) about it.

All assembly shown above is compiled C by GCC 11.1 on an Ubuntu machine. It's in C and not C++ to improve readability.

Sprouse answered 14/12, 2021 at 16:50 Comment(3)
"I guess there's a typo in the code and they wanted to make it int x[4]..." - they also said "The for loop defines the size of the array", so it seems like it's not a typo, but they're simply wrong.Emmons
^ Personally, it's that latter quote ("The for loop defines the size of the array") that jumps out at me as the most wrong statement in the instructor solution. It doesn't even make any sense at all.Loads
@DanielR.Collins What does that even mean? Does it mean that the array is like a list, to which data is added in each iteration? What the.....?Sprouse
W
6

The element size is not defined by the array initialization. The for loop defines the size of the array, which happens to exceed the number of initialized elements, thereby defaulting to zero for the last element.

This is flat-out incorrect. From section 11.6.1p5 of the C++17 standard:

An array of unknown bound initialized with a brace-enclosed initializer-list containing n initializer-clauses, where n shall be greater than zero, is defined as having n elements (11.3.4). [ Example:

int x[] = { 1, 3, 5 };

declares and initializes x as a one-dimensional array that has three elements since no size was specified and there are three initializers. — end example ]

So for an array without an explicit size, the initializer defines the size of the array. The for loop reads past the end of the array, and doing so triggers undefined behavior.

The fact that 0 is printing for the non-existent 4th element is just a manifestation of undefined behavior. There's no guarantee that that value will be printed. In fact, when I run this program I get 3 for the last value when I compile with -O0 and 0 when compiling with -O1.
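
The size rule quoted above can be checked at compile time. The following is a small sketch using the question's values; the names x_count and sum_in_bounds are invented for illustration.

```cpp
#include <cstddef>

// The bound is deduced from the three initializers, exactly as the
// quoted paragraph of the standard describes.
constexpr int x[] = {120, 200, 16};

// Element count computed from the array's own type, not from any loop.
constexpr std::size_t x_count = sizeof(x) / sizeof(x[0]);

// Fails to compile if the array had any size other than 3.
static_assert(x_count == 3, "the initializer list fixes the size at 3");

// Hypothetical helper: sums the elements without ever leaving the array,
// because the loop bound comes from x_count rather than a magic number.
int sum_in_bounds()
{
    int total = 0;
    for (std::size_t i = 0; i < x_count; i++)
        total += x[i];
    return total;
}
```

Deriving the loop bound from the array itself, instead of hard-coding 4, removes the out-of-bounds read entirely.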

Waitress answered 16/12, 2021 at 4:13 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.