Are char arrays guaranteed to be null terminated?
B

4

55
#include <stdio.h>

int main() {
    char a = 5;
    char b[2] = "hi"; // No explicit room for `\0`.
    char c = 6;

    return 0;
}

"Whenever we write a string, enclosed in double quotes, C automatically creates an array of characters for us, containing that string, terminated by the \0 character." (http://www.eskimo.com/~scs/cclass/notes/sx8.html)

In the above example b only has room for 2 characters so the null terminating char doesn't have a spot to be placed at and yet the compiler is reorganizing the memory store instructions so that a and c are stored before b in memory to make room for a \0 at the end of the array.

Is this expected or am I hitting undefined behavior?

Blende answered 13/9, 2021 at 12:18 Comment(12)
Re the sentence about the sequence of storage: it is not "making room for the terminator". There isn't one, and the compiler is free to store the variables in any way it chooses.Laski
The string literal is created, and that string literal contains the null terminator. At run time, the array b is initialized with the first 2 characters from the string literal, but does not contain the null terminator. (b is not a string).Elmaelmajian
A string in C is a NULL terminated char-array, so if it's not NULL-terminated it's not a string... just a char-array. Many of the string functions look for the NULL-character (e.g. to know when they should stop copying characters from one string to another), so without it, they won't work correctly (e.g. they keep copying characters until they encounter some random NULL-character somewhere in memory).Ganister
Not asking the same question exactly, but answers this question completely (does that count as a dupe?) because they're both based on the same confusion: How to initialize a char array without the null terminator?Ultrastructure
This isn't a "string array", that would be char *array_of_strings[] = {"hi", "mom"};. You can call it a string (if it has a 0 terminator, aka ASCII nul (not NULL, @Baard)), or you can call it a char array.Megrim
The last edit by @PeterCordes completely changes the question. You're trying to answer the question (that is, OPs confusion) by editing the title.Chenopod
@pipe: I think what you're getting at is that in the general case there can of course be char arrays that aren't 0-terminated, they're just arrays. That wasn't my motivation for editing, though, and I don't think that's a problem. A more specific (but accurate) title would be "can string-initialized char arrays ...", now that you've pointed out that the title I chose isn't ideal to capture the problem. But I'm hesitant to make it too technically worded, and the question body still asks a clear and specific question compatible with the title, which isn't inherently answered by the title.Megrim
@pipe: I'm not totally opposed to changing back to the orig title, but I think this one would be easier to find for most future searchers (e.g. c char array initialized string literal zero terminated is what I'd do if I was wondering... which found some SO questions that might be duplicates). I still like this title better. (Although not as much as before you made me think about it more carefully :/) "string array" seemed like a wrong title, though, like argv[] arrays terminated with NULL pointers. Maybe there's a 3rd option we'd both like.Megrim
e.g. How to initialize a char array without the null terminator? is a less-beginner version of this, with the complication that it's inside a struct. (Which doesn't actually matter.) Also Is initializing a char[] with a string literal bad practice? covers this as a reason not to use an explicit size that matches the string literal, for actual text that might be edited to a different string in future versions.Megrim
@PeterCordes No, the name of the character is actually Null! Its abbreviation, as shown in the ASCII table, is NUL. In the same way, ASCII 1 is named "Start of Heading" but is shown as SOH, and ASCII 21 "Negative Acknowledgment" is shown as NAK. Regardless, the backslash-0 character in C is called the Null character. en.wikipedia.org/wiki/C0_and_C1_control_codes Ganister
@BaardKopperud: Thanks, that makes sense, I'd wondered where a name like NUL came from. The character still shouldn't be referred to as all-caps NULL, especially in a C context where NULL is a null-pointer constant, which may or may not be defined as a plain integer literal like 0 or 0LL. It's not wrong to write "null-terminated char array", but I prefer to write "0-terminated" when I'm talking about char or other integer types, only using the word null at all to talk about pointers. A special term makes more sense for pointers since the object-representation may not be all-0.Megrim
Maybe duplicate of No compiler error when fixed size char array is initialized without enough room for null terminatorHurff
L
50

It is allowed to initialize a char array with a string if the array is at least large enough to hold all of the characters in the string besides the null terminator.

This is detailed in section 6.7.9p14 of the C standard:

An array of character type may be initialized by a character string literal or UTF−8 string literal, optionally enclosed in braces. Successive bytes of the string literal (including the terminating null character if there is room or if the array is of unknown size) initialize the elements of the array.

However, this also means that you can't treat the array as a string since it's not null terminated. So as written, since you're not performing any string operations on b, your code is fine.
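As a rough illustration of what "not a string" means in practice, here is a minimal sketch of one way to turn such an array into a real string when one is needed. The helper name to_string is made up for this example; the only assumption is that the destination has room for one extra byte.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not from the answer): copies the n bytes at src
   into dst and appends a terminating 0, so dst becomes a proper string.
   dst must have room for n + 1 bytes. */
void to_string(char *dst, const char *src, size_t n)
{
    memcpy(dst, src, n);   /* copies exactly n bytes, never reads past src */
    dst[n] = '\0';         /* add the terminator the source array lacks */
}
```

With char b[2] = "hi"; and char s[sizeof b + 1];, calling to_string(s, b, sizeof b) leaves s safe to pass to strlen or printf("%s"), while b itself should still only be used with length-aware functions such as memcpy and memcmp.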

What you can't do is initialize with a string that's too long, i.e.:

char b[2] = "hello";

This gives more initializers than can fit in the array and is a constraint violation. Section 6.7.9p2 states this as follows:

No initializer shall attempt to provide a value for an object not contained within the entity being initialized.

If you were to declare and initialize the array like this:

char b[] = "hi"; 

Then b would be an array of size 3, which is large enough to hold the two characters in the string constant plus the terminating null byte, making b a string.

To summarize:

If the array has a fixed size:

  • If the string constant used to initialize it is shorter than the array, the array will contain the characters in the string with successive elements set to 0, so the array will contain a string.
  • If the array is exactly large enough to contain the elements of the string but not the null terminator, the array will contain the characters in the string without the null terminator, meaning the array is not a string.
  • If the string constant (not counting the null terminator) is longer than the array, this is a constraint violation: the compiler must issue a diagnostic, and the behavior is undefined if the program is translated anyway.

If the array does not have an explicit size, the array will be sized to hold the string constant plus the terminating null byte.
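The legal cases in the summary can be checked with a small sketch like the following. The helper has_terminator and the array names are invented for illustration; the too-long case is left out because it should not compile cleanly.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: returns 1 if a 0 byte occurs within the n bytes
   at arr, i.e. if the array holds a string. */
int has_terminator(const char *arr, size_t n)
{
    return memchr(arr, '\0', n) != NULL;
}

char shorter[4] = "hi";   /* literal shorter than the array: the rest is set to 0 */
char exact[2]   = "hi";   /* exact fit: both characters stored, terminator dropped */
char inferred[] = "hi";   /* no explicit size: sized to 3, terminator included */
```

Here has_terminator(shorter, sizeof shorter) and has_terminator(inferred, sizeof inferred) both return 1, while has_terminator(exact, sizeof exact) returns 0.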

Louvar answered 13/9, 2021 at 12:21 Comment(9)
So if there is no room (like my example) the compiler doesn't have to add the null terminating character? Am I supposed to write something like char b[3] = "hi"; to make sure b is treated as a string?Blende
@Blende Yes. But it's better to just write char b[] = "hi";Postimpressionism
Funnily enough, my compiler is complaining about the unused variable: s.c:5:7: warning: unused variable 'b' [-Wunused-variable] and it does add a comment // No explicit room for \0. although that is not a warning in itself.Postimpressionism
@Cheatah: Or better yet (IMHO, of course), to write char *b = "hi";Pearl
@Dan: Fun fact: this is different in C++. It's not legal in C++ to make the explicit size too small to hold a trailing 0 byte if using a double-quoted initializer, so if you want that in C++ you have to write char b[] = {'h', 'i'}; Annoying sometimes for SIMD lookup tables, e.g. static char hex_lut[16] = "0123...ef"; needs a 17th byte or less readable source for the initializer in C++. Example with GCC in C vs. C++ mode, no warnings vs. an error message. godbolt.org/z/eTx94a4h7Megrim
@Cheatah, it's probably not the compiler adding the warning, just showing the line of code where b is defined, and the "no room" text was there in a comment.Bacardi
@Pearl No, that's never better. At least make it const. And as for whether a pointer to const char is better than an array, it obviously depends on the usage, since it can't be modified.Laure
This greatly improved my own understanding of strings. "Quick" question, does the program now use memory for the char[] array AND the string it was initialized with? Is there a case where the char[] simply points to the memory where the initializer string was placed? Or is the string truly duplicated in memory somewhere? Does that still hold for Harvard architecture systems?Claresta
@RDragonrydr When a string literal is used to initialize an array, its contents (up to the array size) are copied to the array. The string literal itself might also appear in memory separately, depending on the individual compiler and if the string constant is used elsewhere.Louvar
A
36

Whenever we write a string, enclosed in double quotes, C automatically creates an array of characters for us, containing that string, terminated by the \0 character.

Those notes are mildly misleading in this case. I shall have to update them.

When you write something like

char *p = "Hello";

or

printf("world!\n");

C automatically creates an array of characters for you, of just the right size, containing the string, terminated by the \0 character.

In the case of array initializers, however, things are slightly different. When you write

char b[2] = "hi";

the string is merely the initializer for an array which you are creating. So you have complete control over the size. There are several possibilities:

char b0[] = "hi";     // compiler infers size
char b1[1] = "hi";    // error
char b2[2] = "hi";    // No terminating 0 in the array. (Illegal in C++, BTW)
char b3[3] = "hi";    // explicit size matches string literal
char b4[10] = "hi";   // space past end of initializer is always zero-initialized

For b0, you don't specify a size, so the compiler uses the string initializer to pick the right size, which will be 3.

For b1, you specify a size, but it's too small, so the compiler should give you an error.

For b2, which is the case you asked about, you specify a size which is just barely big enough for the explicit characters in the string initializer, but not the terminating \0. This is a special case. It's legal, but what you end up with in b2 is not a proper null-terminated string. Since it's unusual at best, the compiler might give you a warning. See this question for more information on this case.

For b3, you specify a size which is just right, so you get a proper string in an exactly-sized array, just like b0.

For b4, you specify a size which is too big, although this is no problem. There ends up being extra space in the array, beyond the terminating \0. (As a matter of fact, this extra space will also be filled with \0.) This extra space would let you safely do something like strcat(b4, ", wrld!").
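To make the arithmetic concrete, here is a small sketch of the b4 case. The function name append_into_b4 and the out parameter are made up so the result can be inspected from a caller; only the declaration and the strcat call come from the text above.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical wrapper: builds b4 as described, appends to it, copies
   the 10 bytes into out, and returns the resulting string length. */
size_t append_into_b4(char out[10])
{
    char b4[10] = "hi";       /* b4[2] through b4[9] are all 0 */

    /* 2 existing characters + 7 appended + 1 terminator = 10 bytes,
       which is exactly the size of b4, so this does not overflow. */
    strcat(b4, ", wrld!");

    memcpy(out, b4, sizeof b4);
    return strlen(b4);
}
```

The returned length is 9 and out holds "hi, wrld!". Appending even one more character (say, the missing 'o') would need 11 bytes and overflow the array.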

Needless to say, most of the time you want to use the b0 form. Counting characters is tedious and error-prone. As Brian Kernighan (one of the creators of C) has written in this context, "Let the computer do the dirty work."

One more thing. You wrote:

and yet the compiler is reorganizing the memory store instructions so that a and c are stored before b in memory to make room for a \0 at the end of the array.

I don't know what's going on there, but it's safe to say that the compiler is not trying to "make room for a \0". Compilers can and often do store variables in their own inscrutable internal order, matching neither the order you declared them, nor alphabetical order, nor anything else you might think of. If under your compiler array b ended up with extra space after it which did contain a \0 as if to terminate the string, that was probably basically random chance, not because the compiler was trying to be nice to you and helping to make something like printf("%s\n", b) be better defined. (Under the two compilers where I tried it, printf("%s\n", b) printed hi^E and hi ??, clearly showing the presence of trailing random garbage, as expected.)
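If you do want to print an array like b without a terminator, one option is to give %s an explicit precision, which limits how many bytes it reads. A minimal sketch, where format_bounded is a name invented for this example and snprintf is used only so the result can be inspected:

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: formats at most n bytes of arr into out.
   With a precision, %s stops after n bytes, so arr does not need
   a terminating 0 as long as n does not exceed its size. */
int format_bounded(char *out, size_t out_size, const char *arr, int n)
{
    return snprintf(out, out_size, "%.*s", n, arr);
}
```

The same format works with printf directly: printf("%.*s\n", (int)sizeof b, b); prints the two characters of b and nothing after them.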

Allister answered 13/9, 2021 at 13:8 Comment(2)
The order that a compiler stores variables in is often (though it depends on the compiler) chosen so as to avoid wasted space between variables. On many platforms a 4-byte or larger variable, and in some cases an array, must begin on an address which is a multiple of 4, so if you have 2-byte or 1-byte variables around, the compiler may re-order them to avoid wasted bytes. (The size of most variable types is compiler-dependent, but a char is always exactly 1 byte by definition, so a and c may be moved so that b can start on a multiple of 4.)Oliviero
Minor nitpick: "C automatically creates...", technically speaking, as C doesn't have the same type of a runtime like an interpreted language has, it's the compiler/optimizer translating from C and outputting asm/obj (i.e. everything happening prior to any execution) while adhering to the syntax specified by the standard(s), so it might not be guaranteed to work in a standard-incomplete compiler (e.g. some golfed CC). When such code is reached in a binary, it's already in a finite state and C doesn't exist at that point anymore.Nicolis
A
6

There are two things in your question.

  1. String literal. A string literal (i.e. something enclosed in double quotes) is always a properly null-terminated string.

    char *p = "ABC";  // p references null character terminated string
    
  2. A character array may only hold as many elements as it has, so if you initialize a two-element array with a three-element string literal (two characters plus the terminator), only the first two elements are written. The array therefore does not contain a null-terminated C string.

    char p[2] = "AB";  // p is not a valid C string.
    
Abele answered 13/9, 2021 at 12:34 Comment(0)
G
2

An array of char need not be terminated by anything at all. It is an array. If the actual content is smaller than the dimensions of the array then you need to track the size of that content.
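As a sketch of what tracking the size can look like, here is one possible layout that stores the length next to the buffer. The struct field type, its 8-byte capacity, and both function names are assumptions made up for this example.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical record: a fixed buffer plus the number of bytes in use.
   No terminator is stored, so all 8 bytes can hold content. */
struct field {
    char   data[8];
    size_t len;
};

/* Stores up to sizeof f->data bytes from src and returns how many
   bytes were actually stored. */
size_t field_set(struct field *f, const char *src, size_t n)
{
    if (n > sizeof f->data)
        n = sizeof f->data;        /* truncate rather than overflow */
    memcpy(f->data, src, n);
    f->len = n;
    return n;
}

/* Compares by length and content, never by searching for a 0 byte. */
int field_equals(const struct field *f, const char *src, size_t n)
{
    return f->len == n && memcmp(f->data, src, n) == 0;
}
```

Every function that touches data takes its bounds from len, which is why the buffer can be filled completely without leaving room for a sentinel.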

Answers here seem to have degenerated into a string discussion. Not all arrays of char are strings. However it is a very strong convention to use a null terminator as a sentinel if they are to be handled as de facto strings.

Your array may use something else, and may also have separators and zones. After all, it may be part of a union or overlay a structure, or it may be a staging area for another system.

Grenade answered 16/9, 2021 at 7:51 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.