Why do (only) some compilers use the same address for identical string literals?

https://godbolt.org/z/cyBiWY

I can see two "some" literals in the assembly generated by MSVC, but only one with clang and gcc. As a result, the same code behaves differently at run time depending on the compiler.

static const char *A = "some";
static const char *B = "some";

void f() {
    if (A == B) {
        throw "Hello, string merging!";
    }
}

Can anyone explain the differences and similarities between those compilation outputs? Why do clang and gcc optimize something even when no optimizations are requested? Is this some kind of undefined behaviour?

I also notice that if I change the declarations to those shown below, clang/gcc/msvc do not leave any "some" in the assembler code at all. Why is the behaviour different?

static const char A[] = "some";
static const char B[] = "some";
Grosvenor answered 15/10, 2018 at 10:17 Comment(8)
https://mcmap.net/q/15930/-is-storage-for-the-same-content-string-literals-guaranteed-to-be-the-same Some nice relevant answer to a closely related question, with standard quotes. – Vienne
@Vienne I discuss compiler flags that affect this here – Integrator
For MSVC, the /GF compiler option controls this behavior. See learn.microsoft.com/en-us/cpp/build/reference/… – Berlinda
FYI, this can happen for functions too. – Recalcitrant
How Do C++ Compilers Merge Identical String Literals – Roundelay
Also works for "some string" and "string" on some compilers. – Yeoman
Interestingly, the behaviour seems to be the same with GCC across several different optimisation levels! – Borchert
Possible duplicate of Is storage for the same content string literals guaranteed to be the same? – Leucocratic

This is not undefined behavior, but unspecified behavior. For string literals,

The compiler is allowed, but not required, to combine storage for equal or overlapping string literals. That means that identical string literals may or may not compare equal when compared by pointer.

That means the result of A == B may be either true or false, so you shouldn't depend on it.

From the standard, [lex.string]/16:

Whether all string literals are distinct (that is, are stored in nonoverlapping objects) and whether successive evaluations of a string-literal yield the same or a different object is unspecified.
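
As a rough illustration of what the quoted wording permits, the sketch below separates the two comparisons whose result is unspecified from one that is always true. The function names are hypothetical and chosen only for this example; which of the first two functions returns true depends on the compiler and its flags.

```cpp
// Hypothetical helper names, for illustration only.

// Unspecified result: true only if the compiler pools identical literals.
bool identical_literals_share_storage() {
    const char *a = "some";
    const char *b = "some";
    return a == b;
}

// Unspecified result: true only if the compiler stores "string"
// as the tail of "some string" (overlapping storage).
bool overlapping_literals_share_storage() {
    const char *whole = "some string";
    const char *tail = "string";
    return whole + 5 == tail;  // "string" starts at index 5 of "some string"
}

// Always true: q is a copy of p, so both hold the same address.
bool copied_pointer_compares_equal() {
    const char *p = "some";
    const char *q = p;
    return p == q;
}
```

A conforming program has to work whichever value the first two functions return; only the third comparison can be relied upon.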

Machiavellian answered 15/10, 2018 at 10:20 Comment(1)
For the meaning of undefined behavior vs unspecified behavior see here: Undefined, unspecified and implementation-defined behavior – Schlegel

The other answers explained why you cannot expect the pointer addresses to be different. Yet you can easily rewrite this in a way that guarantees that A and B don't compare equal:

static const char A[] = "same";
static const char B[] = "same";// but different

void f() {
    if (A == B) {
        throw "Hello, string merging!";
    }
}

The difference being that A and B are now arrays of characters. This means that they aren't pointers and their addresses have to be distinct just like those of two integer variables would have to be. C++ confuses this because it makes pointers and arrays seem interchangeable (operator* and operator[] seem to behave the same), but they are really different. E.g. something like const char *A = "foo"; A++; is perfectly legal, but const char A[] = "bar"; A++; isn't.

One way to think about the difference is that const char A[] = "..." says "give me a block of memory and fill it with the characters ... followed by \0", whereas const char *A = "..." says "give me an address at which I can find the characters ... followed by \0".
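
A minimal sketch of that distinction, assuming nothing beyond standard C++11: the helper names are made up for illustration, and the comparison uses &A[0] and &B[0] so that it is explicitly a comparison of element addresses.

```cpp
// Two separate array objects, each initialized with its own copy of the characters.
static const char A[] = "same";
static const char B[] = "same";

// The array size is part of the type: four characters plus the terminating '\0'.
static_assert(sizeof(A) == 5, "array holds the characters and the terminator");

// Always true: A and B are distinct objects, so their addresses differ.
bool arrays_have_distinct_addresses() {
    return &A[0] != &B[0];
}

// A pointer can be advanced; an array cannot (A++ would not compile).
char second_char_via_pointer() {
    const char *p = "foo";
    ++p;        // p now points at the second character
    return *p;  // 'o'
}
```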

Monandrous answered 16/10, 2018 at 2:12 Comment(5)
This would be an even better answer if you could explain why it's different. – Soubriquet
Note that *p and p[0] not only "seem to behave the same" but by definition are identical (provided that p+0 == p is an identity relation because 0 is the neutral element in pointer-integer addition). After all, p[i] is defined as *(p+i). The answer makes a good point though. – Wilbertwilborn
typeof(*p) and typeof(p[0]) are both char so there's really not much left that could be different. I do agree that 'seem to behave the same' is not the best wording, because the semantics are so different. Your post reminded me of the best way to access elements of C++ arrays: 0[p], 1[p], 2[p] etc. This is how the pros do it, at least when they want to confuse people who were born after the C programming language. – Monandrous
Related: Why do I get a segmentation fault when writing to a string initialized with “char *s” but not “char s[]”? – Jaundice
This is interesting, and I was tempted to add a link to the C FAQ, but I realized that there are lots of related questions, but none seem to cut right to the point of this question here. – Monandrous

Whether or not a compiler chooses to use the same string location for A and B is up to the implementation. Formally you can say that the behaviour of your code is unspecified.

Both choices implement the C++ standard correctly.

Ardel answered 15/10, 2018 at 10:20 Comment(1)
The behavior of the code is to either throw an exception or do nothing, chosen in an unspecified fashion before the code is first executed. That doesn't mean the behavior as a whole is unspecified, merely that the compiler can select either behavior in any manner it sees fit prior to the first time the behavior is observed. – Haymo

This is a space-saving optimization, often called "string pooling". Here are the docs for MSVC:

https://msdn.microsoft.com/en-us/library/s0s0asdt.aspx

Therefore, if you add /GF to the command line, you should see MSVC merge the two literals just as clang and gcc do.

By the way, you probably shouldn't be comparing strings via pointers like that; any decent static analysis tool will flag that code as defective. You need to compare what they point to, not the actual pointer values.
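
One possible way to compare the contents rather than the addresses is std::strcmp, sketched below. The helper name same_text is hypothetical and exists only for this example.

```cpp
#include <cstring>

// Hypothetical helper: true when both C strings hold the same characters,
// regardless of where those characters are stored.
bool same_text(const char *lhs, const char *rhs) {
    return std::strcmp(lhs, rhs) == 0;
}
```

With this, same_text(A, B) gives the same answer for the A and B from the question whether or not the compiler pools the literals.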

Misfire answered 16/10, 2018 at 20:31 Comment(0)
