Storage of String Literals in memory c++

I read that string literals are always stored in read only memory and it makes sense as to why.

However if I initialize a character array using a string literal, it still stores the string literal in read only memory and then copies it into the memory location of the character array.

My question is, in this scenario, why bother storing the string literal in read only memory in the first place, why not directly store it in the memory location of character array.

Tenorrhaphy answered 12/6, 2023 at 12:1 Comment(10)
Because literal pooling. - Eggers
Why? You need a source for the initialization of the character array with a string. The char array might be a local variable in a function, so you must initialize it each time the function is called. - Pulverulent
I understand you need a source to initialize a character array. However, how do we initialize local variables that store primitive types? We do not store the primitive value in the const data segment and then copy it over to the memory location of the variable. So my question is why we don't do the same for string literals in the case mentioned in my question. - Tenorrhaphy
@Eggers aah, that makes some sense. However, the literal pool in C++ is stored in the data segment, yet when I check the compiled assembly I see the string literal stored in the constant data segment. I am a bit confused here. - Tenorrhaphy
In addition to the previous comments, I would also point out that the only way of not storing the characters in read-only memory and initializing a string at run time at a memory location unknown at compile time (i.e. most locations, especially with -fPIC) would be to store the string in instructions' "immediate" (payload) values. But that doesn't really differ too much from storing it in read-only memory. In fact it's almost no different, some low-level technical stuff (i-cache, d-cache) aside. - Optics
I don't think this is entirely accurate. From the view of C++, there is no special read-only memory per se. A string literal has type char const[N] for some positive N. The compiler is free to do with it as it pleases. All that C++ says is, "don't write to this object". - Thrawn
"I read that string literals are always stored in read only memory": there is no such requirement; it is the compiler author's choice. There are scenarios where the stored string can be overwritten because its initial value is not needed again at runtime, like this global mutable variable: char foo[10] = "FooFoo"; - Stout
"I read that string literals are always stored in read only memory": No. It just means that the string literal is read-only and, as mentioned, you are not supposed to change it. Where that string literal is stored is another issue. I remember working with very old compilers (on Windows) where mistakenly changing a string literal's contents didn't produce a crash. When such programs were ported to another OS (Linux or Unix), the damage became visible. - Hyperploid
Does this answer your question? String Literals - Scirrhous
@JanSchultke it does not. My question is: what is the point of storing strings in read-only memory and then copying them to a char array when we can directly store the string in the char array? There is no point in keeping it in the read-only section of memory. It would only make sense if we were using a const char* or a const char array or something similar. - Tenorrhaphy

I read that string literals are always stored in read only memory and it makes sense as to why.

The C++ standard does not say where string literals are stored; that is left to the implementation. If a compiler decides to emit a large string literal, it will usually be located in a read-only section of static memory, such as .rodata.

However, whether this is even necessary is up to the compiler. Compilers are allowed to optimize your code according to the as-if rule, so if the behavior of the program is the same with the literal being stored elsewhere, or nowhere at all, that is also allowed.
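As a rough sketch of the language-level behavior the as-if rule has to preserve: the literal itself has a const array type, while an array initialized from it is an independent, writable copy. The function name array_is_writable_copy is made up for illustration.

```cpp
#include <type_traits>

// A narrow string literal is an lvalue of type const char[N], terminator included.
static_assert(std::is_same<decltype("ab"), const char (&)[3]>::value,
              "\"ab\" has type const char[3]");

// Hypothetical helper: shows that writing to the array does not touch the literal.
bool array_is_writable_copy() {
    const char* literal = "ab";  // points at the literal; writing through it is not allowed
    char arr[] = "ab";           // separate array object initialized from the literal
    arr[0] = 'x';                // fine: arr is an ordinary non-const array
    return arr[0] == 'x' && literal[0] == 'a';
}
```

Whether the bytes behind literal end up in .rodata, in an instruction immediate, or nowhere at all is not something this code can observe, which is exactly why the compiler is free to choose.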

Example 1

int sum() {
    char arr[] = "ab";
    return arr[0] + arr[1];
}

With the following assembly output:

sum():
     mov eax, 195
     ret

In this case, because everything is a compile-time constant, there is no string literal or array at all. The compiler optimized it away and turned our code into return 195; by summing up the two ASCII characters a and b.
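The same arithmetic can be checked at compile time. This is only a sketch and assumes an ASCII-compatible execution character set, where 'a' is 97 and 'b' is 98; the name ab is introduced here for illustration.

```cpp
// The same two characters as in Example 1, evaluated by the compiler itself.
constexpr char ab[] = "ab";

static_assert(sizeof ab == 3, "two characters plus the terminating null");
static_assert(ab[0] + ab[1] == 195, "'a' (97) + 'b' (98) under ASCII");

// The function from Example 1, unchanged.
int sum() {
    char arr[] = "ab";
    return arr[0] + arr[1];
}
```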

Example 2

void consume(const char*);

void short_string() {
    char arr[] = "short str";
    consume(arr);
}

With the following assembly output:

short_string():
        sub     rsp, 24
        movabs  rax, 8391086215229565043
        mov     qword ptr [rsp + 8], rax
        mov     word ptr [rsp + 16], 114
        lea     rdi, [rsp + 8]
        call    consume(char const*)@PLT
        add     rsp, 24
        ret

Once again, no code was emitted that would keep the string in read-only memory, but it also wasn't optimized away completely. The compiler sees that the string short str is very short, so it treats its first eight ASCII bytes as the 64-bit number 8391086215229565043 and stores that directly onto the stack with a mov. The remaining r and the null terminator are stored the same way as the 16-bit value 114. consume() is then called with a pointer to stack memory.
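To see where those two immediates come from, here is a small sketch that packs bytes the way a little-endian load would read them. pack_little_endian is a hypothetical helper, not something the compiler emits, and the numbers assume ASCII.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper: combine up to 8 bytes into one integer, lowest address
// in the lowest-order byte, which is how x86-64 interprets a stored qword.
std::uint64_t pack_little_endian(const char* bytes, std::size_t count) {
    std::uint64_t value = 0;
    for (std::size_t i = 0; i < count; ++i) {
        const auto byte = static_cast<unsigned char>(bytes[i]);
        value |= static_cast<std::uint64_t>(byte) << (8 * i);
    }
    return value;
}
```

With this, pack_little_endian("short str", 8) gives 8391086215229565043 (the bytes of "short st"), and pack_little_endian("short str" + 8, 2) gives 114 (the trailing 'r' followed by the null terminator), matching the movabs and the word store in the assembly.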

Example 3

void long_string() {
    char arr[] = "Lorem ipsum dolor [...] est laborum.";
    consume(arr);
}

With the following assembly output:

long_string():
        push    rbx
        sub     rsp, 448
        lea     rsi, [rip + .L__const.long_string().arr]
        mov     rbx, rsp
        mov     edx, 446
        mov     rdi, rbx
        call    memcpy@PLT
        mov     rdi, rbx
        call    consume(char const*)@PLT
        add     rsp, 448
        pop     rbx
        ret
.L__const.long_string().arr:
        .asciz  "Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum."

Our string is now much too long to be treated as a number or two. The entire string will now be emitted into static memory, most likely .rodata after linking. It is still helpful for it to exist, because we can use memcpy to copy it from static memory onto the stack when initializing arr.
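Written out by hand, what the compiler does in Example 3 is roughly equivalent to the sketch below. The names source_text and copy_matches_source, and the shortened text, are placeholders for illustration; the real output uses the full literal and a compiler-generated symbol.

```cpp
#include <cstring>

// Hypothetical stand-in for the long literal; a const array in static storage
// like this typically ends up in a read-only section.
static const char source_text[] = "a long string the compiler keeps in static storage";

// Hand-written equivalent of `char arr[] = "...";` for a long initializer.
bool copy_matches_source() {
    char arr[sizeof source_text];                       // stack array, terminator included
    std::memcpy(arr, source_text, sizeof source_text);  // one bulk copy from static memory
    arr[0] = 'A';                                       // the copy is writable; the source is untouched
    return source_text[0] == 'a' &&
           std::strcmp(arr + 1, source_text + 1) == 0;
}
```

The static copy is the cheap way to get a fresh, initialized array each time the function is called: a single memcpy instead of one store instruction per few bytes.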

Conclusion

If you're worried about compilers doing something wasteful here, don't be. Modern compilers are very good at optimizing and deciding which symbols go where, and if they emit a string literal, this is usually because it must exist for some other code to work, or because it makes initialization of an array easier.


See live examples with Compiler Explorer

Scirrhous answered 12/6, 2023 at 17:31 Comment(8)
Could you explain how it makes the initialization of an array easier? - Tenorrhaphy
@HumblePenguin in Example 3, the array has to exist on the stack, because we are taking its address when calling consume(arr). Thanks to the fact that the string literal exists in static memory, we sub rsp, 448 to grow the stack by 448 bytes, and then call memcpy to copy the 446 string literal bytes (including the null terminator) onto the stack. If we didn't have this string literal in static memory, we would have to emit A LOT of code to put it there, not just a function call to memcpy. - Scirrhous
I appreciate the answer. It is much more clear now. Thank you! - Tenorrhaphy
Another case worth mentioning is static char long_string[] = "..."; or global variables: they can just live in read+write .data, same as any other non-const array in static storage. Also, none of this is specific to string literals in the source. We'd have the same effect with int foo[] = {1,2,3,4,...,99}; and compilers choosing to memcpy from .rodata or store immediates if the array needs to live on the stack, otherwise having other options. (@HumblePenguin) - Bookbindery
I wanted to use this as a duplicate for x64 Assembly reverse string function (which segfaults because it calls asm_reverse("my string")), but this answer never uses a string literal as anything other than an array initializer, allowing the compiler to optimize away the literal so it doesn't appear in .rodata. (Apparently in technical terms, it is still a string literal when used as an initializer: en.cppreference.com/w/cpp/language/string_literal) - Bookbindery
String literals: Where do they go? looks like a better duplicate for the kind of question I was looking at. - Bookbindery
@PeterCordes did you consider example 3? I'm pretty sure that the string literal in that case would still end up in .rodata. In the first two examples, it does get optimized into individual mov instructions straight onto the stack, or simplified away entirely. - Scirrhous
That's true, .L__const.long_string().arr: is the actual string literal in .rodata. But the code isn't passing the address of the string literal to any other function, so consume() wouldn't segfault if it wrote to the array you pass it, unlike consume("hello"). Beginners who don't understand that distinction won't get it from this answer. Which is fine if that's not what this old question is about; it just came up in my search results ahead of ones that do teach the difference between char *p = "..." vs. char arr[] = "..."; perhaps because I'd looked at it more recently and upvoted. - Bookbindery