Can a string literal in C be modified?
I recently had a question. I know that a string literal pointed to as in the code below is placed in the .rodata section, and that this region is read-only. However, I saw in the C11 standard that writing to this memory address is undefined behavior. I was aware that Borland's Turbo-C compiler can write where the pointer points. Is this because the processor operated in real mode on some systems of the time, such as MS-DOS? Or is it independent of the operating mode of the processor? Is there any other compiler that writes to the pointer and does not take any memory breach failure using the processor in protected mode?

#include <stdio.h>

int main(void) {
    char *st = "aaa";  /* st points at the string literal's storage */
    *st = 'b';         /* undefined behavior: writes into the literal */
    return 0;
}

When this code is compiled with Turbo-C on MS-DOS, the write to memory succeeds.

Passible answered 24/6, 2019 at 20:59 Comment(10)
You should never do this. On some systems, the string "aaa" is placed in read-only memory, which will result in a run-time error if you try to modify it. Also, if you have another instance of the string "aaa" in the same compilation unit, they might share the same storage, in which case changing one will change the other.Hankering
It is a function of the system you are running on, and, if it's an embedded system and the code and string literal are in ROM, it will be physically impossible to write to the string literal, never mind any memory protection.Markettamarkey
What is the point of the question? The behavior is undefined. What does it matter if there are compilers with which it consistently manifests one way vs. others?Ashmead
@JohnBollinger I want to know if this has to do with memory protection of the modern x86 and Intel® 64 processors? Or if this is indifferent. Is there a compiler that writes to the memory address in protected mode?Passible
I don't follow, @Yuri. The behavior is undefined. Compilers are not required to ensure that such write attempts fail, on any platform, historic or modern. Those that do so are taking advantage of the behavior's undefinedness, as failing hard and fast in such situations is usually the safest thing to do. They implement it however the host system permits, and certainly such implementations are not limited to Intel-based systems, though Intel chips do offer mechanisms to achieve it.Ashmead
On the x86 platform in protected mode, the compiler may or may not protect the memory area that the pointer st points to, right?Passible
Although this behavior is undefined by the C standard, some compilers may define the behavior, in which case you can rely on it, if you are using such a compiler. One example of such a compiler is gcc prior to version 4.0 with the -fwritable-strings option. The Turbo C compiler may also define this behavior, but I'm not sure. Note that just because it doesn't fault doesn't mean that the behavior is defined. You would have to read the compiler documentation.Feeling
Thanks for the comment, I went searching for this flag, including the gcc 4.0.4 manual and found nothing about it :(Passible
Turbo-C predated the official release of ANSI-C, however they did implement most of ANSI-C. K&R C (which pre-dated ANSI-C) didn't say that writing to string literals wasn't allowed (there was no concept of const in K&R C). By default Turbo-C didn't merge duplicate string literals, which allowed them to be written to without clobbering the string literal for someone else (this allowed code written for K&R C to work as expected). You could turn duplicate merging on with the -d option, at which point you don't want to be writing to them and they should be considered pointers to constant data.Heinrick
Turbo-C's implementation of C was somewhere between K&R C and ANSI-C.Heinrick
M
7

Is there any other compiler that writes to the pointer and does not take any memory breach failure using the processor in protected mode?

GCC 3 and earlier used to support gcc -fwritable-strings to let you compile old K&R C where this was apparently legal, according to https://gcc.gnu.org/onlinedocs/gcc-3.3.6/gcc/Incompatibilities.html. (It's undefined behaviour in ISO C and thus a bug in an ISO C program.) With that option, GCC defines the behaviour of the assignment which ISO C leaves undefined.

GCC 3.3.6 manual - C Dialect options

-fwritable-strings
Store string constants in the writable data segment and don't uniquize them. This is for compatibility with old programs which assume they can write into string constants.

Writing into string constants is a very bad idea; “constants” should be constant.

GCC 4.0 removed that option (release notes); the last release in the GCC 3 series was GCC 3.4.6 in March 2006, although the option had apparently become buggy by that version.

gcc -fwritable-strings would treat string literals like non-const anonymous character arrays (see @gnasher's answer), so they go in the .data section instead of .rodata, and thus get linked into a segment of the executable that's mapped to read+write pages, not read-only. (Executable segments have basically nothing to do with x86 segmentation, it's just a start+range memory-mapping from the executable file to memory.)

And it would disable duplicate-string merging, so char *foo() { return "hello"; } and char *bar() { return "hello"; } would return different pointer values, instead of merging identical string literals.


Linker option: still Undefined Behaviour so probably not viable

On GNU/Linux, linking with ld -N (--omagic) will make the text (as well as data) section read+write. This may apply to .rodata even though modern GNU Binutils ld puts .rodata in its own section (normally with read but not exec permission) instead of making it part of .text. Having .text writeable could easily be a security problem: you never want a page with write+exec at the same time, otherwise some bugs like buffer overflows can turn into code-injection attacks.

To do this from gcc, use gcc -Wl,-N to pass on that option to ld when linking.

This doesn't do anything about it being Undefined Behaviour to write const objects. e.g. the compiler will still merge duplicate strings, so writing into one char *foo = "hello"; will affect all other uses of "hello" in the whole program, even across files.


What to use instead:

If you want something writeable, use static char foo[] = "hello"; where the quoted string is just an array initializer for a non-const array. As a bonus, this is more efficient than static char *foo = "hello"; at global scope, because there's one fewer level of indirection to get to the data: it's just an array instead of a pointer stored in memory.

Mercuri answered 27/6, 2019 at 6:57 Comment(0)
H
9

As has been pointed out, trying to modify a constant string in C results in undefined behavior. There are several reasons for this.

One reason is that the string may be placed in read-only memory. This allows it to be shared across multiple instances of the same program, and doesn't require the memory to be saved to disk if the page it's on is paged out (since the page is read-only and thus can be reloaded later from the executable). It also helps detect run-time errors by giving an error (e.g. a segmentation fault) if an attempt is made to modify it.

Another reason is that the string may be shared. Many compilers (e.g., gcc) will notice when the same literal string appears more than once in a compilation unit, and will share the same storage for it. So if a program modifies one instance, it could affect others as well.

There is also never a need to do this, since the same intended effect can easily be achieved by using a static character array. For instance:

#include <stdio.h>

int main(void) {
    static char st_arr[] = "aaa";
    char *st = st_arr;
    *st = 'b'; 
    return 0;
}

This does exactly what the posted code attempted to do, but without any undefined behavior. It also takes the same amount of memory. In this example, the string "aaa" is used as an array initializer, and does not have any storage of its own. The array st_arr takes the place of the constant string from the original example, but (1) it will not be placed in read-only memory, and (2) it will not be shared with any other references to the string. So it's safe to modify it, if in fact that's what you want.

Hankering answered 24/6, 2019 at 22:19 Comment(3)
Thank you for writing this answer. In fact, this is just a curiosity of mine; if I had to change the content the pointer points to, I would do it the way you did.Passible
@YuriAlbuquerque Oops, I left out the static keyword - I just added it. Sorry about that!Hankering
No problems, :)Passible
B
4

You are asking whether or not the platform may cause undefined behavior to be defined. The answer to that question is yes.

But you are also asking whether or not the platform defines this behavior. In fact it does not.

Under some optimization settings, the compiler will merge string constants, so that writing to one constant will also write to the other uses of that constant. I used this compiler once; it was quite capable of merging strings.

Don't write this code. It's not good. You will regret writing code in this style when you move onto a more modern platform.

Bogoch answered 24/6, 2019 at 21:9 Comment(0)
F
3

Your literal "aaa" produces a static array of four characters 'a', 'a', 'a', '\0' in an anonymous location and evaluates to a pointer to the first 'a'. In C the array has type char[4] rather than const char[4], but it must be treated as if it were const.

Trying to modify any of the four characters is undefined behaviour. Undefined behaviour can do anything, from modifying the char as intended, to pretending to modify the char, to doing nothing, to crashing.

It's basically the same as static const char anonymous[4] = { 'a', 'a', 'a', '\0' }; char* st = (char*) &anonymous [0];

Fauch answered 24/6, 2019 at 21:7 Comment(0)
A
3

To add to the correct answers above, DOS runs in real mode, so there is no read-only memory protection. All memory is writable. Hence, writing to the literal worked in practice (as did writing to any const variable) at the time, even though the C standard leaves the behavior undefined.

Allbee answered 24/6, 2019 at 22:19 Comment(3)
Yes, that's one of the reasons I wanted to know. For example on the intel 8086 processor, I would not be able to have the compiler protect the memory region, correct?Passible
Yes, real mode cannot protect memory.Allbee
Real mode code running in v8086 mode (which was how EMM386 and other expanded memory managers operated to emulate expanded memory) could in theory have allowed a run-time environment (with something like VCPI or DPMI) to mark pages of memory read-only. DOS multitaskers would, for example, mark video memory read-only so that writes could be trapped, allowing screen access to be virtualized (this of course was transparent to DOS).Heinrick

© 2022 - 2024 — McMap. All rights reserved.