C's strtok() and read-only string literals
char *strtok(char *s1, const char *s2)

Repeated calls to this function break string s1 into "tokens"; that is, the string is broken into substrings, each terminated by a '\0', where the '\0' overwrites the first separator character (any character contained in string s2) that follows the token. The first call uses the string to be tokenized as s1; subsequent calls use NULL as the first argument. A pointer to the beginning of the current token is returned; NULL is returned if there are no more tokens.

Hi,

I have been trying to use strtok just now and found out that if I pass in a char* into s1, I get a segmentation fault. If I pass in a char[], strtok works fine.

Why is this?

I googled around and the reason seems to be something about how char* is read-only and char[] is writable. A more thorough explanation would be much appreciated.

Reformation answered 7/11, 2008 at 17:32 Comment(1)
So, in the char* version, the pointer points to read-only memory. In the char[] version, the array variable is in read-write memory, and the initializer code in the C startup copies the string literal into the array.Threaten
G
16

What did you initialize the char * to?

If something like

char *text = "foobar";

then you have a pointer to some read-only characters.

For

char text[7] = "foobar";

then you have a seven element array of characters that you can do what you like with.

strtok writes into the string you give it: it overwrites the separator character with a null character ('\0') and keeps a pointer to the rest of the string.

Hence, if you pass it a read-only string, it will attempt to write to it, and you typically get a segfault (formally, modifying a string literal is undefined behavior).
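
As a rough illustration of that in-place writing, here is a minimal sketch. The helper name split_in_place and its parameters are made up for this example; only strtok itself is a real library function.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: splits buf in place on any character in delims and
 * stores up to max_tokens token pointers in out. Returns the number stored.
 * buf must be writable, because strtok overwrites separators with '\0'. */
size_t split_in_place(char *buf, const char *delims, char **out, size_t max_tokens)
{
    size_t n = 0;
    assert(buf != NULL && delims != NULL && out != NULL);

    for (char *tok = strtok(buf, delims); tok != NULL && n < max_tokens;
         tok = strtok(NULL, delims)) {
        out[n++] = tok;   /* each token points into buf itself */
    }
    return n;
}
```

Called on an array such as char text[] = "foo,bar,baz"; this leaves text holding "foo\0bar\0baz", with each comma overwritten. Passing a char * that points at a string literal instead would be undefined behavior.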

Also, because strtok keeps a reference to the rest of the string, it's not reentrant: you can use it on only one string at a time. It's best avoided, really. Consider strsep(3) instead; see, for example, http://www.rt.com/man/strsep.3.html (although that still writes into the string, so it has the same read-only/segfault issue).
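
To make the reentrancy point concrete, here is a sketch of a tokenizer that keeps its position in a caller-supplied pointer instead of hidden internal state. The function next_token is hypothetical (it is not strtok, strsep, or any other library routine) and is built only on the standard strspn and strcspn. Like strtok, it still writes into the string.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not a library function. *state must point at the
 * writable string before the first call; it is advanced on every call and
 * set to NULL once the string is exhausted. */
char *next_token(char **state, const char *delims)
{
    assert(state != NULL && delims != NULL);
    if (*state == NULL)
        return NULL;

    char *start = *state + strspn(*state, delims);  /* skip leading separators */
    if (*start == '\0') {
        *state = NULL;
        return NULL;
    }

    char *end = start + strcspn(start, delims);     /* find the end of the token */
    if (*end == '\0') {
        *state = NULL;        /* last token: nothing left to scan */
    } else {
        *end = '\0';          /* still writes into the string */
        *state = end + 1;
    }
    return start;
}
```

Because each string carries its own state pointer, two strings can be tokenized in an interleaved fashion, which plain strtok cannot do.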

Glynis answered 7/11, 2008 at 17:40 Comment(7)
Sorry if this sounds stupid, but what stops us from saying *(text+3) = 'a' in the char *text = "foobar"; version?Reformation
I just tried it. Nothing stops you from doing it except for a segfault because "text + 3" still refers to read-only memory.Threaten
@Paul: strsep is a poor replacement for strtok; it suffers from many of the same problems that strtok does, namely it modifies the string and doesn't work with string literals.Logo
Robert, yes, strsep's poor too. Suggested alternative?Glynis
@Paul: a simple solution is to make a copy of the string first to ensure you have a modifiable version of the string and work with that.Isobar
@Evan, well, yes :) I was really after whether there are any standard-ish library functions to help the OPGlynis
@JJohn and @Gilbert: the memory text points to is not read only. That's hogwash. You misunderstand pointer arithmetic. text+3 isn't the third character in the string, but a pointer well past the end (3 * sizeof(char*) bytes in fact). What you want is text[3]Voltmer
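
The pointer arithmetic debated in the comments above can be checked directly. In C, adding n to a char * advances it by n * sizeof(char) bytes, and sizeof(char) is 1 by definition, so text + 3 and &text[3] are the same address. The helper name same_element is made up for this sketch, and it only reads through the pointer, so it is safe to call on a string literal.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: returns 1 if pointer arithmetic and array indexing
 * reach the same element of text. It only reads, never writes. */
int same_element(const char *text, size_t i)
{
    assert(text != NULL);
    return (text + i == &text[i]) && (*(text + i) == text[i]);
}
```

For text pointing at "foobar", both *(text + 3) and text[3] read the character 'b'. Writing through either form is what triggers the fault when text points at a literal.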
O
6

An important point that's inferred but not stated explicitly:

Based on your question, I'm guessing that you're fairly new to programming in C, so I'd like to explain a little more about your situation. Forgive me if I'm mistaken; C can be hard to learn, mostly because of subtle misunderstandings of the underlying mechanisms, so I like to make things as plain as possible.

As you know, when you write out your C program the compiler pre-creates everything for you based on the syntax. When you declare a variable anywhere in your code, e.g.:

int x = 0;

The compiler reads this line of text and says to itself: OK, I need to replace all occurrences in the current code scope of x with a constant reference to a region of memory I've allocated to hold an integer.

When your program is run, this line leads to a new action: I need to set the region of memory that x references to int value 0.

Note the subtle difference here: the memory location that reference point x holds is constant (and cannot be changed). However, the value that x points to can be changed. You do it in your code through assignment, e.g. x = 15;. Also note that the single line of code actually amounts to two separate commands to the compiler.

When you have a statement like:

char *name = "Tom";

The compiler's process is like this: OK, I need to replace all occurrences in the current code scope of name with a constant reference to a region of memory I've allocated to hold a char pointer value. And it does so.

But there's that second step, which amounts to this: I need to create a constant array of characters which holds the values 'T', 'o', 'm', and NULL. Then I need to replace the part of the code which says "Tom" with the memory address of that constant string.

When your program is run, the final step occurs: setting the pointer to char's value (which isn't constant) to the memory address of that automatically created string (which is constant).

So a char * is not read-only. Only the characters behind a const char * are declared read-only. But your problem in this case isn't that char *s are read-only; it's that your pointer references a read-only region of memory.
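
The distinction between the pointer and what it points to can be sketched as follows. The function name pointer_vs_array_demo is invented for this illustration, and declaring the pointer as const char * is a choice made here so that the compiler rejects accidental writes to the literal.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical demo: returns 1 if each step behaves as described. */
int pointer_vs_array_demo(void)
{
    const char *name = "Tom";   /* the pointer is writable; the literal is not */
    name = "Ann";               /* fine: only the pointer value changes */
    assert(strcmp(name, "Ann") == 0);

    char copy[] = "Tom";        /* array initialized with a copy of the literal */
    copy[0] = 'R';              /* fine: the array is the program's own storage */

    return strcmp(copy, "Rom") == 0;
}
```

An assignment like name[0] = 'R' would not compile here because of the const qualifier. With a plain char *name it would compile, and the write to the literal would be undefined behavior.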

I bring all this up because understanding this issue is the barrier between you looking at the definition of that function from the library and understanding the issue yourself versus having to ask us. And I've somewhat simplified some of the details in the hopes of making the issue more understandable.

I hope this was helpful. ;)

Osmo answered 8/11, 2008 at 19:22 Comment(1)
NULL (the null pointer) is different from NUL (ASCII 0). The situation is confusing enough, but since the C macro is NULL with 2 L's, it's best (in my opinion) to refer to ASCII 0 as NUL (or "the null character").Alpinist
S
2

I blame the C standard.

char *s = "abc";

could have been defined to give the same error as

const char *cs = "abc";
char *s = cs;

on grounds that string literals are unmodifiable. But it wasn't, it was defined to compile. Go figure. [Edit: Mike B has gone figured - "const" didn't exist at all in K&R C. ISO C, plus every version of C and C++ since, has wanted to be backward-compatible. So it has to be valid.]

If it had been defined to give an error, then you couldn't have got as far as the segfault, because strtok's first parameter is char*, so the compiler would have prevented you passing in the pointer generated from the literal.

It may be of interest that there was at one time a plan in C++ for this to be deprecated (http://www.open-std.org/jtc1/sc22/wg21/docs/papers/1996/N0896.asc). But 12 years later I can't persuade either gcc or g++ to give me any kind of warning for assigning a literal to non-const char*, so it isn't all that loudly deprecated.

[Edit: aha: -Wwrite-strings, which isn't included in -Wall or -Wextra]
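
Independently of warning flags, one defensive pattern is to declare pointers to literals as const char * yourself and tokenize a private copy, as suggested in an earlier comment. The sketch below assumes that approach; writable_copy and count_tokens are hypothetical names, not library functions.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: returns a heap-allocated, writable copy of s,
 * or NULL on allocation failure. The caller must free it. */
char *writable_copy(const char *s)
{
    assert(s != NULL);
    size_t len = strlen(s) + 1;          /* include the terminating '\0' */
    char *copy = malloc(len);
    if (copy != NULL)
        memcpy(copy, s, len);
    return copy;
}

/* Hypothetical helper: counts the tokens in a read-only string by
 * tokenizing a private copy. Returns -1 if the copy cannot be allocated. */
int count_tokens(const char *s, const char *delims)
{
    char *copy = writable_copy(s);
    if (copy == NULL)
        return -1;

    int count = 0;
    for (char *tok = strtok(copy, delims); tok != NULL;
         tok = strtok(NULL, delims)) {
        count++;
    }
    free(copy);
    return count;
}
```

With the parameter typed as const char *, passing s straight to strtok should draw a diagnostic about discarding the qualifier, which is the compile-time protection the answer wishes char *s = "abc"; had provided.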

Shaffer answered 7/11, 2008 at 19:34 Comment(2)
The const keyword was not in K&R C. Having millions (billions?) of existing lines of char* s = "abc"; suddenly be invalid would have surely slowed down (if not stopped) adoption of ANSI/ISO C. Even trying to change it today would face similar opposition (as it appears you're finding).Protection
That explains the historical reasons, thanks. I guess my complaint then is that modern C and C++ compilers should be warning, perhaps at a very high warn level initially. I don't mind a few warnings from K&R era code, but it would be nice if new code was discouraged from doing it.Shaffer
F
0

In brief:

char *s = "HAPPY DAY";
printf("\n %s ", s);

s = "NEW YEAR"; /* Valid */
printf("\n %s ", s);

s[0] = 'c'; /* Invalid: compiles, but writing to a string literal is undefined behavior */
Fulmar answered 21/2, 2009 at 1:21 Comment(0)
S
0

If you look at your compiler documentation, odds are there is an option you can set to make those strings writable.

Sinhalese answered 21/2, 2009 at 1:45 Comment(0)
