Is strtok broken? Or just tricky?
Asked Answered
F

1

-2

Is strtok hopelessly broken?

On many StackOverflow questions about text-parsing in C, someone will suggest using strtok, and one common reply is that strtok should never be used, that it is hopelessly broken.

Some posters have claimed that strtok's problems are limited to multi-threading issues, and it is safe in a single-threaded environment.

What is the right answer?
Does it work?
Is it hopelessly broken?
Can you back up your answer with examples?

Fugger answered 18/2, 2015 at 16:7 Comment(9)
it's not "broken". It has a well-defined behavior. In certain cases, that behavior will cause issues, but those cases are not what the function was designed for. A more accurate statement, is that strtok is often used wrong.Brilliant
@Sander: See the example below. Every bit of code is written properly and safely, and strtok still leads to Undefined Behavior.Fugger
no, your code is invalid. You're treating strtok as if it were re-entrant. But it isn't.Brilliant
@Sander: Re-entrancy is when a function is invoked when another call already exists on the callstack (similar to a recursive call). In my code, there is only ever one call to strtok on the stack at a time. This is not a re-entrancy problem.Fugger
Subsequent calls to strtok for the same string are generally treated as if they're one call to strtok that gets interrupted, and re-entered later (because the two are conceptually the same). But if you prefer, I'll phrase my earlier comment differently : the same reason why strtok is not re-entrant (it keeps global state), is the reason why your code is invalid.Brilliant
If you want to think of it like that, then you can put the blame on my code instead of strtok. But that is not the semantics of functions in C. Grouping multiple separate calls to strtok together, and treating them as a single-unit is NOT how the C-language works. Imagine if you could not call a function like printf without also tracking where other calls to printf occur, because they might interfere with each other? With strtok, you need to not only track your own use of the function, but any other use anywhere in the program!Fugger
Assume that the function GetSentenceStats was written by the Sentence-team: They wrote unit-tests, they verified its behavior, and delivered their finished product to another team. The Paragraph-team does not know that the Sentence-team used strtok. The Paragraph-team wrote their function to the specification, and called the Sentence-function properly according to the spec. Every team followed their obligations, and the program still does not work.Fugger
the point I was making, is that claiming that strtok is broken because you can't use it in a way that it was never designed for, makes no sense. Sure, strtok can only be safely used under certain well-defined conditions (which is why they added strtok_s later). But under those well-defined conditions, strtok is good at what it does, and is not broken.Brilliant
Let us continue this discussion in chat.Fugger
F
2

Yes, strtok is hopelessly broken, even in a simple single-threaded program, and I will demonstrate this failure with some sample code:

Let us begin with a simple text-analyzer function to gather statistics about sentences of text, using strtok. This code will lead to undefined behavior.

In this example, a sentence is a set of words separated by spaces, commas, semi-colons, and periods.

// Example:
//     int words, longest;
//     GetSentenceStats("There were a king with a large jaw and a queen with a plain face, on the throne of England.", &words, &longest);
// will report there are 20 words, and the longest word has 7 characters ("England").
void GetSentenceStats(const char* sentence, int* pWordCount, int* pMaxWordLen)
{
    char* delims = " ,;.";           // In a sentence, words are separated by spaces, commas, semi-colons or period.
    char* input = strdup(sentence);  // Make an local copy of the sentence, to be modified without affecting the caller.

    *pWordCount = 0;                 // Initialize the output to Zero
    *pMaxWordLen = 0;

    char* word = strtok(input, delims);
    while(word)
    {
        (*pWordCount)++;
        *pMaxWordLen = MAX(*pMaxWordLen, (int)strlen(word));
        word = strtok(NULL, delims);
    }
    free(input);
}

This simple function works. There are no bugs so far.


Now let us augment our library to add a function that gathers stats on Paragraphs of text.
A paragraph is a set of sentences separated by Exclamation Marks, Question Marks and Periods.

It will return the number of sentences in the paragraph, and the number of words in the longest sentence.
And perhaps most importantly, it will use the earlier function GetSentenceStats to help

void GetParagraphStats(const char* paragraph, int* pSentenceCount, int* pMaxWords)
{
    char* delims = ".!?";             // Sentences in a paragraph are separated by Period, Question-Mark, and Exclamation.
    char* input = strdup(paragraph);  // Make an local copy of the paragraph, to be modified without affecting the caller.

    *pSentenceCount = 0;
    *pMaxWords = 0;
    char* sentence = strtok(input, delims);
    while(sentence)
    {
        (*pSentenceCount)++;

        int wordCount;
        int longestWord;
        GetSentenceStats(sentence, &wordCount, &longestWord);
        *pMaxWords = MAX(*pMaxWords, wordCount);
        sentence = strtok(NULL, delims);    // This line returns garbage data, 
    }
    free(input);
}

This function also looks very simple and straightforward.
But it does not work, as demonstrated by this sample program.

int main(void)
{
    int cnt;
    int len;

    // First demonstrate that the SentenceStats function works properly:
    char *sentence = "There were a king with a large jaw and a queen with a plain face, on the throne of England."; 
    GetSentenceStats(sentence, &cnt, &len);
    printf("Word Count: %d\nLongest Word: %d\n", cnt, len);
    // Correct Answer:
    // Word Count: 20
    // Longest Word: 7   ("England")


    printf("\n\nAt this point, expected output is 20/7.\nEverything is working fine\n\n");

    char paragraph[] =  "It was the best of times!"   // Literary purists will note I have changed Dicken's original text to make a better example
                        "It was the worst of times?"
                        "It was the age of wisdom."
                        "It was the age of foolishness."
                        "We were all going direct to Heaven!";
    int sentenceCount;
    int maxWords;
    GetParagraphStats(paragraph, &sentenceCount, &maxWords);
    printf("Sentence Count: %d\nLongest Sentence: %d\n", sentenceCount, maxWords);
    // Correct Answer:
    // Sentence Count: 5
    // Longest Sentence: 7  ("We were all going direct to Heaven")

    printf("\n\nAt the end, expected output is 5/7.\nBut Actual Output is Undefined Behavior! Strtok is hopelessly broken\n");
    _getch();
    return 0;
}

All calls to strtok are entirely correct, and are on separate data.
But the result is Undefined Behavior!

Why does this happen?
When GetParagraphStats is called, it begins a strtok-loop to get sentences. On the first sentence it will call GetSentenceStats. GetSentenceStats will also being a strtok-loop, losing all state established by GetParagraphStats. When GetSentenceStats returns, the caller (GetParagraphStats) will call strtok(NULL) again to get the next sentence.

But strtok will think this is a call to continue the previous operation, and will continue tokenizing memory that has now been freed! The result is the dreaded Undefined Behavior.

When is it safe to use strtok?
Even in a single-threaded environment, strtok can only be used safely when the programmer/architect is sure of two conditions:

  • The function using strtok must never call any function that may also use strtok.
    If it calls a subroutine that also uses strtok, its own use of strtok may be interrupted.

  • The function using strtok must never be called by any function that may also use strtok.
    If this function ever called by another routine using strtok, then this function will interrupt the callers use of strtok.

In a multi-threaded environment, use of strtok is even more impossible, because the programmer needs to be sure that there is only one use of strtok on the current thread, and also, no other threads are using strtok either.

Fugger answered 18/2, 2015 at 16:7 Comment(15)
So strtok uses global state? It really sounds like a singleton object.Slipknot
@Slipknot AFAIK not global, it's a static variable in the function.Desiccator
That is why we have strtok_s(), so you can work with different strings at the same time.Tritium
@WeatherVane : since C11. On older version, you need to explicitely feed the input sentence : sentence = strtok(sentence + (int)strlen(sentence) + 1, delims);Schnapp
Why did this get downvotes? Is any of it wrong? Looks resonable to me, (unlike strtok static state - ugh!).Dalmatic
@georgesl since MSVC 9.0 is free, and contains the "safer" version of many functions, there is no point using an older version.Tritium
@MartinJames: Thank you. :) All I can say is "haters gonna hate". I put a lot of thought and time into making that example as clear and precise as I could.Fugger
@WeatherVane: On linux, you need to use strtok_r, which is in Posix but not in C11. strtok_s is in the optional Annex K of C11, which glibc doesn't currently implement. So it's not quite that simple.Novellanovello
Strtok is a table saw with all the guards and safeties removed. It's dangerous and something to avoid, but it has no incompetence or malicious intent. The people who use strtok, however, I tend to judge much more harshly.Territorialism
@KennetBelenky perhaps we should avoid C altogether then. strtok() is not the only function to use static data internal to the function. A call to ctime() destroys the results of any previous call to itself and to asctime() and localtime(). But this was an interesting question and self-answer: upvoted.Tritium
This whole discussion seems to have taken on the feel of a religious war. Any technically relevant statements against your idea (eg. here, and arguments offered by @Sander De Dycker above, ) have been met with your opinions on the implementation of strtok(). If you have such strong feelings about the implementation, can you not bring it up to the C standards committee?Amylase
@WeatherVane I made my comment about judging people because the chain of events that culminates with a person using strtok probably contains a human error or deficiency. In today's ecosystem, the number of places where C is used is much greater than the number of places where C should be used. From simple probabilities, if a person is using strtok, it is likely that they should not.Territorialism
OMG! Maybe the function calling strtok() calls another function which also calls strtok() before the first function has finished with strtok() but also calls the first function that calls strtok() before the second function that called strtok() has finished with strtok(), and if both are also playing with the same files or pointers too it is no more dangerous than slicing two different fruits or even digits in the kitchen with a sharp knife, as @iharob one implied. Only a ... blames the tools.Tritium
@WeatherVane yes strtok is not reentrant.Desiccator
@iharob, maybe recursion will be written out of a language, and indeed in some houses it is forbidden.Tritium

© 2022 - 2024 — McMap. All rights reserved.