Yes, strtok
is hopelessly broken, even in a simple single-threaded program, and I will demonstrate this failure with some sample code:
Let us begin with a simple text-analyzer function to gather statistics about sentences of text, using strtok
.
This code will lead to undefined behavior.
In this example, a sentence is a set of words separated by spaces, commas, semi-colons, and periods.
// Example:
// int words, longest;
// GetSentenceStats("There were a king with a large jaw and a queen with a plain face, on the throne of England.", &words, &longest);
// will report there are 20 words, and the longest word has 7 characters ("England").
void GetSentenceStats(const char* sentence, int* pWordCount, int* pMaxWordLen)
{
char* delims = " ,;."; // In a sentence, words are separated by spaces, commas, semi-colons or period.
char* input = strdup(sentence); // Make an local copy of the sentence, to be modified without affecting the caller.
*pWordCount = 0; // Initialize the output to Zero
*pMaxWordLen = 0;
char* word = strtok(input, delims);
while(word)
{
(*pWordCount)++;
*pMaxWordLen = MAX(*pMaxWordLen, (int)strlen(word));
word = strtok(NULL, delims);
}
free(input);
}
This simple function works. There are no bugs so far.
Now let us augment our library to add a function that gathers stats on Paragraphs of text.
A paragraph is a set of sentences separated by Exclamation Marks, Question Marks and Periods.
It will return the number of sentences in the paragraph, and the number of words in the longest sentence.
And perhaps most importantly, it will use the earlier function GetSentenceStats
to help
void GetParagraphStats(const char* paragraph, int* pSentenceCount, int* pMaxWords)
{
char* delims = ".!?"; // Sentences in a paragraph are separated by Period, Question-Mark, and Exclamation.
char* input = strdup(paragraph); // Make an local copy of the paragraph, to be modified without affecting the caller.
*pSentenceCount = 0;
*pMaxWords = 0;
char* sentence = strtok(input, delims);
while(sentence)
{
(*pSentenceCount)++;
int wordCount;
int longestWord;
GetSentenceStats(sentence, &wordCount, &longestWord);
*pMaxWords = MAX(*pMaxWords, wordCount);
sentence = strtok(NULL, delims); // This line returns garbage data,
}
free(input);
}
This function also looks very simple and straightforward.
But it does not work, as demonstrated by this sample program.
int main(void)
{
int cnt;
int len;
// First demonstrate that the SentenceStats function works properly:
char *sentence = "There were a king with a large jaw and a queen with a plain face, on the throne of England.";
GetSentenceStats(sentence, &cnt, &len);
printf("Word Count: %d\nLongest Word: %d\n", cnt, len);
// Correct Answer:
// Word Count: 20
// Longest Word: 7 ("England")
printf("\n\nAt this point, expected output is 20/7.\nEverything is working fine\n\n");
char paragraph[] = "It was the best of times!" // Literary purists will note I have changed Dicken's original text to make a better example
"It was the worst of times?"
"It was the age of wisdom."
"It was the age of foolishness."
"We were all going direct to Heaven!";
int sentenceCount;
int maxWords;
GetParagraphStats(paragraph, &sentenceCount, &maxWords);
printf("Sentence Count: %d\nLongest Sentence: %d\n", sentenceCount, maxWords);
// Correct Answer:
// Sentence Count: 5
// Longest Sentence: 7 ("We were all going direct to Heaven")
printf("\n\nAt the end, expected output is 5/7.\nBut Actual Output is Undefined Behavior! Strtok is hopelessly broken\n");
_getch();
return 0;
}
All calls to strtok
are entirely correct, and are on separate data.
But the result is Undefined Behavior!
Why does this happen?
When GetParagraphStats
is called, it begins a strtok
-loop to get sentences.
On the first sentence it will call GetSentenceStats
. GetSentenceStats
will also being a strtok
-loop, losing all state established by GetParagraphStats
.
When GetSentenceStats
returns, the caller (GetParagraphStats
) will call strtok(NULL)
again to get the next sentence.
But strtok
will think this is a call to continue the previous operation, and will continue tokenizing memory that has now been freed!
The result is the dreaded Undefined Behavior.
When is it safe to use strtok?
Even in a single-threaded environment, strtok
can only be used safely when the programmer/architect is sure of two conditions:
The function using strtok
must never call any function that may also use strtok.
If it calls a subroutine that also uses strtok, its own use of strtok may be interrupted.
The function using strtok
must never be called by any function that may also use strtok.
If this function ever called by another routine using strtok, then this function will interrupt the callers use of strtok.
In a multi-threaded environment, use of strtok
is even more impossible, because the programmer needs to be sure that there is only one use of strtok
on the current thread, and also, no other threads are using strtok
either.
strtok
is often used wrong. – Brilliantstrtok
still leads to Undefined Behavior. – Fuggerstrtok
as if it were re-entrant. But it isn't. – Brilliantstrtok
for the same string are generally treated as if they're one call tostrtok
that gets interrupted, and re-entered later (because the two are conceptually the same). But if you prefer, I'll phrase my earlier comment differently : the same reason whystrtok
is not re-entrant (it keeps global state), is the reason why your code is invalid. – Brilliantprintf
without also tracking where other calls toprintf
occur, because they might interfere with each other? With strtok, you need to not only track your own use of the function, but any other use anywhere in the program! – FuggerGetSentenceStats
was written by the Sentence-team: They wrote unit-tests, they verified its behavior, and delivered their finished product to another team. The Paragraph-team does not know that the Sentence-team usedstrtok
. The Paragraph-team wrote their function to the specification, and called the Sentence-function properly according to the spec. Every team followed their obligations, and the program still does not work. – Fuggerstrtok
is broken because you can't use it in a way that it was never designed for, makes no sense. Sure,strtok
can only be safely used under certain well-defined conditions (which is why they addedstrtok_s
later). But under those well-defined conditions,strtok
is good at what it does, and is not broken. – Brilliant