Why should strtok() be deprecated?

Asked 2/6, 2017 at 20:17 Answered 2/6, 2017 at 20:28

I hear from a lot of programmers that strtok may be deprecated in the near future. Some say it already is. Why is it a bad choice? strtok() works great for tokenizing a given string. Does it have anything to do with time and space complexity? The best link I found on the internet was this. But that doesn't satisfy my curiosity. Suggest any alternatives if possible.

Berey answered 2/6, 2017 at 20:17 Comment(3)

At least my own argument is that it is misleadingly destructive. It modifies the source string which generally one does not want to do while tokenising. – Ethyne 2/6, 2017 at 20:20

For me, once I got comfortable with using regcomp and regexec I found using regex(3) to be much more useful and powerful. – Treulich 2/6, 2017 at 20:24

Possible duplicate of Why is strtok() Considered Unsafe? – Sporangium 3/6, 2017 at 3:40

Why is it a bad choice?

The fundamental technique for solving problems by programming is to construct abstractions which can be used reliably to solve sub-problems, and then compose solutions to those sub-problems into solutions to larger problems.

strtok's behaviour works directly against these goals in a variety of ways; it is a poor abstraction that is unreliable because it composes poorly.

The fundamental problem of tokenization is: given a position in a string, give the position of the end of the token beginning at that position. If strtok did only that, it would be great. It would have a clear abstraction, it would not rely on hidden global state, and it would not modify its inputs.
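As a rough illustration of that idea, here is a minimal sketch built on the standard strspn and strcspn functions. The helper names skip_delims, token_end and print_tokens are made up for this example; they are not part of any library. Nothing here writes to the string or keeps state between calls.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: offset of the first non-delimiter at or after pos. */
static size_t skip_delims(const char *s, size_t pos, const char *delims)
{
    return pos + strspn(s + pos, delims);
}

/* Hypothetical helper: offset just past the token that begins at pos. */
static size_t token_end(const char *s, size_t pos, const char *delims)
{
    return pos + strcspn(s + pos, delims);
}

/* Prints each token on its own line and returns how many were found.
 * The input is only read, so a string literal is a valid argument. */
static size_t print_tokens(const char *s, const char *delims)
{
    size_t count = 0;
    size_t pos = skip_delims(s, 0, delims);
    while (s[pos] != '\0') {
        size_t end = token_end(s, pos, delims);
        printf("%.*s\n", (int)(end - pos), s + pos);
        count++;
        pos = skip_delims(s, end, delims);
    }
    return count;
}
```

Because the position is an ordinary value owned by the caller, two of these loops can run over different strings, or nest, without interfering with each other.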

To see the limitations of strtok, imagine trying to tokenize a language where we wish to separate tokens by spaces, unless the token is enclosed in " ", in which case we wish to apply a different tokenization rule to the contents of the quoted area, and then pick up with the space separation rule after. strtok composes very poorly with itself, and is therefore only useful for the most trivial of tokenization tasks.

Does it have to do anything with the time and space complexities?

No.

Suggest any alternatives if possible.

Lexers are not hard to write; just write one!

Bonus points if you write an immutable lexer. An immutable lexer is a little struct that contains a reference to the string being lexed, the current position of the lexer, and any state needed by the lexer. To extract a token you call a "next token" method, pass in the lexer, and you get back the token and a new lexer. The new lexer can then be used to lex the next token, and you discard the previous lexer if you wish.

The immutable lexer technique is easier to reason about than lexers which modify state. And you can debug them by saving the discarded lexers in a list, and now you have the complete history of tokenization operations open to inspection at once.
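One possible shape for such a lexer in C is sketched below. It is only an illustration: the type and function names (Lexer, Token, LexResult, next_token) are invented here, and the rules it applies (space-separated tokens, with a double-quoted run treated as a single token, quotes included) are an assumption chosen to match the quoted-string scenario above.

```c
#include <stddef.h>
#include <string.h>

/* All names below are hypothetical, chosen for this sketch. */
typedef struct {
    const char *text;  /* string being lexed; never modified */
    size_t pos;        /* offset of the next unread character */
} Lexer;

typedef struct {
    const char *start; /* points into the lexer's text */
    size_t len;        /* 0 means end of input */
} Token;

typedef struct {
    Token token;
    Lexer next;        /* lexer positioned just after the token */
} LexResult;

/* Takes the lexer by value and returns a token plus a new lexer. */
static LexResult next_token(Lexer lx)
{
    size_t start = lx.pos + strspn(lx.text + lx.pos, " ");
    size_t end;

    if (lx.text[start] == '"') {
        /* Quoted token: runs through the closing quote, or to the
         * end of the input if the quote is never closed. */
        const char *close = strchr(lx.text + start + 1, '"');
        end = close ? (size_t)(close - lx.text) + 1
                    : start + strlen(lx.text + start);
    } else {
        end = start + strcspn(lx.text + start, " ");
    }

    LexResult r = { { lx.text + start, end - start }, { lx.text, end } };
    return r;
}
```

Starting from Lexer lx = { text, 0 }, call next_token repeatedly, feeding each result's next field back in, until a token of length 0 comes back. Every earlier Lexer value remains valid, so keeping them in an array gives the inspectable history described above.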

Uptown answered 2/6, 2017 at 20:28 Comment(5)

I never saw lexers this way. Thanks for bringing that up – Berey 2/6, 2017 at 20:37

"it would not modify its inputs" ... which is actually invalid in some common situations; for example, strtok("hello world", " ") is clearly wrong to a seasoned C programmer, yet to a beginner this seems like it'd be fine and dandy! Nonetheless, it's an easy mistake to make for both. – Dodecanese 3/6, 2017 at 1:28

While this answer describes strtok's limitations in comparison to proper lexers, I don't feel that it directly explains why it should be deprecated (except for a brief mention that "strtok composes very poorly with itself" without explaining why it composes poorly). Also, usually things that are deprecated from the standard library are replaced with something else (which in the case of strtok probably would be something like strtok_r or strtok_s). – Denney 3/6, 2017 at 4:57

@Denney Or strsep, or some logic using strcspn and memcpy as building blocks. – Erythrocyte 3/6, 2017 at 5:38

@Denney: I encourage you to write an answer that you like better, that we might all benefit from your insights. – Uptown 5/6, 2017 at 16:41

The limitation of strtok(char *str, const char *delim) is that it can't work on multiple strings simultaneously, because it keeps a static pointer to the position it has parsed up to (so it is sufficient only when working on one string at a time). A better and safer method is strtok_r(char *str, const char *delim, char **saveptr), which takes an explicit third argument in which to save the parse position.
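A small sketch of where the difference shows up: nested tokenization, splitting records on ';' and then each record on ','. The helper name count_fields and the record format are made up for illustration. Note that strtok_r is POSIX rather than ISO C, and that it still writes into the string being split.

```c
#define _POSIX_C_SOURCE 200809L  /* strtok_r is POSIX, not ISO C */
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: splits `records` on ';' and each record on ','.
 * Both levels are modified in place, so `records` must be writable
 * (not a string literal). Returns the total number of fields seen. */
static int count_fields(char *records)
{
    int fields = 0;
    char *outer_state;  /* saved position for the ';' pass */

    for (char *rec = strtok_r(records, ";", &outer_state); rec != NULL;
         rec = strtok_r(NULL, ";", &outer_state)) {
        char *inner_state;  /* separate saved position for the ',' pass */

        for (char *field = strtok_r(rec, ",", &inner_state); field != NULL;
             field = strtok_r(NULL, ",", &inner_state)) {
            printf("%s\n", field);
            fields++;
        }
    }
    return fields;
}
```

With plain strtok, the inner loop would overwrite the single hidden position that the outer loop depends on, so the outer loop would stop after the first record.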

Touzle answered 2/6, 2017 at 20:22 Comment(3)

In other words, it inherently modifies not just global state, but hidden global state! – Aquarium 2/6, 2017 at 20:27

Put another way, it's not re-entrant. – Tussock 3/6, 2017 at 6:21

Put another way, it sucks in multiple ways :) – Wingfooted 4/6, 2017 at 11:20
