strtok() issue: If tokens are delimited by delimiters, why is the last token between a delimiter and the null '\0'?
In the following program, strtok() works as expected for the most part, but I can't understand the reason behind one result. I have read about strtok() that:

To determine the beginning and the end of a token, the function first scans from the starting location for the first character not contained in delimiters (which becomes the beginning of the token). And then scans starting from this beginning of the token for the first character contained in delimiters, which becomes the end of the token.

Source: http://www.cplusplus.com/reference/cstring/strtok/

And as we know, strtok() places a \0 at the end of each token. But in the following program, the last delimiter is a dot (.), after which comes Toad, between that dot and the closing quotation mark ("). The dot is a delimiter in my program, but there is no delimiter after Toad, not even a white space (which is also a delimiter in my program). Please clear up the following confusion arising from this premise:

Why does strtok() consider Toad a token even though it is not between two delimiters? This is what I read about what strtok() does when it encounters a null character (\0):

Once the terminating null character of str has been found in a call to strtok, all subsequent calls to this function with a null pointer as the first argument return a null pointer.

Source: http://www.cplusplus.com/reference/cstring/strtok/

Nowhere does it say that once a null character is encountered, a pointer to the beginning of the token is returned. (We don't even have a token here, since we never got an end of the token: no delimiter character was found in the scan that began from the beginning of the token, i.e. from the 'T' of Toad; we only found a null character, not a delimiter.) So why is the part between the last delimiter and the closing quotation mark of the argument string considered a token by strtok()? Please explain this.

Code:

#include <stdio.h>
#include <string.h>

int main(void)
{
    char str[] = " Falcon,eagle-hawk..;buzzard,gull..pigeon sparrow,hen;owl.Toad";
    char *pch = strtok(str, " ;,.-");

    while (pch != NULL)
    {
        printf("%s\n", pch);
        pch = strtok(NULL, " ;,.-");
    }

    return 0;
}

Output:

Falcon
eagle
hawk
buzzard
gull
pigeon
sparrow
hen
owl
Toad

Octet answered 15/5, 2013 at 17:7 Comment(6)
Not sure I understand your question; what output did you expect? That Toad would not be printed? Going by that logic if you remove the leading space in the input string, Falcon shouldn't be printed either. I would say that makes for some unintuitive behavior.Cetane
If you deleted the blank before the Falcon, strtok() would still consider 'Falcon' to be the first token.Giralda
@JonathanLeffler I have deliberately done that. Like I said, all is as expected from strtok(), except the last token, which is clearly not between two delimiters.Madder
@JonathanLeffler I regret I had to go outside right after posting this question.Madder
@Cetane Why shouldn't I expect Falcon to be printed? I have quoted from the source that the function first scans from the starting location for the first character not contained in delimiters, i.e. for the beginning of the token we don't need a delimiter (space is a delimiter in my program), but to mark the end of the token we clearly need a delimiter, and the null character at the end of the string is not in the delimiter list.Madder
@JonathanLeffler I am surprised I couldn't convey my point even to you in this question.Madder
W
9

The standard's specification of strtok (7.24.5.8) is pretty clear. In particular paragraph 4 (emphasis added by me) is directly relevant to the question, if I understand that correctly:

3 The first call in the sequence searches the string pointed to by s1 for the first character that is not contained in the current separator string pointed to by s2. If no such character is found, then there are no tokens in the string pointed to by s1 and the strtok function returns a null pointer. If such a character is found, it is the start of the first token.

4 The strtok function then searches from there for a character that is contained in the current separator string. If no such character is found, the current token extends to the end of the string pointed to by s1, and subsequent searches for a token will return a null pointer. If such a character is found, it is overwritten by a null character, which terminates the current token. The strtok function saves a pointer to the following character, from which the next search for a token will start.

In a call

char *where = strtok(string_or_NULL, delimiters);

the token (a pointer to which is) returned - if any - extends from the first non-delimiter character found from the starting position (inclusive) until the next delimiter character (exclusive), if one exists, or the end of the string, if no later delimiter character exists.

The linked description doesn't explicitly mention the case of a token extending until the end of the string, as opposed to the standard, so it is incomplete in that respect.

Warr answered 15/5, 2013 at 19:50 Comment(2)
"If no such character is found, the current token extends to the end of the string pointed to by s1, and subsequent searches for a token will return a null pointer." Thank you, that nails it, taken right from the standard. That's exactly what I wanted to know.Madder
BULL'S EYE. BANG ON TARGET!!Madder
G
4

Going to the description in POSIX for strtok(), the description says:

char *strtok(char *restrict s1, const char *restrict s2);

A sequence of calls to strtok() breaks the string pointed to by s1 into a sequence of tokens, each of which is delimited by a byte from the string pointed to by s2. The first call in the sequence has s1 as its first argument, and is followed by calls with a null pointer as their first argument. The separator string pointed to by s2 may be different from call to call.

The first call in the sequence searches the string pointed to by s1 for the first byte that is not contained in the current separator string pointed to by s2. If no such byte is found, then there are no tokens in the string pointed to by s1 and strtok() shall return a null pointer. If such a byte is found, it is the start of the first token.

The strtok() function then searches from there for a byte that is contained in the current separator string. If no such byte is found, the current token extends to the end of the string pointed to by s1, and subsequent searches for a token shall return a null pointer. If such a byte is found, it is overwritten by a NUL character, which terminates the current token. The strtok() function saves a pointer to the following byte, from which the next search for a token shall start.

Note the second sentence of the third paragraph:

If no such byte is found, the current token extends to the end of the string pointed to by s1, and subsequent searches for a token shall return a null pointer.

This clearly states that in the example in the question, Toad is indeed a token. One way to think of it is that the list of delimiters always includes the NUL '\0' at the end of the delimiter string.


Having diagnosed that, note that strtok() is not a good function to use — it is not thread safe or reentrant. On Windows, you can use strtok_s() instead; on Unix, you can usually use strtok_r(). These are better functions because they don't store internally the pointer at which the search is to resume.

Because strtok() is not reentrant, you cannot call a function that uses strtok() from inside a function that itself uses strtok() while it is using strtok(). Also, any library function that uses strtok() must be clearly identified as doing so because it cannot be called from a function that is using strtok(). So, using strtok() makes life hard.

The other problem with the strtok() family of functions (and with strsep(), which is related) is that they overwrite the delimiter; you can't find out what the delimiter was after the tokenizer has tokenized the string. This can matter in some applications, such as parsing shell command lines, where it matters whether the delimiter is a pipe or a semicolon or an ampersand (or ...). So shell parsers usually don't use strtok(), despite the number of questions on SO about shells where the parser does use strtok().

Generally, you should steer clear of plain strtok(), and it is up to you to decide whether strtok_r() or strtok_s() is appropriate for your purposes.

Giralda answered 15/5, 2013 at 19:56 Comment(2)
Daniel Fischer beat you to it just by a few seconds!!Madder
Yeah -- I shouldn't have gone to lunch...I saw that his answer arrived while I was writing mine, but only after I'd hit submit.Giralda
G
2

Because cplusplus.com isn't telling you the whole story. Cppreference.com has a better description.

Cplusplus.com also fails to mention that strtok is not thread-safe, and only documents the strtok function of the C++ programming language, whereas cppreference.com does mention the thread safety issue and documents the strtok functions of both the C and the C++ programming languages.

Girish answered 15/5, 2013 at 17:55 Comment(0)
L
0

strtok breaks a string into a sequence of tokens, separated by the given delimiters. Delimiters only separate tokens; they do not necessarily terminate them on both sides.

Lincolnlincolnshire answered 15/5, 2013 at 17:11 Comment(0)
S
0

Are you perhaps just mis-reading the description?

Once the terminating null character of str has been found in a call to strtok, all subsequent calls to this function with a null pointer as the first argument return a null pointer.

Given 'subsequent', I'm reading this as every call to strtok after the one that discovered \0, not necessarily the current one itself. So, the definition is consistent with behavior (and with what you would expect from strtok).

Serotherapy answered 15/5, 2013 at 17:14 Comment(8)
From the description in the source, it is obvious that it says that the end of a token is not possible without a delimiter. Subsequent calls or the current call doesn't matter in this context. Here is what it says about the end of a token: "And then scans starting from this beginning of the token for the first character contained in delimiters, which becomes the end of the token."Madder
@Rüppell'sVulture I agree that that description doesn't describe well in a case where the initial string is ".Toad". However it seems clear at this point that the issue here is just poor documentation on the part of the source, nothing wrong with strtok per se.Serotherapy
I won't say strtok is wrong even by a slip of the tongue!! Anyway, you got close to what I intend to ask. See, at the end of the penultimate token, the pointer is pointing to the T of Toad, but to mark the end of the token, it needs a delimiter. But there is no delimiter after that and the null character is encountered, at which point it stops. So how is Toad a token?Madder
@Rüppell'sVulture :) I'm not sure where you're from but in the US we say 'Uncle!' at this point--yes, you're right! cplusplus.com's documentation is inadequate. But though popular, there is no sense I know of in which it's canonical or representative of the C language in any official way. So perhaps shoot them an email...Serotherapy
Search for any library function in C and cplusplus.com comes first on Google. That had made me feel it's as holy as the Bible. But now I am having second thoughts. I have been cautioned many times in the last few days about that site. Whom to trust in this world now?Madder
Hey Matt, look at the two new answers I got from DF and JL.Madder
@Rüppell'sVulture Right, indeed "The linked description doesn't explicitly mention the case of a token extending until the end of the string, as opposed to the standard, so it is incomplete in that respect." from DF is exactly what I was saying. So looks like there's a pretty clear consensus on this one.Serotherapy
@MattPhllips I'll be careful about that site henceforth. Actually the layout of that site is very attractive and professional looking. And it has the "Business only, no small talk" feel about it.Madder
