How does strtok() split the string into tokens in C?
A

16

136

Please explain to me the working of strtok() function. The manual says it breaks the string into tokens. I am unable to understand from the manual what it actually does.

I added watches on str and *pch to check how it works. When the first while loop iteration ran, the contents of str were only "this". How, then, was the output shown below printed on the screen?

/* strtok example */
#include <stdio.h>
#include <string.h>

int main ()
{
  char str[] ="- This, a sample string.";
  char * pch;
  printf ("Splitting string \"%s\" into tokens:\n",str);
  pch = strtok (str," ,.-");
  while (pch != NULL)
  {
    printf ("%s\n",pch);
    pch = strtok (NULL, " ,.-");
  }
  return 0;
}

Output:

Splitting string "- This, a sample string." into tokens:
This
a
sample
string
Antennule answered 8/10, 2010 at 11:23 Comment(10)
strtok() modifies its argument string by terminating tokens with NUL before returning. If you try to examine the whole buffer (str[]) you'll see it being modified between successive calls to strtok().Whatley
Instead of watching str, watch str[0], str[1], str[2], ...Cammycamomile
@pmg:I watched str[0] and str[1].str[1] should be '\0',but it was a space there.Antennule
@fahad: if I have a string "xxxxxxhello" and make a pointer (pch) point to the 'h', the 'x's are as good as inexistent (as long as I access the string through pch)Cammycamomile
When there is nothing between the two delimiters,strtok does not append a '\0'?Antennule
so you want the empty char? not possible.Perspective
Honestly I've never bothered to check, but I imagine it stores the last pointer passed in, along with the position it left off at. Then it can just continue if the pointer is NULL, or clear the position and start over if not.Conjoint
Pretty much - it saves the last result it returned and continues searching from the next character if you pass NULL. This obviously makes it not thread safe, only one tokenization can be active at a time. ReferenceTamarind
Is it a closure? I don't know how a function can store the status in C.Triform
@Firegun: static variable.Tamarind
P
46

strtok() divides the string into tokens, i.e. each token is a run of characters that contains none of the delimiters. Delimiters at the start are skipped, so in your case the "-" and the space after it are passed over, and the first token starts at "T" and ends at the next delimiter, ",". Here you get "This" as output. Similarly, the rest of the string is split into tokens wherever delimiters occur, with the last token ending at ".".

Pyrrho answered 8/10, 2010 at 11:33 Comment(7)
the ending condition for one token becomes the starting token of the next token?also is there a nul character placed in the place of the ending condition?Antennule
@fahad- Yes, all the delimeters you have will be replaced by NUL character as other people have also suggested.Pyrrho
If all the delimiters are replaced by Nul,than why does the string contain"-this"? It should contain "\0"Antennule
@fahad - It only replaces the delimiter characters with NUL, not all the characters between delimiters. Its kind of splitting the string into multiple tokens. You get "This" because its between two specified delimiters and not the "-this".Pyrrho
so replacing the second delimiter,a nul is placed?Antennule
@Fahad - Yes, absolutely. All spaces, "," and "-" are replaced by NUL because you have specified these as delimiters, as far as I understand.Pyrrho
I observed str[0] and str[1].str[1] should be '\0' as you said because str[0] is '-',but it was a space there.Antennule
R
263

The strtok runtime function works like this:

The first time you call strtok, you provide the string that you want to tokenize:

char s[] = "this is a string";

In the above string, a space seems to be a good delimiter between words, so let's use that:

char* p = strtok(s, " ");

What happens now is that s is searched until a space character is found; that space is overwritten with '\0', the first token ('this') is returned, and p points to that token (string).

To get the next token and continue with the same string, NULL is passed as the first argument, since strtok maintains a static pointer to the string you passed previously:

p = strtok(NULL," ");

p now points to 'is'.

And so on until no more spaces can be found; then the rest of the string is returned as the last token, 'string'.

More conveniently, you could write it like this to print out all tokens:

for (char *p = strtok(s," "); p != NULL; p = strtok(NULL, " "))
{
  puts(p);
}

EDIT:

If you want to store the values returned by strtok, you need to copy each token to another buffer, e.g. with strdup(p), since the original string (pointed to by the static pointer inside strtok) is modified between iterations in order to return the token.

Rafaelof answered 8/10, 2010 at 11:51 Comment(9)
So it does not actually place a nul character between the string?Why does my watch show that the string is left only with "THIS"?Antennule
it does indeed replace the ' ' it found with '\0'. And, it does not restore ' ' later, so your string is ruined for good.Elboa
+1 for static buffer, this is what I didn't understandDebra
A very important detail, missing from the line "the first token is returned and p points to that token", is that strtok needs to mutate the original string by placing a null characters in place of a delimiter (otherwise other string functions wouldn't know where the token ends). And it also keeps track of the state using a static variable.Overweening
@Groo I think I already added that in the Edit that I did in 2017, but you are right.Rafaelof
@Rafaelof you still never explicitly mention that the delimiter is replaced by \0 which is necessary. You just say the string is modified.Pelite
so p points to the letter 't' at first. then when it finds a separator, turns that into \0. then you make a new call, passing NULL as argument. how come is p not "pointing" to NULL (i.e. not a NULL pointer) at this point, thus ending the loop?Shaniceshanie
How does p = strtok(NULL," "); know that you are working with s?Rubricate
@Rubricate In the first call to strtok, the pointer is stored internally in the function, probably in a static variable, so if the first argument is NULL it will use and modify that internal pointer.Rafaelof
M
35

strtok maintains a static, internal reference pointing to the next available token in the string; if you pass it a NULL pointer, it will work from that internal reference.

This is the reason strtok isn't re-entrant; as soon as you pass it a new pointer, that old internal reference gets clobbered.

Milker answered 17/5, 2012 at 18:22 Comment(2)
What do you mean by the old internal reference 'getting clobbered'. Do you mean 'overwritten'?Kickback
@ylun.ca: yes, that's what I mean.Milker
H
12

strtok doesn't change the parameter itself (str). It stores that pointer (in a local static variable). It can then change what that parameter points to in subsequent calls without having the parameter passed back. (And it can advance that pointer it has kept however it needs to perform its operations.)

From the POSIX strtok page:

This function uses static storage to keep track of the current string position between calls.

There is a thread-safe variant (strtok_r) that doesn't do this type of magic.

Hyams answered 17/5, 2012 at 18:22 Comment(5)
Well, the C library functions date from way-back-when, threading wasn't in the picture at all (that only started existing in 2011 as far as the C standard is concerned), so re-entrancy wasn't really important (I guess). That static local make the function "easy to use" (for some definition of "easy"). Like ctime returning a static string - practical (no-one needs to wonder who should free it), but not re-entrant and trips you up if you're not very aware of it.Hyams
This is wrong: "strtok doesn't change the parameter itself (str)." puts(str); prints "- This" since strtok modified str.Icken
@MarredCheese: read again. It does not modify the pointer. It modifies the data the pointer points to (i.e. the string data)Hyams
Oh ok, I didn't realize that's what you getting at. Agreed.Icken
As far as the standard says, the parameter itself being modified is allowed but not required.Tawannatawdry
S
10

strtok will tokenize a string i.e. convert it into a series of substrings.

It does that by searching for delimiters that separate these tokens (or substrings). And you specify the delimiters. In your case, you want ' ' or ',' or '.' or '-' to be the delimiter.

The programming model for extracting these tokens is that you hand strtok your main string and the set of delimiters. Then you call it repeatedly, and each time strtok returns the next token it finds, until it reaches the end of the main string, at which point it returns a null pointer. Another rule is that you pass the string in only the first time, and NULL on subsequent calls. This is how you tell strtok whether you are starting a new tokenizing session with a new string or retrieving tokens from the previous session. Note that strtok remembers its state for the tokenizing session, and for this reason it is not reentrant or thread-safe (use strtok_r instead where that matters). Another thing to know is that it actually modifies the original string: it writes '\0' over the delimiter that ends each token.

One way to invoke strtok, succinctly, is as follows:

char str[] = "this, is the string - I want to parse";
char delim[] = " ,-";
char* token;

for (token = strtok(str, delim); token; token = strtok(NULL, delim))
{
    printf("token=%s\n", token);
}

Result:

token=this
token=is
token=the
token=string
token=I
token=want
token=to
token=parse
Storekeeper answered 8/10, 2010 at 13:35 Comment(0)
P
9

The first time you call it, you provide the string to tokenize to strtok. And then, to get the following tokens, you just give NULL to that function, as long as it returns a non NULL pointer.

The strtok function records the string you first provided when you call it. (Which is really dangerous for multi-thread applications)

Pabulum answered 8/10, 2010 at 11:32 Comment(0)
M
5

strtok modifies its input string. It places null characters ('\0') in it so that it will return bits of the original string as tokens. In fact strtok does not allocate memory. You may understand it better if you draw the string as a sequence of boxes.

Moynihan answered 8/10, 2010 at 11:32 Comment(6)
@Tawannatawdry a pointer is allocated for strtok at the time when the program is loaded for execution; calling strtok does not allocate any new memoryMoynihan
Then where are the function arguments stored? Nowhere?Tawannatawdry
@Tawannatawdry they are stored in the static memory segment. This is a one-time-per-process allocation. When someone asks if a function allocates memory, they want to know about any malloc calls being made as a result of calling the function. Strtok does not result in any malloc calls, which can be good or bad depending on the programmer needs. No new memory allocated, but it modifies the input string, and you can't do concurrent parsing of two strings in the same process.Moynihan
Wrong! The function arguments of strtok (and most [all?] other functions) have automatic storage duration and not static storage duration. A libc programmer could seriously mess things up based on your misinformation.Tawannatawdry
@Tawannatawdry I think you're confusedMoynihan
I think you're confused. I asked about the function arguments and not the tracking pointer.Tawannatawdry
H
4

To understand how strtok() works, one first needs to know what a static variable is. This link explains it quite well....

The key to the operation of strtok() is preserving the location of the last separator between successive calls (that's why strtok() continues to parse the original string passed to it when it is invoked with a null pointer in successive calls).

Have a look at my own strtok() implementation, called zStrtok(), which has slightly different functionality than the one provided by strtok():

char *zStrtok(char *str, const char *delim) {
    static char *static_str=0;      /* var to store last address */
    int index=0, strlength=0;           /* integers for indexes */
    int found = 0;                  /* check if delim is found */

    /* delimiter cannot be NULL
    * if no more char left, return NULL as well
    */
    if (delim==0 || (str == 0 && static_str == 0))
        return 0;

    if (str == 0)
        str = static_str;

    /* get length of string */
    while(str[strlength])
        strlength++;

    /* find the first occurrence of delim */
    for (index=0;index<strlength;index++)
        if (str[index]==delim[0]) {
            found=1;
            break;
        }

    /* if delim is not contained in str, return str */
    if (!found) {
        static_str = 0;
        return str;
    }

    /* check for consecutive delimiters
    *if first char is delim, return delim
    */
    if (str[0]==delim[0]) {
        static_str = (str + 1);
        return (char *)delim;
    }

    /* terminate the string
    * this assignment requires char[], so str has to
    * be char[] rather than a pointer to a string literal
    */
    str[index] = '\0';

    /* save the rest of the string; str + index + 1 is never
    * a null pointer, so no check is needed here
    */
    static_str = (str + index + 1);

    return str;
}

And here is an example usage

  Example Usage
      char str[] = "A,B,,,C";
      printf("1 %s\n",zStrtok(str,","));
      printf("2 %s\n",zStrtok(NULL,","));
      printf("3 %s\n",zStrtok(NULL,","));
      printf("4 %s\n",zStrtok(NULL,","));
      printf("5 %s\n",zStrtok(NULL,","));
      printf("6 %s\n",zStrtok(NULL,","));

  Example Output
      1 A
      2 B
      3 ,
      4 ,
      5 C
      6 (null)

The code is from a string processing library I maintain on Github, called zString. Have a look at the code, or even contribute :) https://github.com/fnoyanisi/zString

Hornet answered 18/2, 2016 at 1:1 Comment(0)
D
3

This is how I implemented strtok. It's not that great, but after working on it for 2 hours I finally got it to work. It does support multiple delimiters.

#include <cstddef>
#include <iostream>
using namespace std;

// Note: unlike the standard strtok, this version returns an empty token
// for each pair of adjacent delimiters.
char* mystrtok(char str[], const char filter[])
{
    static char *ptr = NULL;      // where the next call resumes scanning
    if(filter == NULL) {
        return str;
    }
    if(str != NULL) {
        ptr = str;                // a non-NULL str starts a new tokenization
    }
    if(ptr == NULL) {
        return NULL;              // the last token was already returned
    }
    char* ptrReturn = ptr;
    for(int j = 0; ptr[j] != '\0'; j++) {
        for(int i = 0; filter[i] != '\0'; i++) {
            if(ptr[j] == filter[i]) {
                ptr[j] = '\0';    // terminate the token at the delimiter
                ptr += j + 1;     // resume after the delimiter next time
                return ptrReturn;
            }
        }
    }
    ptr = NULL;                   // reached the end: this is the last token
    return ptrReturn;
}

int main()
{
    char str[200] = "This,is my,string.test";
    char *ppt = mystrtok(str,", .");
    while(ppt != NULL ) {
        cout<< ppt << endl;
        ppt = mystrtok(NULL,", .");
    }
    return 0;
}
Determinate answered 8/5, 2017 at 18:55 Comment(0)
D
3

For those who are still having a hard time understanding the strtok() function, take a look at this pythontutor example; it is a great tool for visualizing your C (or C++, Python ...) code.

In case the link breaks, paste in:

#include <stdio.h>
#include <string.h>

int main()
{
    char s[] = "Hello, my name is? Matthew! Hey.";
    for (char *p = strtok(s," ,?!."); p != NULL; p = strtok(NULL, " ,?!.")) {
      puts(p);
    }
    return 0;
}

Credits go to Anders K.

Determinative answered 8/5, 2018 at 12:24 Comment(0)
S
2

Here is my implementation, which uses a hash table for the delimiters, which means it is O(n) instead of O(n^2) (here is a link to the code):

#include<stdio.h>
#include<stdlib.h>
#include<string.h>

#define DICT_LEN 256

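int *create_delim_dict(char *delim)
{
    /* calloc zero-fills the table; the caller checks for NULL */
    int *d = (int*)calloc(DICT_LEN, sizeof(int));
    if(!d) {
        return NULL;
    }

    size_t i;
    for(i=0; delim[i] != '\0'; i++) {
        /* cast so that a negative char value cannot index before the table */
        d[(unsigned char)delim[i]] = 1;
    }
    return d;
}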



char *my_strtok(char *str, char *delim)
{

    static char *last, *to_free;
    int *deli_dict = create_delim_dict(delim);

    if(!deli_dict) {
        /*this check if we allocate and fail the second time with entering this function */
        if(to_free) {
            free(to_free);
        }
        return NULL;
    }

    if(str) {
        last = (char*)malloc(strlen(str)+1);
        if(!last) {
            free(deli_dict);
            return NULL;
        }
        to_free = last;
        strcpy(last, str);
    }

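    /* skip leading delimiters */
    while(*last != '\0' && deli_dict[(unsigned char)*last]) {
        last++;
    }
    str = last;
    if(*last == '\0') {
        free(deli_dict);
        free(to_free);
        deli_dict = NULL;
        to_free = NULL;
        return NULL;
    }
    while (*last != '\0' && !deli_dict[(unsigned char)*last]) {
        last++;
    }

    /* terminate the token; step past the delimiter only if one was found,
       so last never moves beyond the end of the copied buffer */
    if(*last != '\0') {
        *last = '\0';
        last++;
    }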

    free(deli_dict);
    return str;
}

int main()
{
    char * str = "- This, a sample string.";
    char *del = " ,.-";
    char *s = my_strtok(str, del);
    while(s) {
        printf("%s\n", s);
        s = my_strtok(NULL, del);
    }
    return 0;
}
Stratton answered 5/3, 2017 at 13:16 Comment(0)
F
2

strtok() stores a pointer to where it last left off in a static variable, so on the second call, when we pass NULL, strtok() gets the pointer from that static variable.

If you pass the same string again, it starts again from the beginning.

Moreover, strtok() is destructive, i.e. it makes changes to the original string, so make sure you always keep a copy of the original.

One more problem with strtok() is that, because it stores the address in a static variable, calling it from more than one thread at a time in a multithreaded program can give wrong results. For this, use strtok_r().

Figment answered 27/1, 2018 at 12:5 Comment(0)
D
1

strtok replaces the delimiter characters (those listed in the second argument) that end each token with a null character ('\0'), and a null character also marks the end of a string.

http://www.cplusplus.com/reference/clibrary/cstring/strtok/

Dierolf answered 8/10, 2010 at 11:32 Comment(0)
D
0

You can scan the char array looking for the delimiter; if you find it, just print a newline, otherwise print the char.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main()
{
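    char *s;
    s = malloc(1024 * sizeof(char));
    if (s == NULL) {
        return 1;
    }
    s[0] = '\0';                /* stays empty if scanf reads nothing */
    scanf("%1023[^\n]", s);     /* the width keeps the input inside the buffer */
    s = realloc(s, strlen(s) + 1);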
    int len = strlen(s);
    char delim =' ';
    for(int i = 0; i < len; i++) {
        if(s[i] == delim) {
            printf("\n");
        }
        else {
            printf("%c", s[i]);
        }
    }
    free(s);
    return 0;
}
Dialogist answered 6/12, 2019 at 21:7 Comment(0)
T
0

So, this is a code snippet to help better understand this topic.

Printing Tokens

Task: Given a sentence, s, print each word of the sentence in a new line.

char *s;
s = malloc(1024 * sizeof(char));
scanf("%1023[^\n]", s);   /* the width keeps the input inside the buffer */
s = realloc(s, strlen(s) + 1);
//logic to print the tokens of the sentence.
for (char *p = strtok(s," "); p != NULL; p = strtok(NULL, " "))
{
    printf("%s\n",p);
}

Input: How is that

Result:

How
is
that

Explanation: Here the strtok() function is called in a for loop to print the tokens on separate lines.

The function takes the string and the delimiters ('break-points') as parameters, breaks the string at those delimiters, and forms tokens. Each token in turn is pointed to by p and then printed.

Towardly answered 29/2, 2020 at 8:7 Comment(1)
i think explaining via an example is much better than referring to some doc.Towardly
C
0

strtok replaces the delimiter with the '\0' (null) character in the given string.

CODE

#include<iostream>
#include<cstring>

int main()
{
    char s[]="30/4/2021";     
    std::cout<<(void*)s<<"\n";    // e.g. 0x70fdf0; the address varies between runs
    
    char *p1=s;                   // same address as s
    std::cout<<p1<<"\n";
    
    char *p2=strtok(s,"/");
    std::cout<<(void*)p2<<"\n";
    std::cout<<p2<<"\n";
    
    char *p3=s;                   // same address as s
    std::cout<<p3<<"\n";
    
    for(int i=0;i<=9;i++)
    {
        std::cout<<*p1;
        p1++;
    }
    
}

OUTPUT

0x70fdf0       // 1. address of string s
30/4/2021      // 2. print string s through ptr p1 
0x70fdf0       // 3. this address is return by strtok to ptr p2
30             // 4. print string which pointed by p2
30             // 5. again assign address of string s to ptr p3 try to print string
30 4/2021      // 6. print characters of string s one by one using loop

Before tokenizing the string

I assigned the address of string s to a pointer (p1) and printed the string through that pointer; the whole string is printed.

After tokenizing

strtok returns the address of string s to the pointer p2, but when I print the string through that pointer it only prints "30", not the whole string. So strtok is not just returning an address; it also places a '\0' character where the delimiter was.

Cross-check

1.

Again I assign the address of string s to a pointer (p3) and print the string; it prints "30", because tokenizing overwrote the delimiter with '\0'.

2.

Printing string s character by character in a loop shows that the first delimiter has been replaced by '\0', so a blank is printed rather than '/'.

Cheke answered 1/5, 2021 at 3:41 Comment(0)
