Using strtok with a std::string
C

14

61

I have a string that I would like to tokenize. But the C strtok() function requires my string to be a char*. How can I do this simply?

I tried:

token = strtok(str.c_str(), " "); 

which fails because c_str() returns a const char*, not a char*, and strtok() needs a buffer it can modify.

Capacitance answered 14/11, 2008 at 6:14 Comment(1)
See this question: #54349Maladapted
E
80
#include <iostream>
#include <string>
#include <sstream>
int main(){
    std::string myText("some-text-to-tokenize");
    std::istringstream iss(myText);
    std::string token;
    while (std::getline(iss, token, '-'))
    {
        std::cout << token << std::endl;
    }
    return 0;
}

Or, as mentioned, use boost for more flexibility.

Emogene answered 14/11, 2008 at 6:29 Comment(2)
strtok() supports multiple delimiters while getline does not. Is there a simple way to circumvent that?Crosscrosslet
@Crosscrosslet I believe you could use regex_token_iterator to tokenize with multiple delimiters. And thanks for the blast from the past, I answered the original question a loooooong time ago :)Emogene
H
22

Duplicate the string, tokenize it, then free it.

char *dup = strdup(str.c_str());
token = strtok(dup, " ");
// ... use token here: it points into dup, so it must not be used after free() ...
free(dup);
Hofmann answered 14/11, 2008 at 6:22 Comment(6)
Isn't the better question, why use strtok when the language in question has better native options?Luminesce
Not necessarily. If the context of the question surrounds maintaining a fragile codebase, then stepping away from the existing approach (notionally strtok in my example) is riskier than keeping it. Without more context in the question, I prefer to answer what is asked.Hofmann
If the asker is a newbie, you should warn against doing free() before using token... :-)Radioactivity
I am dubious that using a more robust native tokenizer is ever less safe than inserting new code that calls a library that inserts nulls into the block of memory passed to it... that's why I did not think it a good idea to answer the question as asked.Luminesce
Note that strtok() is not thread-safe or re-entrant. In a program with multiple tasks, it should be avoided.Let
Also, while we are at it, we should note that strdup() comes from POSIX which is why it may be preferable not to use it.Latchkey
C
20
  1. If boost is available on your system (I think it's standard on most Linux distros these days), it has a Tokenizer class you can use.

  2. If not, then a quick Google turns up a hand-rolled tokenizer for std::string that you can probably just copy and paste. It's very short.

  3. And, if you don't like either of those, then here's a split() function I wrote to make my life easier. It'll break a string into pieces using any of the chars in "delim" as separators. Pieces are appended to the "parts" vector:

    #include <string>
    #include <vector>
    using std::string;
    using std::vector;

    void split(const string& str, const string& delim, vector<string>& parts) {
      size_t start, end = 0;
      while (end < str.size()) {
        start = end;
        while (start < str.size() && (delim.find(str[start]) != string::npos)) {
          start++;  // skip leading delimiters
        }
        end = start;
        while (end < str.size() && (delim.find(str[end]) == string::npos)) {
          end++; // skip to end of word
        }
        if (end-start != 0) {  // just ignore zero-length strings.
          parts.push_back(string(str, start, end-start));
        }
      }
    }
    
Crumhorn answered 14/11, 2008 at 6:28 Comment(1)
The hand-rolled link is brokenAlcoran
C
10

There is a more elegant solution.

With std::string you can use resize() to allocate a suitably large buffer, and &s[0] to get a pointer to the internal buffer.

At this point many fine folks will jump and yell at the screen. But this is the fact. About 2 years ago the library working group decided (meeting at Lillehammer) that, just like for std::vector, std::string should also formally, not just in practice, have a guaranteed contiguous buffer.

The other concern is whether strtok() increases the size of the string. The MSDN documentation says:

Each call to strtok modifies strToken by inserting a null character after the token returned by that call.

But this is not quite accurate. The function does not insert anything: it overwrites the first separator character after each token with \0, so the size of the string does not change. If we have this string:

one-two---three--four

we will end up with

one\0two\0--three\0-four

So my solution is very simple:


std::string str("some-text-to-split");
char seps[] = "-";
char *token;

token = strtok( &str[0], seps );
while( token != NULL )
{
   /* Do your thing */
   token = strtok( NULL, seps );
}

Read the discussion on http://www.archivum.info/comp.lang.c++/2008-05/02889/does_std::string_have_something_like_CString::GetBuffer

Chemist answered 20/10, 2009 at 6:49 Comment(5)
-1. strtok() works on a null-terminated string while std::string's buffer is not required to be null-terminated. There is no way around c_str().Syncopated
@Syncopated std::string's buffer is required to be null-terminated. data and c_str are required to be identical and data() + i == &operator[](i) for every i in [0, size()].Burmeister
@Leushenko you're partially right. Null-termination is only guaranteed since C++11. I've added a note to the answer. I'll lift my -1 as soon as my edit is accepted.Syncopated
This hack is not worth it. This "elegant" solution wrecks std::string object in a few ways. std::cout << str << " " << str.size(); std::cout << str.c_str()<< " " << strlen(str.c_str()); Before: some-text-to-split 18 some-text-to-split 18 After: sometexttosplit 18 some 4.Ami
What is the use of "token = strtok( NULL, seps )" in the code above? Please answer, because I tried to search for this usage but could not find much.Zoon
T
3

With C++17, std::string gains a data() overload that returns a pointer to a modifiable buffer, so the string can be passed to strtok directly without any hacks:

#include <string>
#include <iostream>
#include <cstring>
#include <cstdlib>

int main()
{
    ::std::string text{"pop dop rop"};
    char const * const psz_delimiter{" "};
    char * psz_token{::std::strtok(text.data(), psz_delimiter)};
    while(nullptr != psz_token)
    {
        ::std::cout << psz_token << ::std::endl;
        psz_token = std::strtok(nullptr, psz_delimiter);
    }
    return EXIT_SUCCESS;
}

output

pop
dop
rop

Tubule answered 8/8, 2019 at 9:48 Comment(2)
note: the original std::string will not hold the same value anymore, as strtok replaces the delimiter it found with a null terminator in place, instead of returning you a copy of the string. if you want to keep the original string, create a copy of the string and pass that into strtok.Chamness
@Chamness note: if strtok handles only a single delimiter, then the original value of the string may be preserved by putting the delimiter back in place of the null terminator on each iteration.Tubule
H
2

EDIT: the const_cast is used only to demonstrate the effect of strtok() when applied to a pointer returned by string::c_str().

You should not use strtok() since it modifies the tokenized string which may lead to undesired, if not undefined, behaviour as the C string "belongs" to the string instance.

#include <string>
#include <cstring>
#include <iostream>

int main(int ac, char **av)
{
    std::string theString("hello world");
    std::cout << theString << " - " << theString.size() << std::endl;

    //--- this cast *only* to illustrate the effect of strtok() on std::string 
    char *token = strtok(const_cast<char  *>(theString.c_str()), " ");

    std::cout << theString << " - " << theString.size() << std::endl;

    return 0;
}

After the call to strtok(), the space appears to have been "removed" from the string. In fact it was replaced by a non-printable null character, which is why the length remains unchanged.

>./a.out
hello world - 11
helloworld - 11

Therefore you have to resort to a native mechanism, duplication of the string, or a third-party library, as previously mentioned.

Hotchkiss answered 14/11, 2008 at 7:59 Comment(4)
casting away the const does not help. It is const for a reason.Aranyaka
@Martin York, @Sherm Pendley : did you read the conclusion or only the code snippet ? I edited my answer to clarify what I wanted to show here. Rgds.Hotchkiss
@Philippe - Yes, I only read the code. A lot of people will do that, and go straight to the code and skip the explanation. Perhaps putting the explanation in the code, as a comment, would be a good idea? Anyhow, I removed my down vote.Sapphera
Does anybody know a compiler (Warning-switch) or a static code analyzer that warns about issues like this?Disconcert
R
1

I suppose the language is C, or C++...

strtok, IIRC, replaces separators with \0. That's why it cannot take a const string. To work around that "quickly", if the string isn't huge, you can just strdup() it. That is wise if you need to keep the string unaltered (which is what the const suggests...).

On the other hand, you might want to use another tokenizer, perhaps hand rolled, less violent on the given argument.

Radioactivity answered 14/11, 2008 at 6:23 Comment(0)
S
1

Assuming that by "string" you're talking about std::string in C++, you might have a look at the Tokenizer package in Boost.

Sapphera answered 14/11, 2008 at 6:29 Comment(0)
A
0

First off I would say use boost tokenizer.
Alternatively if your data is space separated then the string stream library is very useful.

But both the above have already been covered.
So as a third C-Like alternative I propose copying the std::string into a buffer for modification.

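// Needs <string>, <vector> and <cstring>
std::string   data("The data I want to tokenize");

// Create a buffer of the correct length:
std::vector<char>  buffer(data.size()+1);

// copy the string into the buffer
strcpy(&buffer[0],data.c_str());

// Tokenize: pass the buffer on the first call and NULL on every later call
for (char* token = strtok(&buffer[0]," "); token != NULL; token = strtok(NULL," "))
{
    // do something with token (it points into buffer)
}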
Aranyaka answered 14/11, 2008 at 10:5 Comment(0)
T
0

If you don't mind open source, you could use the subbuffer and subparser classes from https://github.com/EdgeCast/json_parser. The original string is left intact, there is no allocation and no copying of data. I have not compiled the following so there may be errors.

std::string input_string("hello world");
subbuffer input(input_string);
subparser flds(input, ' ', subparser::SKIP_EMPTY);
while (!flds.empty())
{
    subbuffer fld = flds.next();
    // do something with fld
}

// or if you know it is only two fields
subbuffer fld1 = input.before(' ');
subbuffer fld2 = input.sub(fld1.length() + 1).ltrim(' ');
Tarr answered 11/6, 2015 at 13:30 Comment(0)
C
0

Chris's answer is probably fine when using std::string; however, in case you want to use std::basic_string<char16_t>, std::getline can't be used. Here is another possible implementation:

template <class CharT> bool tokenizestring(const std::basic_string<CharT> &input, CharT separator, typename std::basic_string<CharT>::size_type &pos, std::basic_string<CharT> &token) {
    if (pos >= input.length()) {
        // if input is empty, or ends with a separator, return an empty token when the end has been reached (and return an out-of-bound position so subsequent call won't do it again)
        if ((pos == 0) || ((pos > 0) && (pos == input.length()) && (input[pos-1] == separator))) {
            token.clear();
            pos=input.length()+1;
            return true;
        }
        return false;
    }
    typename std::basic_string<CharT>::size_type separatorPos=input.find(separator, pos);
    if (separatorPos == std::basic_string<CharT>::npos) {
        token=input.substr(pos, input.length()-pos);
        pos=input.length();
    } else {
        token=input.substr(pos, separatorPos-pos);
        pos=separatorPos+1;
    }
    return true;
}

Then use it like this:

std::basic_string<char16_t> s;
std::basic_string<char16_t> token;
std::basic_string<char16_t>::size_type tokenPos=0;
while (tokenizestring(s, (char16_t)' ', tokenPos, token)) {
    ...
}
Constancy answered 17/11, 2021 at 10:52 Comment(0)
T
-1

It fails because str.c_str() returns a constant string, but char * strtok (char * str, const char * delimiters ) requires a modifiable (non-const) string. So you need to use const_cast< char *> in order to make it modifiable. I am giving you a complete but small program to tokenize the string using the C strtok() function.

   #include <iostream>
   #include <string>
   #include <string.h> 
   using namespace std;
   int main() {
       string s="20#6 5, 3";
       // strtok requires a modifiable string as it modifies the supplied string in order to tokenize it
       char *str=const_cast< char *>(s.c_str());    
       char *tok;
       tok=strtok(str, "#, " );     
       int arr[4], i=0;    
       while(tok!=NULL){
           arr[i++]=stoi(tok);
           tok=strtok(NULL, "#, " );
       }     
       for(int i=0; i<4; i++) cout<<arr[i]<<endl;


       return 0;
   }

NOTE: strtok may not be suitable in all situations, as the string passed to the function gets modified by being broken into smaller strings. Please refer to the following for a better understanding of strtok's functionality.

How strtok works

I added a few print statements to better understand the changes happening to the string in each call to strtok, and how it returns each token.

#include <iostream>
#include <string>
#include <string.h> 
using namespace std;
int main() {
    string s="20#6 5, 3";
    char *str=const_cast< char *>(s.c_str());    
    char *tok;
    cout<<"string: "<<s<<endl;
    tok=strtok(str, "#, " );     
    cout<<"String: "<<s<<"\tToken: "<<tok<<endl;   
    while(tok!=NULL){
        tok=strtok(NULL, "#, " );
        // Streaming a null char* is undefined behaviour, so print an empty token instead
        cout<<"String: "<<s<<"\t\tToken: "<<(tok ? tok : "")<<endl;
    }
    return 0;
}

Output:

string: 20#6 5, 3

String: 206 5, 3    Token: 20
String: 2065, 3     Token: 6
String: 2065 3      Token: 5
String: 2065 3      Token: 3
String: 2065 3      Token: 

strtok iterates over the string. On the first call it finds the first non-delimiter character (2 in this case) and marks it as the token start. It then continues scanning for a delimiter, replaces it with a null character (# gets replaced in the actual string), and returns the start pointer, which points to the first character of the token (i.e., it returns the token 20, which is now terminated by a null). On each subsequent call it starts scanning from the next character and returns a token if one is found, otherwise null. In this way it subsequently returns the tokens 6, 5 and 3.

Tannate answered 16/10, 2015 at 20:38 Comment(3)
FYI: strtok will change the value of s. You should not use const_cast, since this simply hides an issue.Disconcert
This causes undefined behaviour by using the result of c_str() to modify the stringWorked
@Worked I added more clarification on how the strtok function works. I hope it will help people understand when to use it.Tannate
L
-1

Typecasting to (char*) got it working for me!

token = strtok((char *)str.c_str(), " "); 
Lindbergh answered 25/11, 2020 at 15:5 Comment(2)
This will not work. strtok will modify the internals of str. I suppose that is a side effect the user doesn't want. The solution is to create a char buffer and first copy the str string into that buffer.Reflexion
"got it working" isn't true. It silenced the compiler, and now you have a piece of (invalid) code that every compiler will treat like valid code.Claudy
V
-1

Using std::wstring::find_first_of() and std::wstring::substr().

std::wstring can be replaced by std::string and const wchar_t by const char.

#include <iostream>
#include <string>
using namespace std;

// Splits off the first word of *ws_mystring at the first occurrence of c.
// The word is stored in *ws_word and removed from *ws_mystring.
// Returns the position of c, or wstring::npos if c was not found
// (in that case *ws_word receives the whole remaining string).
// Note: names starting with "__" are reserved, so the function is called wstok.
size_t wstok(wstring * ws_mystring, wstring * ws_word, const wchar_t c)
{
    size_t found = ws_mystring->find_first_of(c);

    if (found != wstring::npos)
    {
        *ws_word = ws_mystring->substr(0, found);
        *ws_mystring = ws_mystring->substr(found + 1);
    }
    else
    {
        *ws_word = *ws_mystring;
    }

    return found;
}

// main
int main()
{
    wstring a_wstring = L"every good boy deserves fudge";
    wstring a_word; // the string where the result is stored every time.

    while (wstok(&a_wstring, &a_word, L' ') != wstring::npos)
    {
        wcout << a_word << L"\n\n";
    }
    wcout << a_word << L"\n\n"; // last string

    return 0;
}

(output)

every

good

boy

deserves

fudge

Vanvanadate answered 18/10, 2023 at 19:5 Comment(1)
Welcome to the site! That's a lot of raw pointer arithmetic for a simple task like string tokenization. If the behavior of the C strtok function is really needed, the other answers have it covered (basically use the non-const overload of .data()). If the goal is simply tokenization, then a lot of the pointer trickery can be avoided. In general, I try to recommend modern C++ best practices, and I can't really recommend programming in this style in C++ in 2023.Nissa

© 2022 - 2024 — McMap. All rights reserved.