How do I tokenize a string in C++?
M

37

478

Java has a convenient split method:

String str = "The quick brown fox";
String[] results = str.split(" ");

Is there an easy way to do this in C++?

Methylal answered 10/9, 2008 at 12:10 Comment(2)
Possible duplicate of Split a string in C++?Rabush
A solution to exactly this question seems to be here: riptutorial.com/cplusplus/example/2148/tokenizeKurt
M
177

C++ standard library algorithms are pretty universally based around iterators rather than concrete containers. Unfortunately this makes it hard to provide a Java-like split function in the C++ standard library, even though nobody disputes that it would be convenient. But what would its return type be? std::vector<std::basic_string<…>>? Maybe, but then we’re forced to perform (potentially redundant and costly) allocations.

Instead, C++ offers a plethora of ways to split strings based on arbitrarily complex delimiters, but none of them is encapsulated as nicely as in other languages. The numerous ways fill whole blog posts.

At its simplest, you could iterate using std::string::find until you hit std::string::npos, and extract the contents using std::string::substr.
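The find/substr loop described above can be sketched as follows. This is only a minimal illustration, not a standard facility: the name split_on and its signature are hypothetical, and the sketch assumes that empty tokens (from adjacent or trailing delimiters) should be kept.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical helper (not part of the standard library): splits `text` on
// every occurrence of the substring `delim`, keeping empty tokens.
std::vector<std::string> split_on(const std::string& text, const std::string& delim) {
    assert(!delim.empty());  // an empty delimiter would never advance the search

    std::vector<std::string> tokens;
    std::string::size_type start = 0;
    std::string::size_type pos;

    // Repeatedly search for the delimiter until find reports npos.
    while ((pos = text.find(delim, start)) != std::string::npos) {
        tokens.push_back(text.substr(start, pos - start));
        start = pos + delim.size();
    }

    // Whatever follows the last delimiter (possibly empty) is the final token.
    tokens.push_back(text.substr(start));
    return tokens;
}
```

Returning a std::vector<std::string> is a convenience choice for this sketch; as the answer notes, it costs an allocation per token, which is one reason the standard library does not offer such a function.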

A more fluid (and idiomatic, but basic) version for splitting on whitespace would use a std::istringstream:

auto iss = std::istringstream{"The quick brown fox"};
auto str = std::string{};

while (iss >> str) {
    process(str);
}

Using std::istream_iterators, the contents of the string stream could also be copied into a vector using its iterator range constructor.

Multiple libraries (such as Boost.Tokenizer) offer specific tokenisers.

More advanced splitting requires regular expressions. C++ provides std::regex_token_iterator for this purpose in particular:

auto const str = "The quick brown fox"s;
auto const re = std::regex{R"(\s+)"};
auto const vec = std::vector<std::string>(
    std::sregex_token_iterator{begin(str), end(str), re, -1},
    std::sregex_token_iterator{}
);
Murry answered 10/9, 2008 at 12:18 Comment(23)
Sadly, boost is not always available for all projects. I'll have to look for a non-boost answer.Barrick
Not every project is open to "open source". I work in heavily regulated industries. It's not a problem, really. It's just a fact of life. Boost is not available everywhere.Barrick
@NonlinearIdeas The other question / answer wasn’t about Open Source projects at all. The same is true for any project. That said, I of course understand about restricted standards such as MISRA C but then it’s understood that you build everything from scratch anyway (unless you happen to find a compliant library – a rarity). Anyway, the point is hardly that “Boost is not available” – it’s that you have special requirements for which almost any general-purpose answer would be unsuitable.Murry
@NonlinearIdeas Case in point, the other, non-Boost answers are also not MISRA compliant.Murry
This discussion sparked me to ask about my specific industry of concern: #20714509.Barrick
@Barrick Programming competitions don't allow you to use libraries at all.Dug
I tried to install boost 3 times, each of these times I was discouraged by STL barf. It's 2016, can't we just replace C preprocessor with PHP or JavaScript and embed it into our C++ files so that we use real transformation functions that transform string to string and give a sane error if something goes wrong, instead of the archaic #ifdef #ifndef that don't even make sense in C/C++ since the language wants to be condensable to "a single line" but macros require their own lines. Even Perl as a preprocessor would be better than using #ifdefs/#ifndef/template, and it's non blackbox transforms.Frosting
@Dmitry What’s “STL barf”?! And the whole community is very much in favour of replacing the C preprocessor — in fact, there are proposals to do that. But your suggestion to use PHP or some other language instead would be a huge step backwards.Murry
@KonradRudolph by STL barf I mean the ~2-3 screens of expanded templates that show an error, often in code that is not in your program, which makes it difficult for people who aren't used to it to figure out what went wrong. That said I've learnt a lot since then so I guess I'll try setting boost up again, it's something that I should be able to set up.Frosting
But what would its return type be? How about returning what Java and Python both return: an array of string. Seems like common sense.Telluric
@Telluric And what is an “array of strings” in C++, pray tell? Even in Java the decision to return an array is actually problematic. In C++ it would be a complete non-starter in a generic library, since C++ has so many different array (and string) types, and a generic C++ library can’t just assume one, because it would clash with lots of client code.Murry
@KonradRudolph: An array of strings looks like this: string result[42]. You can return a pointer to a heap-allocated array of string. Or you can return a pointer to an array of pointer to string. Or you can return a std::vector<string> by value. If C++ doesn't have a way to easily perform a string split to return a sequence of string tokens, then there is an underlying problem with the language runtime. I worked with C++ for close to ten years. I hence moved to Java and Python and never looked back. I'm still amazed C++ hasn't come up a solution all this time.Telluric
@Telluric Surely you must be aware that you cannot return C arrays by value in C++. You could technically return manually managed memory but this is completely unacceptable for a general-purpose standard algorithm. It’s atrocious for code quality, and nothing else in C++ does this (except for legacy C compatibility functions). And even std::vector is not a good generic type, and for this reason no C++ standard library algorithm returns it.Murry
@Telluric And contrary to your claim it’s absolutely no problem to split strings into tokens in C++. My answer shows several ways to do this trivially. And C++ has added more ways since. But none of them returns an array of strings, for reasons explained.Murry
@KonradRudolph: And what's wrong with returning a std::vector by value out of a function?Telluric
@Telluric It locks the user in to use a specific container, which is anathema to the whole design of the C++ algorithms library. There are many situations where std::vector can’t or won’t be used.Murry
@KonradRudolph: A string splitter should return a sequence of tokens. vector returns a sequence of things. They are a match for one another.Telluric
@Telluric Again, not every code base uses std::vector, and even when a codebase uses std::vector for some parts, it’s not appropriate everywhere. std::vector isn’t the only container that models “sequence of X”, and users may not want to use it here. The algorithm can’t know. And generic algorithms in C++ are specifically designed to be container agnostic, to be, well, generic. There’s nothing wrong with implementing a string splitter to return std::vector<std::string> for specific uses, but this implementation has no place in the C++ algorithms standard library.Murry
@KonradRudolph: If someone wants the tokens in another data structure, like a map, then they can write their own code to convert the return value of the splitter to their desired data structure. The point is that the return value of a splitter has to be in some data structure, and the most logical one is vector. This is just common sense. Java's split returns an array of string; if you want the result in a map, then you need to write your own code to convert the array of string to a map.Telluric
@Telluric It fundamentally breaks C++’s design model, which is to provide zero-cost abstractions. The abstraction you are proposing would be anything but zero-cost. Frankly, at this point you’re just arguing for argument’s sake, and this discussion is completely unproductive (especially since it just rehashes what the answer already says). There’s no actual issue here: the solutions presented in my answer (and elsewhere) work.Murry
What do you propose would be a "zero-cost abstraction" for a string splitter? An iterator over strings?Telluric
I recommend you delete your answer before others read this bad advice. Please stick to PHP and HTML.Telluric
@Telluric You should read the answer, it addresses that already. Anyway, if you don’t like the way C++ works you’re indeed free to stick to PHP and HTML, I won’t stop you.Murry
T
194

The Boost tokenizer class can make this sort of thing quite simple:

#include <iostream>
#include <string>
#include <boost/foreach.hpp>
#include <boost/tokenizer.hpp>

using namespace std;
using namespace boost;

int main(int, char**)
{
    string text = "token, test   string";

    char_separator<char> sep(", ");
    tokenizer< char_separator<char> > tokens(text, sep);
    BOOST_FOREACH (const string& t, tokens) {
        cout << t << "." << endl;
    }
}

Updated for C++11:

#include <iostream>
#include <string>
#include <boost/tokenizer.hpp>

using namespace std;
using namespace boost;

int main(int, char**)
{
    string text = "token, test   string";

    char_separator<char> sep(", ");
    tokenizer<char_separator<char>> tokens(text, sep);
    for (const auto& t : tokens) {
        cout << t << "." << endl;
    }
}
Tempered answered 11/9, 2008 at 2:10 Comment(7)
Good stuff, I've recently utilized this. My Visual Studio compiler has an odd whinge until I use a whitespace to separate the two ">" characters before the tokens(text, sep) bit: (error C2947: expecting '>' to terminate template-argument-list, found '>>')Aggappora
@Aggappora yes, without the space the compiler parses it as an extraction operator rather than two closing templates.Forepleasure
Theoretically that's been fixed in C++0xDosage
beware of the third parameters of the char_separator constructor (drop_empty_tokens is the default, alternative is keep_empty_tokens).Vanegas
@DavidSouther Talking about C++0x - the BOOST_FOREACH could now be replaced by the new for loop syntax. for ( string t : tokens ) { ... }Electrotherapy
The double >> shouldn't be a problem with C++11, if so then VS2010 compilers+ have a bug and you should file it on connect.Ladysmith
@puk - It's a commonly used suffix for C++ header files. (like .h for C headers)Tempered
R
183

Here's a real simple one:

#include <vector>
#include <string>

std::vector<std::string> split(const char *str, char c = ' ')
{
    std::vector<std::string> result;

    do
    {
        const char *begin = str;

        while(*str != c && *str)
            str++;

        result.push_back(std::string(begin, str));
    } while (0 != *str++);

    return result;
}
Rissa answered 10/9, 2008 at 12:30 Comment(5)
do I need to add a prototype for this method in .h file ?Pelvis
This is not exactly the "best" answer as it still uses a string literal which is the plain C constant character array. I believe the questioner was asking if he could tokenize a C++ string which is of type "string" introduced by the latter.Asha
This needs a new answer because I strongly suspect the inclusion of regular expressions in C++11 has changed what the best answer would be.Broussard
This answer has a problem with strings whose first or last character equals the separator, e.g. the string " a" results in [" ", "a"].Skulk
@y30: I think it results in ["","a"].Marcie
T
150

Another quick way is to use getline. Something like:

std::istringstream iss(str);
std::string s;

while (std::getline(iss, s, ' ')) {
  std::cout << s << std::endl;
}

If you want, you can make a simple split() method returning a std::vector<string>, which is really useful.
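A minimal sketch of such a split() function, wrapping the getline loop shown above. The function name, the default delimiter, and the choice to return the vector by value are assumptions made for illustration.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Hypothetical wrapper around the getline loop: splits `str` on a single
// delimiter character and returns the pieces.
std::vector<std::string> split(const std::string& str, char delim = ' ') {
    std::vector<std::string> tokens;
    std::istringstream iss(str);
    std::string token;

    // getline reads up to (and discards) the next delimiter each time.
    while (std::getline(iss, token, delim)) {
        tokens.push_back(token);
    }
    return tokens;
}
```

Note that consecutive delimiters produce empty tokens, while a trailing delimiter does not produce a final empty token, because getline fails once the stream is exhausted.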

Terrance answered 28/11, 2008 at 4:17 Comment(6)
I had problems using this technique with 0x0A characters in the string which made the while loop exit prematurely. Otherwise, it's a nice simple and quick solution.Turkic
This is good, but keep in mind that by doing this the default delimiter '\n' is no longer considered. This example will work, but if you use something like while(getline(inFile,word,' ')), where inFile is an ifstream object containing multiple lines, you will get funny results.Glenglencoe
it's too bad getline returns the stream rather than the string, making it unusable in initialization lists without temporary storageArmored
Cool! No boost and C++11, good solution to the those legacy projects out there!Walkerwalkietalkie
THAT is the answer, the name of the function is just a bit awkward.Wellpreserved
If the string ends with a delimiter you won't get a final empty entry. This is because lines, by definition, end with an end-of-line character.Fleet
C
119

Use strtok. In my opinion, there isn't a need to build a class around tokenizing unless strtok doesn't provide you with what you need. It might not, but in 15+ years of writing various parsing code in C and C++, I've always used strtok. Here is an example

char myString[] = "The quick brown fox";
char *p = strtok(myString, " ");
while (p) {
    printf ("Token: %s\n", p);
    p = strtok(NULL, " ");
}

A few caveats (which might not suit your needs). The string is "destroyed" in the process, meaning that EOS characters are placed inline in the delimiter spots. Correct usage might require you to make a non-const version of the string. You can also change the list of delimiters mid parse.
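When the input is a std::string that must stay intact, one way to get the non-const version mentioned above is to tokenize a writable copy. The sketch below assumes that approach; the helper name tokenize_copy is hypothetical, and std::strtok keeps hidden static state, so it should not be called from several threads at once.

```cpp
#include <cstring>
#include <string>
#include <vector>

// Hypothetical helper: runs strtok over a private copy of `input`, so the
// caller's string is left untouched.
std::vector<std::string> tokenize_copy(const std::string& input, const char* delimiters) {
    // Writable, null-terminated buffer that strtok is allowed to overwrite.
    std::vector<char> buffer(input.begin(), input.end());
    buffer.push_back('\0');

    std::vector<std::string> tokens;
    for (char* p = std::strtok(buffer.data(), delimiters); p != nullptr;
         p = std::strtok(nullptr, delimiters)) {
        tokens.emplace_back(p);  // copy each token out before the buffer goes away
    }
    return tokens;
}
```

Because strtok treats runs of delimiter characters as a single separator, this variant never yields empty tokens.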

In my own opinion, the above code is far simpler and easier to use than writing a separate class for it. To me, this is one of those functions that the language provides and it does it well and cleanly. It's simply a "C based" solution. It's appropriate, it's easy, and you don't have to write a lot of extra code :-)

Conformity answered 10/9, 2008 at 13:37 Comment(10)
Not that I dislike C, however strtok is not thread-safe, and you need to be certain that the string you send it contains a null character to avoid a possible buffer overflow.Roxana
There is strtok_r, but this was a C++ question.Menace
@tloach: in MS C++ compiler strtok is thread safe as the internal static variable is created on the TLS (thread local storage) (actually it is compiler depended)Clatter
you can check this nibuthomas.wordpress.com/2008/06/25/…Clatter
@ahmed: thread safe means more than just being able to run the function twice in different threads. In this case if the thread is modified while strtok is running it's possible to have the string be valid during the entire run of strtok, but strtok will still mess up because the string changed, it's now already past the null character, and it's going to keep reading memory until it either gets a security violation or finds a null character. This is a problem with the original C string functions, if you don't specify a length somewhere you run into problems.Roxana
strtok requires a pointer to a non-const null-terminated char array, which is not a common creature to find in c++ code ... what's your favourite way to convert to this from a std::string?Armored
@tloach: under what circumstances would a string not contain a null character? I only know that, sometimes, '\0' is appended to strings... (Noob question, I know!)Crisscross
This is a nice solution and good in C projects. However it's not compatible with C++ string type literals or objects. Looks like C++ requires a class for everything. >_<Asha
strtok_r is thread-safe and better to use when available. Just try it to see if your compiler supports it. It probably does.Schober
@Armored I am not an advocate of using strtok in C++ code. However, when I have needed to convert legacy code using it to avoid naked pointers (and their non-RAII lack of guarantees that what they point to will be deallocated if an exception gets thrown), I have used "vector<char> foo(str.begin(), str.end()+1); char *p = strtok(foo.data(), " ");" That way, the vector owns a writable copy of the string's data and will release it if an exception is thrown. (The +1 is to ensure that the string's null terminator gets copied into the vector.)Eniwetok
P
86

You can use streams, iterators, and the copy algorithm to do this fairly directly.

#include <string>
#include <vector>
#include <iostream>
#include <istream>
#include <ostream>
#include <iterator>
#include <sstream>
#include <algorithm>

int main()
{
  std::string str = "The quick brown fox";

  // construct a stream from the string
  std::stringstream strstr(str);

  // use stream iterators to copy the stream to the vector as whitespace separated strings
  std::istream_iterator<std::string> it(strstr);
  std::istream_iterator<std::string> end;
  std::vector<std::string> results(it, end);

  // send the vector to stdout.
  std::ostream_iterator<std::string> oit(std::cout);
  std::copy(results.begin(), results.end(), oit);
}
Player answered 10/9, 2008 at 12:46 Comment(12)
I find those std:: irritating to read.. why not use "using" ?Terrance
@Vadi: because editing someone else's post is quite intrusive. @pheze: I prefer to let the std this way I know where my object comes from, that's merely a matter of style.Campbell
I understand your reason and I think it's actually a good choice if it works for you, but from a pedagogical standpoint I actually agree with pheze. It's easier to read and understand a completely foreign example like this one with a "using namespace std" at the top because it requires less effort to interpret the following lines... especially in this case because everything is from the standard library. You can make it easy to read and obvious where the objects come from by a series of "using std::string;" etc. Especially since the function is so short.Semicircle
Honestly, in function way this would be much more usable. And those std:: are simply ugly. I did an edit, but I don't expect it to show up somewhere in near future. Nevertheless, I'll try it out.Hayman
user - because someone may copy this example and we wouldn't want them to use 'using' :)Sopher
Despite the "std::" prefixes being irritating or ugly, it's best to include them in example code so that it's completely clear where these functions are coming from. If they bother you, it's trivial to replace them with a "using" after you steal the example and claim it as your own.Oleum
yep! what he said! best practices is to use the std prefix. Any large code base is no doubt going to have it's own libraries and namespaces and using "using namespace std" will give you headaches when you start causing namespace conflicts.Haaf
This is very clever but how could I split the string by, say, a comma rather than a space or is whitespace the only separator - I ask because my quick search online about how this works failed to mention how strings are split.Overcasting
Could you update your answer for std::wstrings? I seem to be too dumb to make it compile :(Philip
You don't need to #include <istream> or <ostream>.Trev
@Haaf Especially when using Boost and the standard library (<regex>) which can be boost::regex or std::regexNervine
This shows the beauty of c++. You need 8 include files, 5 different types, and best of all... nowhere in the code is it actually readable that all it's doing is splitting a string on whitespace. Without knowledge of the specific behavior of these STL classes this is completely unintelligible.Handstand
I
77

A solution using regex_token_iterators:

#include <iostream>
#include <regex>
#include <string>

using namespace std;

int main()
{
    string str("The quick brown fox");

    regex reg("\\s+");

    sregex_token_iterator iter(str.begin(), str.end(), reg, -1);
    sregex_token_iterator end;

    vector<string> vec(iter, end);

    for (auto a : vec)
    {
        cout << a << endl;
    }
}
Itinerancy answered 14/12, 2014 at 10:46 Comment(6)
This should be the top ranked answer. This is the right way to do this in C++ >= 11.Broussard
I'm glad I've scrolled all the way down to this answer (currently only had 9 upvotes). This is exactly what a C++11 code should look like for this task!Philip
Excellent answer that does not rely on external libraries and uses already available librariesSeale
Great answer, giving the most flexibility in delimiters. A few caveats: Using \s+ regex avoids empty tokens in the middle of the text, but does give an empty first token if the text starts with whitespace. Also, regex seems slow: on my laptop, for 20 MB of random text, it takes 0.6 sec, compared to 0.014 sec for strtok, strsep, or Parham's answer using str.find_first_of, or 0.027 sec for Perl, or 0.021 sec for Python. For short text, speed may not be a concern.Howze
This is awesome. Thanks for sharing.Merodach
Ok maybe it looks cool, but this is clearly overuse of regular expressions. Reasonable only if you do not care about performance.Buffon
M
50

No offense folks, but for such a simple problem, you are making things way too complicated. There are a lot of reasons to use Boost. But for something this simple, it's like hitting a fly with a 20# sledge.

void
split( vector<string> & theStringVector,  /* Altered/returned value */
       const  string  & theString,
       const  string  & theDelimiter)
{
    UASSERT( theDelimiter.size(), >, 0); // My own ASSERT macro.

    size_t  start = 0, end = 0;

    while ( end != string::npos)
    {
        end = theString.find( theDelimiter, start);

        // If at end, use length=maxLength.  Else use length=end-start.
        theStringVector.push_back( theString.substr( start,
                       (end == string::npos) ? string::npos : end - start));

        // If at end, use start=maxSize.  Else use start=end+delimiter.
        start = (   ( end > (string::npos - theDelimiter.size()) )
                  ?  string::npos  :  end + theDelimiter.size());
    }
}

For example (for Doug's case),

#define SHOW(I,X)   cout << "[" << (I) << "]\t " # X " = \"" << (X) << "\"" << endl

int
main()
{
    vector<string> v;

    split( v, "A:PEP:909:Inventory Item", ":" );

    for (unsigned int i = 0;  i < v.size();   i++)
        SHOW( i, v[i] );
}

And yes, we could have split() return a new vector rather than passing one in. It's trivial to wrap and overload. But depending on what I'm doing, I often find it better to re-use pre-existing objects rather than always creating new ones. (Just as long as I don't forget to empty the vector in between!)

Reference: http://www.cplusplus.com/reference/string/string/.

(I was originally writing a response to Doug's question: C++ Strings Modifying and Extracting based on Separators (closed). But since Martin York closed that question with a pointer over here... I'll just generalize my code.)

Microbicide answered 28/11, 2008 at 2:55 Comment(6)
Why define a macro you only use in one place. And how is your UASSERT any better than standard assert. Splitting up the comparison into 3 tokens like that does nothing other than require more commas than you'd otherwise need.Boastful
Maybe the UASSERT macro shows (in the error message) the actual relationship between (and values of) the two compared values? That's actually a pretty good idea, IMHO.Weiman
Ugh, why doesn't the std::string class include a split() function?Mixed
I think the last line in the while loop should be start = ((end > (theString.size() - theDelimiter.size())) ? string::npos : end + theDelimiter.size()); and the while loop should be while (start != string::npos). Also, I check the substring to be sure it's not empty before inserting it into the vector.Modla
@JohnK If the input has two consecutive delimiters, then clearly the string between them is empty, and should be inserted into the vector. If empty values are not acceptable for a particular purpose, that is another thing, but IMHO such constraints should be enforced outside this kind of a very general purpose functions.Sectarianize
Why not allow empty string as a delimiter too?Inquisition
P
39

Boost has a strong split function: boost::algorithm::split.

Sample program:

#include <iostream>
#include <string>
#include <vector>
#include <boost/algorithm/string.hpp>

int main() {
    auto s = "a,b, c ,,e,f,";
    std::vector<std::string> fields;
    boost::split(fields, s, boost::is_any_of(","));
    for (const auto& field : fields)
        std::cout << "\"" << field << "\"\n";
    return 0;
}

Output:

"a"
"b"
" c "
""
"e"
"f"
""
Protestation answered 12/9, 2008 at 17:20 Comment(0)
P
29

This is a simple STL-only solution (~5 lines!) using std::string::find and std::string::find_first_not_of that handles repetitions of the delimiter (like spaces or periods for instance), as well as leading and trailing delimiters:

#include <string>
#include <vector>

// Assumption: DELIMITER is defined elsewhere as a single-character
// delimiter string, e.g. #define DELIMITER " "
void tokenize(const std::string& str, std::vector<std::string> &token_v){
    size_t start = str.find_first_not_of(DELIMITER), end=start;

    while (start != std::string::npos){
        // Find next occurrence of delimiter
        end = str.find(DELIMITER, start);
        // Push back the token found into vector
        token_v.push_back(str.substr(start, end-start));
        // Skip all occurrences of the delimiter to find new start
        start = str.find_first_not_of(DELIMITER, end);
    }
}

Try it out live!
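If several different delimiter characters are needed, one option is to look for the end of each token with find_first_of instead of find. The following is only a sketch under that assumption; the name tokenize_any and its delimiters parameter are made up for illustration and are not part of the answer above.

```cpp
#include <string>
#include <vector>

// Sketch: `delimiters` is treated as a set of characters, and runs of
// delimiters (including leading and trailing ones) produce no tokens.
std::vector<std::string> tokenize_any(const std::string& str,
                                      const std::string& delimiters = " ") {
    std::vector<std::string> tokens;
    std::string::size_type start = str.find_first_not_of(delimiters);
    while (start != std::string::npos) {
        // End of the current token: the next character that is a delimiter
        const std::string::size_type end = str.find_first_of(delimiters, start);
        tokens.push_back(str.substr(start, end - start));
        // Skip the run of delimiters that follows the token
        start = str.find_first_not_of(delimiters, end);
    }
    return tokens;
}
```

Like the original, this drops empty fields, so it is not a good fit for positional data where empty fields carry meaning.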

Physiology answered 28/2, 2015 at 23:18 Comment(5)
This is a good one but I think you need to use find_first_of() instead of find() for this to work properly with multiple delimiters.Scad
@Scad multiple delimiters are skipped when finding the start position with find_first_not_of.Ovenware
I voted "Up" for Parham's solution and modified it a bit: std::vector<string> tokenize(const std::string& str, const string& delimiters) { std::vector<string> result; size_t start = str.find_first_not_of(DELIMITER), end = start; while (start != std::string::npos) { // Find next occurrence of delimiter end = str.find(DELIMITER, start); // Push back the token found into vector result.push_back(str.substr(start, end - start)); // Skip all occurrences of the delimiter to find new start start = str.find_first_not_of(DELIMITER, end); } return result; }Fencer
Replaced "DELIMITER" with "delimiters": std::vector<string> tokenize(const std::string& str, const string& delimiters) { std::vector<string> result; size_t start = str.find_first_not_of(delimiters), end = start; while (start != std::string::npos) { // Find next occurrence of delimiter end = str.find(delimiters, start); // Push back the token found into vector result.push_back(str.substr(start, end - start)); // Skip all occurrences of the delimiter to find new start start = str.find_first_not_of(delimiters, end); } return result; }Fencer
@Parham, sadly this also seems to skip empty fields, such as a,b,c,,,f,g, which returns a b c f g as the 5 members of the vector instead of including the empty strings. Indexed content suffers. :( It is pretty common in things such as NMEA GPS sentences to have multiple empty fields in their indexed, character-separated string data.Kohn
S
28

I know you asked for a C++ solution, but you might consider this helpful:

Qt

#include <QString>

...

QString str = "The quick brown fox"; 
QStringList results = str.split(" "); 

The advantage over Boost in this example is that it's a direct one to one mapping to your post's code.

See more at Qt documentation

Somali answered 4/8, 2010 at 17:34 Comment(0)
S
23

Here is a sample tokenizer class that might do what you want

//Header file
#include <string>
class Tokenizer 
{
    public:
        static const std::string DELIMITERS;
        Tokenizer(const std::string& str);
        Tokenizer(const std::string& str, const std::string& delimiters);
        bool NextToken();
        bool NextToken(const std::string& delimiters);
        const std::string GetToken() const;
        void Reset();
    protected:
        size_t m_offset;
        const std::string m_string;
        std::string m_token;
        std::string m_delimiters;
};

//CPP file
const std::string Tokenizer::DELIMITERS(" \t\n\r");

Tokenizer::Tokenizer(const std::string& s) :
    m_string(s), 
    m_offset(0), 
    m_delimiters(DELIMITERS) {}

Tokenizer::Tokenizer(const std::string& s, const std::string& delimiters) :
    m_string(s), 
    m_offset(0), 
    m_delimiters(delimiters) {}

bool Tokenizer::NextToken() 
{
    return NextToken(m_delimiters);
}

bool Tokenizer::NextToken(const std::string& delimiters) 
{
    size_t i = m_string.find_first_not_of(delimiters, m_offset);
    if (std::string::npos == i) 
    {
        m_offset = m_string.length();
        return false;
    }

    size_t j = m_string.find_first_of(delimiters, i);
    if (std::string::npos == j) 
    {
        m_token = m_string.substr(i);
        m_offset = m_string.length();
        return true;
    }

    m_token = m_string.substr(i, j - i);
    m_offset = j;
    return true;
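}

// Assumed definitions: both functions are declared in the header but were
// missing from the listing.
const std::string Tokenizer::GetToken() const
{
    return m_token;
}

void Tokenizer::Reset()
{
    m_offset = 0;
}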

Example:

std::vector <std::string> v;
Tokenizer s("split this string", " ");
while (s.NextToken())
{
    v.push_back(s.GetToken());
}
Shellans answered 10/9, 2008 at 12:18 Comment(0)
H
16

pystring is a small library which implements a bunch of Python's string functions, including the split method:

#include <string>
#include <vector>
#include "pystring.h"

std::vector<std::string> chunks;
pystring::split("this string", chunks);

// also can specify a separator
pystring::split("this-string", chunks, "-");
Helpful answered 29/12, 2011 at 15:17 Comment(2)
Wow, you have answered my immediate question and many future questions. I get that C++ is powerful. But when splitting a string results in source code like the above answers, it is plainly disheartening. I would love to know of other libraries like this that pull higher-level languages' conveniences down.Zygosis
wow, you seriously just made my day!! did not know about pystring. this is going to save me a lot of time!Fiance
M
12

If you're using C++ ranges - the full ranges-v3 library, not the limited functionality accepted into C++20 - you could do it this way:

auto results = str | ranges::views::tokenize(" ",1);

... and this is lazily-evaluated. You can alternatively set a vector to this range:

auto results = str | ranges::views::tokenize(" ",1) | ranges::to<std::vector>();

This will take O(m) space and O(n) time if str has n characters making up m words.

See also the library's own tokenization example, here.

Marcie answered 15/8, 2020 at 22:49 Comment(0)
S
11

I posted this answer for a similar question.
Don't reinvent the wheel. I've used a number of libraries, and the fastest and most flexible I have come across is the C++ String Toolkit Library.

Here is an example of how to use it, which I've posted elsewhere on Stack Overflow.

#include <iostream>
#include <vector>
#include <string>
#include <strtk.hpp>

const char *whitespace  = " \t\r\n\f";
const char *whitespace_and_punctuation  = " \t\r\n\f;,=";

int main()
{
    {   // normal parsing of a string into a vector of strings
       std::string s("Somewhere down the road");
       std::vector<std::string> result;
       if( strtk::parse( s, whitespace, result ) )
       {
           for(size_t i = 0; i < result.size(); ++i )
            std::cout << result[i] << std::endl;
       }
    }

    {  // parsing a string into a vector of floats with other separators
       // besides spaces

       std::string s("3.0, 3.14; 4.0");
       std::vector<float> values;
       if( strtk::parse( s, whitespace_and_punctuation, values ) )
       {
           for(size_t i = 0; i < values.size(); ++i )
            std::cout << values[i] << std::endl;
       }
    }

    {  // parsing a string into specific variables

       std::string s("angle = 45; radius = 9.9");
       std::string w1, w2;
       float v1, v2;
       if( strtk::parse( s, whitespace_and_punctuation, w1, v1, w2, v2) )
       {
           std::cout << "word " << w1 << ", value " << v1 << std::endl;
           std::cout << "word " << w2 << ", value " << v2 << std::endl;
       }
    }

    return 0;
}
Sinistrality answered 7/1, 2014 at 20:33 Comment(0)
B
9

Adam Pierce's answer provides a hand-spun tokenizer taking in a const char*. It's a bit more problematic to do with iterators because incrementing a string's end iterator is undefined. That said, given string str{ "The quick brown fox" } we can certainly accomplish this:

auto start = find(cbegin(str), cend(str), ' ');
vector<string> tokens{ string(cbegin(str), start) };

while (start != cend(str)) {
    const auto finish = find(++start, cend(str), ' ');

    tokens.push_back(string(start, finish));
    start = finish;
}

Live Example


If you're looking to abstract complexity by using standard functionality, strtok is a simple option, as On Freund suggests:

vector<string> tokens;

for (auto i = strtok(data(str), " "); i != nullptr; i = strtok(nullptr, " ")) tokens.push_back(i);

If you don't have access to C++17 you'll need to substitute data(str) as in this example: http://ideone.com/8kAGoa

Though not demonstrated in the example, strtok need not use the same delimiter for each token. Along with this advantage though, there are several drawbacks:

  1. strtok cannot be used on multiple strings at the same time: Either a nullptr must be passed to continue tokenizing the current string or a new char* to tokenize must be passed (there are some non-standard implementations which do support this however, such as: strtok_s)
  2. For the same reason strtok cannot be used on multiple threads simultaneously (this may however be implementation defined, for example: Visual Studio's implementation is thread safe)
  3. Calling strtok modifies the string it is operating on, so it cannot be used on const strings, const char*s, or literal strings. To tokenize any of these with strtok, or to operate on a string whose contents need to be preserved, str would have to be copied, and the copy could then be operated on
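The third drawback can be worked around by tokenizing a private copy. The sketch below assumes that is acceptable for your use case; strtok_copy is a hypothetical helper name, not a standard function. It still relies on strtok's hidden global state, so the first two drawbacks remain.

```cpp
#include <cstring>
#include <string>
#include <vector>

// Hypothetical helper: runs strtok on a copy so the caller's string
// (which may be const) is left untouched.
std::vector<std::string> strtok_copy(const std::string& str, const char* delims) {
    // strtok needs a writable, null-terminated buffer
    std::vector<char> buffer(str.begin(), str.end());
    buffer.push_back('\0');

    std::vector<std::string> tokens;
    for (char* tok = std::strtok(buffer.data(), delims); tok != nullptr;
         tok = std::strtok(nullptr, delims)) {
        tokens.emplace_back(tok);  // copy each token out of the scratch buffer
    }
    return tokens;
}
```

The extra copy costs one allocation per call, which is usually small next to the per-token string allocations.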

C++20 provides us with split_view to tokenize strings in a non-destructive manner: https://topanswers.xyz/cplusplus?q=749#a874


The previous methods cannot generate a tokenized vector in-place, meaning without abstracting them into a helper function they cannot initialize const vector<string> tokens. That functionality and the ability to accept any white-space delimiter can be harnessed using an istream_iterator. For example given: const string str{ "The quick \tbrown \nfox" } we can do this:

istringstream is{ str };
const vector<string> tokens{ istream_iterator<string>(is), istream_iterator<string>() };

Live Example

The required construction of an istringstream for this option has far greater cost than the previous 2 options, however this cost is typically hidden in the expense of string allocation.


If none of the above options are flexible enough for your tokenization needs, the most flexible option is using a regex_token_iterator. Of course, with this flexibility comes greater expense, but again this is likely hidden in the string allocation cost. Say, for example, we want to tokenize based on non-escaped commas, also eating white-space. Given the following input: const string str{ "The ,qu\\,ick ,\tbrown, fox" } we can do this:

const regex re{ "\\s*((?:[^\\\\,]|\\\\.)*?)\\s*(?:,|$)" };
const vector<string> tokens{ sregex_token_iterator(cbegin(str), cend(str), re, 1), sregex_token_iterator() };

Live Example

Byzantine answered 26/7, 2016 at 16:51 Comment(4)
strtok_s is C11 standard, by the way. strtok_r is a POSIX2001 standard. Between both of those, there's a standard re-entrant version of strtok for most platforms.Nadabus
@AndonM.Coleman But this is a c++ question, and in C++ #include <cstring> only includes the c99 version of strtok. So my assumption is that you're just providing this comment as supporting material, demonstrating the implementation specific availability of strtok extensions?Byzantine
Merely that it's not as non-standard as people might otherwise believe. strtok_s is provided by both C11 and as a standalone extension in Microsoft's C runtime. There's a curious bit of history here where Microsoft's _s functions became the C standard.Nadabus
@AndonM.Coleman Right, I'm with you. Obviously if it's in the C11 standard the interface and implementation have constraints placed upon them which require identical behavior independent of platform. Now the only problem is ensuring that the C11 function is available to us across platforms. Hopefully the C11 standard will be something that C++17 or C++20 chooses to pickup.Byzantine
A
7

Check this example. It might help you..

#include <iostream>
#include <sstream>
#include <string>

using namespace std;

int main ()
{
    string tmps;
    istringstream is ("the delimiter is the space");
    // Test the extraction itself, so a failed read is never printed
    while (is >> tmps) {
        cout << tmps << "\n";
    }
    return 0;
}
Amplexicaul answered 20/12, 2010 at 12:25 Comment(1)
I would do while ( is >> tmps ) { std::cout << tmps << "\n"; }Heroic
D
6

MFC/ATL has a very nice tokenizer. From MSDN:

CAtlString str( "%First Second#Third" );
CAtlString resToken;
int curPos= 0;

resToken= str.Tokenize("% #",curPos);
while (resToken != "")
{
   printf("Resulting token: %s\n", resToken);
   resToken= str.Tokenize("% #",curPos);
};

Output

Resulting token: First
Resulting token: Second
Resulting token: Third
Dread answered 22/3, 2009 at 2:28 Comment(1)
This Tokenize() function will skip empty tokens, for example, if there is substring "%%" in main string, there is no empty token returned. It is skipped.Antidromic
D
4

If you're willing to use C, you can use the strtok function. You should pay attention to multi-threading issues when using it.

Dialectal answered 10/9, 2008 at 12:23 Comment(3)
Note that strtok modifies the string you're checking, so you can't use it on const char * strings without making a copy.Teddman
The multithreading issue is that strtok uses a global variable to keep track of where it is, so if you have two threads that each use strtok, you'll get undefined behavior.Soapbox
@Soapbox Or just use strtok_s which is basically strtok with explicit state passing.Nitrosyl
C
4

For simple stuff I just use the following:

unsigned TokenizeString(const std::string& i_source,
                        const std::string& i_seperators,
                        bool i_discard_empty_tokens,
                        std::vector<std::string>& o_tokens)
{
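    // std::size_t rather than unsigned: a 32-bit unsigned never compares
    // equal to std::string::npos on 64-bit platforms
    std::size_t prev_pos = 0;
    std::size_t pos = 0;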
    unsigned number_of_tokens = 0;
    o_tokens.clear();
    pos = i_source.find_first_of(i_seperators, pos);
    while (pos != std::string::npos)
    {
        std::string token = i_source.substr(prev_pos, pos - prev_pos);
        if (!i_discard_empty_tokens || token != "")
        {
            o_tokens.push_back(token);
            number_of_tokens++;
        }

        pos++;
        prev_pos = pos;
        pos = i_source.find_first_of(i_seperators, pos);
    }

    if (prev_pos < i_source.length())
    {
        o_tokens.push_back(i_source.substr(prev_pos));
        number_of_tokens++;
    }

    return number_of_tokens;
}

Cowardly disclaimer: I write real-time data processing software where the data comes in through binary files, sockets, or some API call (I/O cards, cameras). I never use this function for something more complicated or time-critical than reading external configuration files on startup.

Constance answered 15/9, 2008 at 15:28 Comment(0)
A
4

You can simply use a regular expression library and solve that using regular expressions.

Use expression (\w+) and the variable in \1 (or $1 depending on the library implementation of regular expressions).
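With the C++11 <regex> library, that idea could look roughly like the sketch below. The function name regex_words is made up for illustration; std::sregex_iterator walks over every match of (\w+), and str(1) reads the first capture group.

```cpp
#include <regex>
#include <string>
#include <vector>

// Sketch: collect every run of word characters, i.e. every match of (\w+).
std::vector<std::string> regex_words(const std::string& text) {
    static const std::regex word_re(R"((\w+))");
    std::vector<std::string> words;
    for (std::sregex_iterator it(text.begin(), text.end(), word_re), last;
         it != last; ++it) {
        words.push_back(it->str(1));  // capture group 1, the \1 of the answer
    }
    return words;
}
```

Anything that is not a word character (spaces, punctuation) acts as a separator here, which may or may not be what you want.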

Amianthus answered 22/4, 2011 at 0:14 Comment(2)
+1 for suggesting regex, if you don't need warp speed it is the most flexible solution, not yet supported everywhere but as time goes by that will become less important.Stereoscopic
+1 from me, just tried <regex> in c++11. So simple and elegantOlympia
I
4

Many overly complicated suggestions here. Try this simple std::string solution:

using namespace std;

string someText = ...

string::size_type tokenOff = 0, sepOff = tokenOff;
while (sepOff != string::npos)
{
    sepOff = someText.find(' ', sepOff);
    string::size_type tokenLen = (sepOff == string::npos) ? sepOff : sepOff++ - tokenOff;
    string token = someText.substr(tokenOff, tokenLen);
    if (!token.empty())
        /* do something with token */;
    tokenOff = sepOff;
}
Institute answered 1/8, 2012 at 5:50 Comment(0)
S
3

I thought that was what the >> operator on string streams was for:

string word; sin >> word;
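For a single non-whitespace delimiter character, std::getline on the same kind of string stream is a common alternative. This is a minimal sketch; split_getline is a hypothetical name.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Sketch: split on one delimiter character using std::getline.
std::vector<std::string> split_getline(const std::string& text, char delim) {
    std::vector<std::string> fields;
    std::istringstream in(text);
    std::string field;
    while (std::getline(in, field, delim)) {
        fields.push_back(field);  // empty fields between delimiters are kept
    }
    return fields;
}
```

Note that empty fields in the middle are preserved, but a delimiter at the very end of the input does not produce a trailing empty field.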
Slambang answered 10/9, 2008 at 12:43 Comment(1)
My fault for giving a bad (too simple) example. A far as I know, that only works when your delimiter is whitespace.Methylal
E
3

I know this question is already answered but I want to contribute. Maybe my solution is a bit simple but this is what I came up with:

vector<string> get_words(string const& text, string const& separator)
{
    vector<string> result;
    string tmp = text;

    size_t first_pos = 0;
    size_t second_pos = tmp.find(separator);

    while (second_pos != string::npos)
    {
        if (first_pos != second_pos)
        {
            string word = tmp.substr(first_pos, second_pos - first_pos);
            result.push_back(word);
        }
        tmp = tmp.substr(second_pos + separator.length());
        second_pos = tmp.find(separator);
    }

    result.push_back(tmp);

    return result;
}

Please comment if there is a better approach to something in my code or if something is wrong.

UPDATE: added generic separator

Engaging answered 9/5, 2018 at 7:12 Comment(2)
Used your solution from the crowd :) Can I modify your code to add any separator?Glorification
@Glorification glad you liked it and ofc you can modify it... just add an bolded update section to my answer...Engaging
A
2

Here's an approach that allows you control over whether empty tokens are included (like strsep) or excluded (like strtok).

#include <string.h> // for strchr and strlen
#include <string>
#include <vector>

/*
 * want_empty_tokens==true  : include empty tokens, like strsep()
 * want_empty_tokens==false : exclude empty tokens, like strtok()
 */
std::vector<std::string> tokenize(const char* src,
                                  char delim,
                                  bool want_empty_tokens)
{
  std::vector<std::string> tokens;

  if (src and *src != '\0') // defensive
    while( true )  {
      const char* d = strchr(src, delim);
      size_t len = (d)? d-src : strlen(src);

      if (len or want_empty_tokens)
        tokens.push_back( std::string(src, len) ); // capture token

      if (d) src += len+1; else break;
    }

  return tokens;
}
Amarelle answered 26/10, 2012 at 15:14 Comment(0)
S
2

Seems odd to me that with all us speed conscious nerds here on SO no one has presented a version that uses a compile time generated look up table for the delimiter (example implementation further down). Using a look up table and iterators should beat std::regex in efficiency. If you don't need to beat regex, just use it; it's standard as of C++11 and super flexible.

Some have suggested regex already but for the noobs here is a packaged example that should do exactly what the OP expects:

std::vector<std::string> split(std::string::const_iterator it, std::string::const_iterator end, std::regex e = std::regex{"\\w+"}){
    std::smatch m{};
    std::vector<std::string> ret{};
    while (std::regex_search (it,end,m,e)) {
        ret.emplace_back(m.str());              
        std::advance(it, m.position() + m.length()); //next start position = match position + match length
    }
    return ret;
}
std::vector<std::string> split(const std::string &s, std::regex e = std::regex{"\\w+"}){  //comfort version calls flexible version
    return split(s.cbegin(), s.cend(), std::move(e));
}
int main ()
{
    std::string str {"Some people, excluding those present, have been compile time constants - since puberty."};
    auto v = split(str);
    for(const auto&s:v){
        std::cout << s << std::endl;
    }
    std::cout << "crazy version:" << std::endl;
    v = split(str, std::regex{"[^e]+"});  //using e as delim shows flexibility
    for(const auto&s:v){
        std::cout << s << std::endl;
    }
    return 0;
}

If we need to be faster and accept the constraint that all chars must be 8 bits we can make a look up table at compile time using metaprogramming:

template<bool...> struct BoolSequence{};        //just here to hold bools
template<char...> struct CharSequence{};        //just here to hold chars
template<typename T, char C> struct Contains;   //generic
template<char First, char... Cs, char Match>    //not first specialization
struct Contains<CharSequence<First, Cs...>,Match> :
    Contains<CharSequence<Cs...>, Match>{};     //strip first and increase index
template<char First, char... Cs>                //is first specialization
struct Contains<CharSequence<First, Cs...>,First>: std::true_type {}; 
template<char Match>                            //not found specialization
struct Contains<CharSequence<>,Match>: std::false_type{};

template<int I, typename T, typename U> 
struct MakeSequence;                            //generic
template<int I, bool... Bs, typename U> 
struct MakeSequence<I,BoolSequence<Bs...>, U>:  //not last
    MakeSequence<I-1, BoolSequence<Contains<U,I-1>::value,Bs...>, U>{};
template<bool... Bs, typename U> 
struct MakeSequence<0,BoolSequence<Bs...>,U>{   //last  
    using Type = BoolSequence<Bs...>;
};
template<typename T> struct BoolASCIITable;
template<bool... Bs> struct BoolASCIITable<BoolSequence<Bs...>>{
    /* could be made constexpr but not yet supported by MSVC */
    static bool isDelim(const char c){
        static const bool table[256] = {Bs...};
        // index via unsigned char: plain char may be signed and go negative
        return table[static_cast<unsigned char>(c)];
    }   
};
using Delims = CharSequence<'.',',',' ',':','\n'>;  //list your custom delimiters here
using Table = BoolASCIITable<typename MakeSequence<256,BoolSequence<>,Delims>::Type>;

With that in place making a getNextToken function is easy:

template<typename T_It>
std::pair<T_It,T_It> getNextToken(T_It begin,T_It end){
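    // Table has no operator(), so wrap its static isDelim in lambdas
    begin = std::find_if(begin,end,[](char c){ return !Table::isDelim(c); }); //find first non delim or end
    auto second = std::find_if(begin,end,[](char c){ return Table::isDelim(c); }); //find first delim or end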
    return std::make_pair(begin,second);
}

Using it is also easy:

int main() {
    std::string s{"Some people, excluding those present, have been compile time constants - since puberty."};
    auto it = std::begin(s);
    auto end = std::end(s);
    while(it != std::end(s)){
        auto token = getNextToken(it,end);
        std::cout << std::string(token.first,token.second) << std::endl;
        it = token.second;
    }
    return 0;
}

Here is a live example: http://ideone.com/GKtkLQ
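If the table does not have to be built at compile time, the same lookup idea can be sketched with far less machinery. The names DelimTable and split_with_table below are made up for illustration, and the code assumes 8-bit chars just like the version above.

```cpp
#include <algorithm>
#include <array>
#include <string>
#include <vector>

// Hypothetical helper: a 256-entry lookup table filled once at run time.
class DelimTable {
public:
    explicit DelimTable(const std::string& delims) : table_{} {
        for (unsigned char c : delims) table_[c] = true;
    }
    bool operator()(char c) const {
        return table_[static_cast<unsigned char>(c)];
    }
private:
    std::array<bool, 256> table_;
};

std::vector<std::string> split_with_table(const std::string& s,
                                          const DelimTable& is_delim) {
    std::vector<std::string> tokens;
    auto it = s.begin();
    while (true) {
        it = std::find_if_not(it, s.end(), is_delim);     // skip delimiters
        if (it == s.end()) break;
        auto stop = std::find_if(it, s.end(), is_delim);  // end of this token
        tokens.emplace_back(it, stop);
        it = stop;
    }
    return tokens;
}
```

Building the table costs one pass over the delimiter list, after which each character test is a single array lookup.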

Stereoscopic answered 26/7, 2014 at 13:15 Comment(2)
Is it possible to tokenize with a string delimiter?Chantalchantalle
This version is only optimized for single-character delimiters. A look up table is not suited to multi-character (string) delimiters, so it's harder to beat regex in efficiency.Stereoscopic
U
1

You can take advantage of boost::make_find_iterator. Something similar to this:

template<typename CH>
inline vector< basic_string<CH> > tokenize(
    const basic_string<CH> &Input,
    const basic_string<CH> &Delimiter,
    bool remove_empty_token
    ) {

    typedef typename basic_string<CH>::const_iterator string_iterator_t;
    typedef boost::find_iterator< string_iterator_t > string_find_iterator_t;

    vector< basic_string<CH> > Result;
    string_iterator_t it = Input.begin();
    string_iterator_t it_end = Input.end();
    for(string_find_iterator_t i = boost::make_find_iterator(Input, boost::first_finder(Delimiter, boost::is_equal()));
        i != string_find_iterator_t();
        ++i) {
        if(remove_empty_token){
            if(it != i->begin())
                Result.push_back(basic_string<CH>(it,i->begin()));
        }
        else
            Result.push_back(basic_string<CH>(it,i->begin()));
        it = i->end();
    }
    if(it != it_end)
        Result.push_back(basic_string<CH>(it,it_end));

    return Result;
}
Unconsidered answered 3/8, 2011 at 6:58 Comment(0)
B
1

Here's my Swiss® Army Knife of string-tokenizers for splitting up strings by whitespace, accounting for single and double-quote wrapped strings as well as stripping those characters from the results. I used RegexBuddy 4.x to generate most of the code-snippet, but I added custom handling for stripping quotes and a few other things.

#include <string>
#include <vector>
#include <locale>
#include <regex>

std::vector<std::wstring> tokenize_string(std::wstring string_to_tokenize) {
    std::vector<std::wstring> tokens;

    std::wregex re(LR"(("[^"]*"|'[^']*'|[^"' ]+))", std::regex_constants::collate);

    std::wsregex_iterator next( string_to_tokenize.begin(),
                                string_to_tokenize.end(),
                                re,
                                std::regex_constants::match_not_null );

    std::wsregex_iterator end;
    const wchar_t single_quote = L'\'';
    const wchar_t double_quote = L'\"';
    while ( next != end ) {
        std::wsmatch match = *next;
        const std::wstring token = match.str( 0 );
        next++;

        if (token.length() > 2 && (token.front() == double_quote || token.front() == single_quote))
            tokens.emplace_back( std::wstring(token.begin()+1, token.begin()+token.length()-1) );
        else
            tokens.emplace_back(token);
    }
    return tokens;
}
Bermuda answered 21/9, 2018 at 3:28 Comment(3)
(Down)votes can be just as constructive as upvotes, but not when you don't leave comments as to why...Bermuda
I evened you out but it might be because the code looks pretty daunting to the programmer googling 'how to split a string' especially without documentationYi
Thanks @mattshu! Is it the regex segments that make it daunting or something else?Bermuda
U
1

I wrote a simplified (and maybe slightly more efficient) version of https://mcmap.net/q/40729/-how-do-i-tokenize-a-string-in-c for my own use. I hope it helps.

void StrTokenizer(string& source, const char* delimiter, vector<string>& Tokens)
{   
   size_t new_index = 0;
   size_t old_index = 0;

   while (new_index != std::string::npos)   
   {
      new_index = source.find(delimiter, old_index);
      Tokens.emplace_back(source.substr(old_index, new_index-old_index));

      if (new_index != std::string::npos)
          old_index = ++new_index;
   }
}
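The function above steps past the delimiter one character at a time, which suits single-character delimiters. For a multi-character delimiter, a variant could advance by the delimiter's length instead. This is only a sketch; split_on is a hypothetical name, and returning the whole input for an empty delimiter is an assumption made here to avoid an endless loop.

```cpp
#include <string>
#include <vector>

// Sketch: split on a (possibly multi-character) delimiter, keeping empty fields.
std::vector<std::string> split_on(const std::string& source,
                                  const std::string& delimiter) {
    std::vector<std::string> tokens;
    if (delimiter.empty()) {  // assumption: nothing to split on
        tokens.push_back(source);
        return tokens;
    }
    std::string::size_type old_index = 0;
    std::string::size_type new_index;
    while ((new_index = source.find(delimiter, old_index)) != std::string::npos) {
        tokens.push_back(source.substr(old_index, new_index - old_index));
        old_index = new_index + delimiter.size();  // step over the whole delimiter
    }
    tokens.push_back(source.substr(old_index));    // remainder after the last delimiter
    return tokens;
}
```

As with the original, empty fields are kept, so positional data such as comma-separated sentences keeps its indexing.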
Ulibarri answered 21/3, 2022 at 21:12 Comment(1)
I was just about to add the same code as another answer! This keeps the empty fields in delimited text, which is VERY useful for pre-indexed character data such as NMEA strings from a GPS receiver.Kohn
D
1

I just read all the answers and couldn't find a solution that meets the following preconditions:

  1. no dynamic memory allocations
  2. no use of boost
  3. no use of regex
  4. c++17 standard only

So here is my solution

#include <iomanip>
#include <iostream>
#include <iterator>
#include <string_view>
#include <utility>

struct split_by_spaces
{
    std::string_view      text;
    static constexpr char delim = ' ';

    struct iterator
    {
        const std::string_view& text;
        std::size_t             cur_pos;
        std::size_t             end_pos;

        std::string_view operator*() const
        {
            return { &text[cur_pos], end_pos - cur_pos };
        }
        bool operator==(const iterator& other) const
        {
            return cur_pos == other.cur_pos && end_pos == other.end_pos;
        }
        bool operator!=(const iterator& other) const
        {
            return !(*this == other);
        }
        iterator& operator++()
        {
            cur_pos = text.find_first_not_of(delim, end_pos);

            if (cur_pos == std::string_view::npos)
            {
                cur_pos = text.size();
                end_pos = cur_pos;
                return *this;
            }

            end_pos = text.find(delim, cur_pos);

            if (end_pos == std::string_view::npos)
            {
                end_pos = text.size();
            }

            return *this;
        }
    };

    [[nodiscard]] iterator begin() const
    {
        auto start = text.find_first_not_of(delim);
        if (start == std::string_view::npos)
        {
            return iterator{ text, text.size(), text.size() };
        }
        auto end_word = text.find(delim, start);
        if (end_word == std::string_view::npos)
        {
            end_word = text.size();
        }
        return iterator{ text, start, end_word };
    }
    [[nodiscard]] iterator end() const
    {
        return iterator{ text, text.size(), text.size() };
    }
};

int main(int argc, char** argv)
{
    using namespace std::literals;
    auto str = " there should be no memory allocation during parsing"
               "  into words this line and you   should'n create any"
               "  contaner                  for intermediate words  "sv;

    auto comma = "";
    for (std::string_view word : split_by_spaces{ str })
    {
        std::cout << std::exchange(comma, ",") << std::quoted(word);
    }

    auto only_spaces = "                   "sv;
    for (std::string_view word : split_by_spaces{ only_spaces })
    {
        std::cout << "you will not see this line in output" << std::endl;
    }
}
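When a full iterator type is more than you need, the same allocation-free idea can be sketched as a function that hands each word to a callback. This assumes C++17 for std::string_view; for_each_word and its parameters are hypothetical names.

```cpp
#include <cstddef>
#include <string_view>

// Sketch: call on_word for every delimiter-separated word in text.
// Only views into the original text are passed on; nothing is copied.
template <typename Callback>
void for_each_word(std::string_view text, Callback&& on_word, char delim = ' ') {
    std::size_t pos = 0;
    while ((pos = text.find_first_not_of(delim, pos)) != std::string_view::npos) {
        std::size_t end = text.find(delim, pos);
        if (end == std::string_view::npos) end = text.size();
        on_word(text.substr(pos, end - pos));
        pos = end;  // continue scanning after this word
    }
}
```

The views passed to the callback are only valid while the underlying character data is alive, so they should not be stored beyond that.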
Dibbuk answered 7/11, 2022 at 13:58 Comment(1)
in operator ++, the second ``` if (cur_pos == std::string_view::npos)``` should be if (end_pos == ...)Sudatory
O
0

If the maximum length of the input string to be tokenized is known, one can exploit this and implement a very fast version. I am sketching the basic idea below, which was inspired by both strtok() and the "suffix array" data structure described in Jon Bentley's "Programming Pearls" 2nd edition, chapter 15. The C++ class in this case only gives some organization and convenience of use. The implementation shown can be easily extended for removing leading and trailing whitespace characters in the tokens.

Basically one can replace the separator characters with string-terminating '\0'-characters and set pointers to the tokens within the modified string. In the extreme case when the string consists only of separators, one gets string-length plus 1 resulting empty tokens. It is practical to duplicate the string to be modified.

Header file:

#include <assert.h>  // assert
#include <stddef.h>  // size_t
#include <string.h>  // memset, strncpy

class TextLineSplitter
{
public:

    TextLineSplitter( const size_t max_line_len );

    ~TextLineSplitter();

    void            SplitLine( const char *line,
                               const char sep_char = ','
                             );

    inline size_t   NumTokens( void ) const
    {
        return mNumTokens;
    }

    const char *    GetToken( const size_t token_idx ) const
    {
        assert( token_idx < mNumTokens );
        return mTokens[ token_idx ];
    }

private:
    const size_t    mStorageSize;

    char           *mBuff;
    char          **mTokens;
    size_t          mNumTokens;

    inline void     ResetContent( void )
    {
        memset( mBuff, 0, mStorageSize );
        // mark all items as empty:
        memset( mTokens, 0, mStorageSize * sizeof( char* ) );
        // reset counter for found items:
        mNumTokens = 0L;
    }
};

Implementation file:

TextLineSplitter::TextLineSplitter( const size_t max_line_len ):
    mStorageSize ( max_line_len + 1L )
{
    // allocate memory
    mBuff   = new char  [ mStorageSize ];
    mTokens = new char* [ mStorageSize ];

    ResetContent();
}

TextLineSplitter::~TextLineSplitter()
{
    delete [] mBuff;
    delete [] mTokens;
}


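void TextLineSplitter::SplitLine( const char *line,
                                  const char sep_char   /* = ',' */
                                )
{
    assert( sep_char != '\0' );

    ResetContent();
    // The class has no mMaxLineLen member; copy at most mStorageSize - 1
    // characters so the zeroed buffer stays null-terminated
    strncpy( mBuff, line, mStorageSize - 1 );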

    size_t idx       = 0L; // running index for characters

    do
    {
        assert( idx < mStorageSize );

        const char chr = line[ idx ]; // retrieve current character

        if( mTokens[ mNumTokens ] == NULL )
        {
            mTokens[ mNumTokens ] = &mBuff[ idx ];
        } // if

        if( chr == sep_char || chr == '\0' )
        { // item or line finished
            // overwrite separator with a 0-terminating character:
            mBuff[ idx ] = '\0';
            // count-up items:
            mNumTokens ++;
        } // if

    } while( line[ idx++ ] );
}

A scenario of usage would be:

// create an instance capable of splitting strings up to 1000 chars long:
TextLineSplitter spl( 1000 );
spl.SplitLine( "Item1,,Item2,Item3" );
for( size_t i = 0; i < spl.NumTokens(); i++ )
{
    printf( "%s\n", spl.GetToken( i ) );
}

output:

Item1

Item2
Item3
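
For comparison, here is a minimal sketch of the same technique (copy the line, overwrite each separator with '\0', remember where each token starts) with the buffers held in std::vector, so no manual new/delete is needed. The class name LineTokens and its interface are hypothetical and not part of the class above; it also assumes the whole line should be kept, so there is no maximum length.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical RAII variant of the buffer-based splitter (illustration only).
class LineTokens
{
public:
    explicit LineTokens(const std::string& line, char sep = ',')
        : buff_(line.begin(), line.end())
    {
        assert(sep != '\0');
        buff_.push_back('\0'); // terminator that closes the last token

        std::size_t token_start = 0;
        for (std::size_t i = 0; i < buff_.size(); ++i) {
            // a separator or a '\0' (including an embedded one) ends a token
            if (buff_[i] == sep || buff_[i] == '\0') {
                buff_[i] = '\0';                 // terminate the token in place
                offsets_.push_back(token_start); // remember where it began
                token_start = i + 1;
            }
        }
    }

    std::size_t size() const { return offsets_.size(); }

    const char* operator[](std::size_t i) const
    {
        assert(i < offsets_.size());
        return &buff_[offsets_[i]];
    }

private:
    std::vector<char>        buff_;    // private copy of the line, split in place
    std::vector<std::size_t> offsets_; // start offset of each token in buff_
};
```

Constructed as LineTokens t("Item1,,Item2,Item3"), t.size() is 4 and t[1] is the empty string, matching the output above. Storing offsets instead of raw pointers keeps the object safe to copy.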
Olga answered 15/5, 2011 at 20:47 Comment(0)
O
0

boost::tokenizer is your friend, but consider making your code portable with reference to internationalization (i18n) issues by using wstring/wchar_t instead of the legacy string/char types.

#include <iostream>
#include <boost/tokenizer.hpp>
#include <string>

using namespace std;
using namespace boost;

typedef tokenizer<char_separator<wchar_t>,
                  wstring::const_iterator, wstring> Tok;

int main()
{
  wstring s;
  while (getline(wcin, s)) {
    char_separator<wchar_t> sep(L" "); // list of separator characters
    Tok tok(s, sep);
    for (Tok::iterator beg = tok.begin(); beg != tok.end(); ++beg) {
      wcout << *beg << L"\t"; // output (or store in vector)
    }
    wcout << L"\n";
  }
  return 0;
}
Ondrej answered 16/7, 2012 at 1:14 Comment(2)
"legacy" is definitely not correct and wchar_t is a horrible implementation dependent type that nobody should use unless absolutely necessary.Hamner
Use of wchar_t doesn't somehow automatically solve any i18n issues. You use encodings to solve that problem. If you're splitting a string by a delimiter, it is implied that the delimiter doesn't collide with the encoded contents of any token inside the string. Escaping may be needed, etc. wchar_t isn't a magical solution to this.Lala
O
0

Simple C++ code (standard C++98), accepts multiple delimiters (specified in a std::string), uses only vectors, strings and iterators.

#include <iostream>
#include <vector>
#include <string>
#include <stdexcept> 

std::vector<std::string> 
split(const std::string& str, const std::string& delim){
    std::vector<std::string> result;
    if (str.empty())
        throw std::runtime_error("Can not tokenize an empty string!");
    std::string::const_iterator begin, str_it;
    begin = str_it = str.begin(); 
    do {
        // test for the end first, so that str.end() is never dereferenced
        while (str_it != str.end() && delim.find(*str_it) == std::string::npos)
            str_it++; // advance to the next delimiter (or the end of str)
        std::string token = std::string(begin, str_it); // grab the token
        if (!token.empty()) // empty token only when str starts with a delimiter
            result.push_back(token); // push the token into a vector<string>
        while (str_it != str.end() && delim.find(*str_it) != std::string::npos)
            str_it++; // ignore the additional consecutive delimiters
        begin = str_it; // process the remaining tokens
    } while (str_it != str.end());
    return result;
}

int main() {
    std::string test_string = ".this is.a.../.simple;;test;;;END";
    std::string delim = "; ./"; // string containing the delimiters
    std::vector<std::string> tokens = split(test_string, delim);           
    for (std::vector<std::string>::const_iterator it = tokens.begin(); 
        it != tokens.end(); it++)
            std::cout << *it << std::endl;
}
Ostap answered 15/12, 2013 at 1:9 Comment(0)
C
0
/// split a string into multiple sub strings, based on a separator string
/// for example, if separator="::",
///
/// s = "abc" -> "abc"
///
/// s = "abc::def xy::st:" -> "abc", "def xy" and "st:",
///
/// s = "::abc::" -> "abc"
///
/// s = "::" -> NO sub strings found
///
/// s = "" -> NO sub strings found
///
/// then append the sub-strings to the end of the vector v.
/// 
/// the idea comes from the findUrls() function of "Accelerated C++", chapt7,
/// findurls.cpp
///
void split(const string& s, const string& sep, vector<string>& v)
{
    typedef string::const_iterator iter;
    iter b = s.begin(), e = s.end(), i;
    iter sep_b = sep.begin(), sep_e = sep.end();

    // search through s
    while (b != e){
        i = search(b, e, sep_b, sep_e);

        // no more separator found
        if (i == e){
            // it's not an empty string
            if (b != e)
                v.push_back(string(b, e));
            break;
        }
        else if (i == b){
            // the separator is found and right at the beginning
            // in this case, we need to move on and search for the
            // next separator
            b = i + sep.length();
        }
        else{
            // found the separator
            v.push_back(string(b, i));
            b = i;
        }
    }
}

Boost is a good library, but it is not always available. Doing this sort of thing by hand is also a good brain exercise. The code above uses only the std::search() algorithm from the STL.
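
To make the documented cases easy to check, here is a compact restatement of the same std::search approach that returns the vector instead of appending to one. The name split_on is hypothetical. Unlike split() above, this sketch asserts that the separator is non-empty, because an empty separator matches at every position and the search would never advance.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Hypothetical compact variant: split s on the separator string sep,
// dropping empty pieces (same results as the cases listed in the comment above).
std::vector<std::string> split_on(const std::string& s, const std::string& sep)
{
    assert(!sep.empty()); // assumption: an empty separator is a caller error

    std::vector<std::string> out;
    std::string::const_iterator b = s.begin();
    const std::string::const_iterator e = s.end();

    while (b != e) {
        // locate the next occurrence of sep in [b, e); returns e if there is none
        const std::string::const_iterator i =
            std::search(b, e, sep.begin(), sep.end());

        if (i != b)
            out.push_back(std::string(b, i)); // non-empty piece before sep, or the tail

        if (i == e)
            break; // no more separators

        b = i + sep.size(); // step over the separator
    }
    return out;
}
```

With sep = "::", the input "abc::def xy::st:" yields "abc", "def xy" and "st:", while "::" and "" both yield an empty vector.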

Catnip answered 25/2, 2014 at 6:40 Comment(0)
M
0

I've been searching for a way to split a string by a separator of any length, so I started writing it from scratch, as existing solutions didn't suit me.

Here is my little algorithm, using only STL:

//use like this
//std::vector<std::wstring> vec = Split<std::wstring> (L"Hello##world##!", L"##");

template <typename valueType>
static std::vector <valueType> Split (valueType text, const valueType& delimiter)
{
    std::vector <valueType> tokens;
    size_t pos = 0;
    valueType token;

    while ((pos = text.find(delimiter)) != valueType::npos) 
    {
        token = text.substr(0, pos);
        tokens.push_back (token);
        text.erase(0, pos + delimiter.length());
    }
    tokens.push_back (text);

    return tokens;
}

It works with a separator of any length and content, as far as I've tested. Instantiate it with either std::string or std::wstring.

The algorithm simply searches for the delimiter, takes the part of the string before it as a token, erases that token together with the delimiter, and repeats until no delimiter is found; whatever remains becomes the last token.
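
Erasing from the front shifts the rest of the string on every iteration. A possible variant, sketched here under the assumption that the delimiter is never empty, leaves the input untouched and instead resumes each search where the previous delimiter ended. The name SplitNoErase is hypothetical.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical variant of Split: produces the same tokens, but only reads the
// input and tracks a start offset instead of erasing consumed text.
template <typename StringT>
std::vector<StringT> SplitNoErase(const StringT& text, const StringT& delimiter)
{
    assert(!delimiter.empty()); // assumption: an empty delimiter is a caller error

    std::vector<StringT> tokens;
    typename StringT::size_type start = 0; // where the current token begins
    typename StringT::size_type pos;

    while ((pos = text.find(delimiter, start)) != StringT::npos) {
        tokens.push_back(text.substr(start, pos - start));
        start = pos + delimiter.length(); // continue right after the delimiter
    }
    tokens.push_back(text.substr(start)); // the remainder is the last token

    return tokens;
}
```

It is called the same way, for example SplitNoErase<std::wstring>(L"Hello##world##!", L"##"), and, like the original, keeps empty tokens between adjacent delimiters.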

Hope it helps.

Mint answered 17/3, 2014 at 16:54 Comment(0)
T
0

I once wrote a lexer/tokenizer using only the standard library. Here's the code:

#include <iostream>
#include <string>
#include <vector>
#include <sstream>
#include <cstdlib> // for system()

using namespace std;

string seps(string& s) {
    if (!s.size()) return "";
    stringstream ss;
    ss << s[0];
    for (int i = 1; i < s.size(); i++) {
        ss << '|' << s[i];
    }
    return ss.str();
}

void Tokenize(string& str, vector<string>& tokens, const string& delimiters = " ")
{
    seps(str); // NOTE: the return value is discarded, so this call has no effect

    // Skip delimiters at beginning.
    string::size_type lastPos = str.find_first_not_of(delimiters, 0);
    // Find the first delimiter after the token (i.e. the token's end).
    string::size_type pos = str.find_first_of(delimiters, lastPos);

    while (string::npos != pos || string::npos != lastPos)
    {
        // Found a token, add it to the vector.
        tokens.push_back(str.substr(lastPos, pos - lastPos));
        // Skip delimiters.  Note the "not_of"
        lastPos = str.find_first_not_of(delimiters, pos);
        // Find the end of the next token.
        pos = str.find_first_of(delimiters, lastPos);
    }
}

int main(int argc, char *argv[])
{
    vector<string> t;
    string s = "Tokens for everyone!";

    Tokenize(s, t, "|");

    for (auto c : t)
        cout << c << endl;

    system("pause");

    return 0;
}
Tropophilous answered 15/1, 2015 at 17:13 Comment(0)
I
-4

This is a simple loop to tokenise using only standard library headers

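#include <cstdio>
#include <cstring>

// one token, stored as a fixed-size, 0-terminated character array
class word
{
public:
    char w[20];
    word()
    {
        for (int j = 0; j < 20; j++) // was j<=20, which wrote past the array
            w[j] = '\0';
    }
};

int main()
{
    char input[100];
    word ww[100];
    int j = 0, k = 0; // j: position inside the current word, k: index of the current word

    // fgets replaces the unsafe gets and never overflows the buffer
    if (std::fgets(input, sizeof input, stdin) == NULL)
        return 1;
    input[std::strcspn(input, "\n")] = '\0'; // drop the trailing newline, if any

    const int n = (int)std::strlen(input);

    for (int i = 0; i < n; i++) // was i<=m on an undeclared array named context
    {
        if (input[i] != ' ')
        {
            if (j < 19) // keep room for the terminating '\0'
            {
                ww[k].w[j] = input[i];
                j++;
            }
        }
        else
        {
            k++; // a space ends the current word
            j = 0;
        }
    }

    // print the tokens, one per line
    for (int i = 0; i <= k; i++)
        std::printf("%s\n", ww[i].w);

    return 0;
}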
Involve answered 19/5, 2013 at 13:42 Comment(0)
