How do I tokenize a string in C++?

M

37

478

Java has a convenient split method:

String str = "The quick brown fox";
String[] results = str.split(" ");

Is there an easy way to do this in C++?

Methylal answered 10/9, 2008 at 12:10 Comment(2)

Possible duplicate of Split a string in C++? – Rabush 8/5, 2017 at 19:16

A solution to exactly this question seems to be here: riptutorial.com/cplusplus/example/2148/tokenize – Kurt 14/3, 2022 at 14:58

M

177

C++ standard library algorithms are almost universally based on iterators rather than concrete containers. Unfortunately this makes it hard to provide a Java-like split function in the C++ standard library, even though nobody disputes that it would be convenient. But what would its return type be? std::vector<std::basic_string<…>>? Maybe, but then we’re forced to perform (potentially redundant and costly) allocations.

Instead, C++ offers a plethora of ways to split strings based on arbitrarily complex delimiters, but none of them is encapsulated as nicely as in other languages. The numerous ways fill whole blog posts.

At its simplest, you could iterate using std::string::find until you hit std::string::npos, and extract the contents using std::string::substr.
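
The find/substr loop described above could be sketched roughly as follows. The helper name split_on and its single-character delimiter are assumptions made for illustration, not part of the standard library, and this version deliberately keeps empty tokens between adjacent delimiters:

```cpp
#include <string>
#include <vector>

// Hypothetical helper (not a standard function): splits `text` on every
// occurrence of `delim`, keeping empty tokens between adjacent delimiters.
std::vector<std::string> split_on(const std::string& text, char delim)
{
    std::vector<std::string> tokens;
    std::string::size_type start = 0;
    std::string::size_type pos;

    // Advance from delimiter to delimiter until find reports npos.
    while ((pos = text.find(delim, start)) != std::string::npos) {
        tokens.push_back(text.substr(start, pos - start));
        start = pos + 1;
    }

    // Whatever follows the last delimiter is the final token.
    tokens.push_back(text.substr(start));
    return tokens;
}
```

Calling split_on("The quick brown fox", ' ') should yield the four words. If empty tokens are unwanted, skipping the push_back when pos == start is one possible adjustment.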

A more fluid (and idiomatic, but basic) version for splitting on whitespace would use a std::istringstream:

auto iss = std::istringstream{"The quick brown fox"};
auto str = std::string{};

while (iss >> str) {
    process(str);
}

Using std::istream_iterators, the contents of the string stream could also be copied into a vector using its iterator range constructor.
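
As a minimal sketch of that iterator range approach (the wrapper name split_whitespace is an assumption chosen for illustration), the stream can be drained straight into a vector:

```cpp
#include <iterator>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: collects the whitespace-separated words of `text`.
std::vector<std::string> split_whitespace(const std::string& text)
{
    std::istringstream iss{text};

    // A default-constructed istream_iterator serves as the end-of-stream
    // sentinel, so this pair of iterators covers every word in the stream.
    return std::vector<std::string>(
        std::istream_iterator<std::string>{iss},
        std::istream_iterator<std::string>{});
}
```

Because operator>> skips whitespace, leading, trailing, and repeated spaces never produce empty tokens here, which differs from a plain find/substr loop.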

Multiple libraries (such as Boost.Tokenizer) offer specific tokenisers.

More advanced splitting requires regular expressions. C++ provides the std::regex_token_iterator for this purpose in particular:

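#include <regex>
#include <string>
#include <vector>

using namespace std::string_literals; // enables the "..."s literal suffix

auto const str = "The quick brown fox"s;
auto const re = std::regex{R"(\s+)"};
// Submatch index -1 selects the text between matches, i.e. the tokens.
auto const vec = std::vector<std::string>(
    std::sregex_token_iterator{begin(str), end(str), re, -1},
    std::sregex_token_iterator{}
);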

Murry answered 10/9, 2008 at 12:18 Comment(23)

Sadly, boost is not always available for all projects. I'll have to look for a non-boost answer. – Barrick 20/12, 2013 at 21:0

Not every project is open to "open source". I work in heavily regulated industries. It's not a problem, really. It's just a fact of life. Boost is not available everywhere. – Barrick 20/12, 2013 at 23:19

@NonlinearIdeas The other question / answer wasn’t about Open Source projects at all. The same is true for any project. That said, I of course understand about restricted standards such as MISRA C but then it’s understood that you build everything from scratch anyway (unless you happen to find a compliant library – a rarity). Anyway, the point is hardly that “Boost is not available” – it’s that you have special requirements for which almost any general-purpose answer would be unsuitable. – Murry 21/12, 2013 at 10:46

@NonlinearIdeas Case in point, the other, non-Boost answers are also not MISRA compliant. – Murry 21/12, 2013 at 10:47

This discussion sparked me to ask about my specific industry of concern: #20714509. – Barrick 21/12, 2013 at 12:2

@Barrick Programming competitions don't allow you to use libraries at all. – Dug 7/7, 2016 at 15:24

I tried to install boost 3 times, each of these times I was discouraged by STL barf. It's 2016, can't we just replace C preprocessor with PHP or JavaScript and embed it into our C++ files so that we use real transformation functions that transform string to string and give a sane error if something goes wrong, instead of the archaic #ifdef #ifndef that don't even make sense in C/C++ since the language wants to be condensable to "a single line" but macros require their own lines. Even Perl as a preprocessor would be better than using #ifdefs/#ifndef/template, and it's non blackbox transforms. – Frosting 11/9, 2016 at 17:23

@Dmitry What’s “STL barf”?! And the whole community is very much in favour of replacing the C preprocessor — in fact, there are proposals to do that. But your suggestion to use PHP or some other language instead would be a huge step backwards. – Murry 11/9, 2016 at 17:43

@KonradRudolph by STL barf I mean ~2-3 screens of expanded templates showing an error, often in code that is not in your program, which makes it difficult to figure out what went wrong for people who aren't used to it. That said I've learnt a lot since then so I guess I'll try setting boost up again, it's something that I should be able to set up. – Frosting 11/9, 2016 at 18:6

But what would its return type be? How about returning what Java and Python both return: an array of string. Seems like common sense. – Telluric 28/8, 2021 at 3:30

@Telluric And what is an “array of strings” in C++, pray tell? Even in Java the decision to return an array is actually problematic. In C++ it would be a complete non-starter in a generic library, since C++ has so many different array (and string) types, and a generic C++ library can’t just assume one, because it would clash with lots of client code. – Murry 28/8, 2021 at 8:19

@KonradRudolph: An array of strings looks like this: string result[42]. You can return a pointer to a heap-allocated array of string. Or you can return a pointer to an array of pointer to string. Or you can return a std::vector<string> by value. If C++ doesn't have a way to easily perform a string split to return a sequence of string tokens, then there is an underlying problem with the language runtime. I worked with C++ for close to ten years. I have since moved to Java and Python and never looked back. I'm still amazed C++ hasn't come up with a solution all this time. – Telluric 28/8, 2021 at 16:39

@Telluric Surely you must be aware that you cannot return C arrays by value in C++. You could technically return manually managed memory but this is completely unacceptable for a general-purpose standard algorithm. It’s atrocious for code quality, and nothing else in C++ does this (except for legacy C compatibility functions). And even std::vector is not a good generic type, and for this reason no C++ standard library algorithm returns it. – Murry 28/8, 2021 at 20:34

@Telluric And contrary to your claim it’s absolutely no problem to split strings into tokens in C++. My answer shows several ways to do this trivially. And C++ has added more ways since. But none of them returns an array of strings, for reasons explained. – Murry 28/8, 2021 at 20:37

@KonradRudolph: And what's wrong with returning a std::vector by value out of a function? – Telluric 28/8, 2021 at 20:41

@Telluric It locks the user in to use a specific container, which is anathema to the whole design of the C++ algorithms library. There are many situations where std::vector can’t or won’t be used. – Murry 28/8, 2021 at 20:50

@KonradRudolph: A string splitter should return a sequence of tokens. vector returns a sequence of things. They are a match for one another. – Telluric 28/8, 2021 at 20:59

@Telluric Again, not every code base uses std::vector, and even when a codebase uses std::vector for some parts, it’s not appropriate everywhere. std::vector isn’t the only container that models “sequence of X”, and users may not want to use it here. The algorithm can’t know. And generic algorithms in C++ are specifically designed to be container agnostic, to be, well, generic. There’s nothing wrong with implementing a string splitter to return std::vector<std::string> for specific uses, but this implementation has no place in the C++ algorithms standard library. – Murry 28/8, 2021 at 22:50

@KonradRudolph: If someone wants the tokens in another data structure, like a map, then they can write their own code to convert the return value of the splitter to their desired data structure. The point is that the return value of a splitter has to be in some data structure, and the most logical one is vector. This is just common sense. Java's split returns an array of string; if you want the result in a map, then you need to write your own code to convert the array of string to a map. – Telluric 28/8, 2021 at 23:4

@Telluric It fundamentally breaks C++’s design model, which is to provide zero-cost abstractions. The abstraction you are proposing would be anything but zero-cost. Frankly, at this point you’re just arguing for argument’s sake, and this discussion is completely unproductive (especially since it just rehashes what the answer already says). There’s no actual issue here: the solutions presented in my answer (and elsewhere) work. – Murry 28/8, 2021 at 23:26

What do you propose would be a "zero-cost abstraction" for a string splitter? An iterator over strings? – Telluric 29/8, 2021 at 1:11

I recommend you delete your answer before others read this bad advice. Please stick to PHP and HTML. – Telluric 30/8, 2021 at 4:17

@Telluric You should read the answer, it addresses that already. Anyway, if you don’t like the way C++ works you’re indeed free to stick to PHP and HTML, I won’t stop you. – Murry 31/8, 2021 at 8:38

T

194

The Boost tokenizer class can make this sort of thing quite simple:

#include <iostream>
#include <string>
#include <boost/foreach.hpp>
#include <boost/tokenizer.hpp>

using namespace std;
using namespace boost;

int main(int, char**)
{
    string text = "token, test   string";

    char_separator<char> sep(", ");
    tokenizer< char_separator<char> > tokens(text, sep);
    BOOST_FOREACH (const string& t, tokens) {
        cout << t << "." << endl;
    }
}

Updated for C++11:

#include <iostream>
#include <string>
#include <boost/tokenizer.hpp>

using namespace std;
using namespace boost;

int main(int, char**)
{
    string text = "token, test   string";

    char_separator<char> sep(", ");
    tokenizer<char_separator<char>> tokens(text, sep);
    for (const auto& t : tokens) {
        cout << t << "." << endl;
    }
}

Tempered answered 11/9, 2008 at 2:10 Comment(7)

Good stuff, I've recently utilized this. My Visual Studio compiler has an odd whinge until I use a whitespace to separate the two ">" characters before the tokens(text, sep) bit: (error C2947: expecting '>' to terminate template-argument-list, found '>>') – Aggappora 1/10, 2010 at 15:57

@Aggappora yes, without the space the compiler parses it as an extraction operator rather than two closing templates. – Forepleasure 14/6, 2011 at 3:23

Theoretically that's been fixed in C++0x – Dosage 1/9, 2011 at 2:9

Beware of the third parameter of the char_separator constructor (drop_empty_tokens is the default, the alternative is keep_empty_tokens). – Vanegas 17/2, 2012 at 10:56

@DavidSouther Talking about C++0x - the BOOST_FOREACH could now be replaced by the new for loop syntax. for ( string t : tokens ) { ... } – Electrotherapy 29/2, 2012 at 15:3

The double >> shouldn't be a problem with C++11, if so then VS2010 compilers+ have a bug and you should file it on connect. – Ladysmith 21/7, 2013 at 20:51

@puk - It's a commonly used suffix for C++ header files. (like .h for C headers) – Tempered 12/12, 2013 at 22:44

R

183

Here's a real simple one:

#include <vector>
#include <string>

std::vector<std::string> split(const char *str, char c = ' ')
{
    std::vector<std::string> result;

    do
    {
        const char *begin = str;

        while(*str != c && *str)
            str++;

        result.push_back(std::string(begin, str));
    } while (0 != *str++);

    return result;
}

Rissa answered 10/9, 2008 at 12:30 Comment(5)

do I need to add a prototype for this method in .h file ? – Pelvis 22/12, 2011 at 8:26

This is not exactly the "best" answer as it still uses a string literal which is the plain C constant character array. I believe the questioner was asking if he could tokenize a C++ string which is of type "string" introduced by the latter. – Asha 19/4, 2017 at 7:5

This needs a new answer because I strongly suspect the inclusion of regular expressions in C++11 has changed what the best answer would be. – Broussard 25/10, 2017 at 15:45

This answer has a problem with strings whose first/last char is equal to the separator, e.g. the string " a" results in [" ", "a"]. – Skulk 13/11, 2020 at 11:26

@y30: I think it results in ["","a"]. – Marcie 25/11, 2023 at 8:40
