Searching for Holy Grail of search and replace in C++

Asked 21/12, 2015 at 18:55 Answered 19/7, 2019 at 12:15

Solved c++algorithm boost mfc boost-spirit-qi

Recently I was looking for a way to replace tokens in string which is essentially find and replace (but there is at least one additional approach to the problem) and looks like quite banal task. I've came with couple of possible implementations but none of them was satisfying from performance point of view. Best achievement was ~50us per iteration. The case was ideal, the string was never growing in size and initially I've omitted the requirement to be case insensitive
Here is the code at Coliru

Results from my machine:
Boost.Spirit symbols result: 3421?=3421
100000 cycles took 6060ms.
Boyer-Moore results:3421?=3421
100000 cycles took 5959ms.
Boyer Moore Hospool result:3421?=3421
100000 cycles took 5008ms.
Knuth Morris Pratt result: 3421?=3421
100000 cycles took 12451ms.
Naive STL search and replace result: 3421?=3421
100000 cycles took 5532ms.
Boost replace_all result:3421?=3421
100000 cycles took 4860ms.

So the question, what takes so long in such a simple task? One can say, ok, simple task, go ahead and implement it better. But the reality is that 15 years old MFC naive implementation does the task order of magnitude faster:

CString FillTokenParams(const CString& input, const std::unordered_map<std::string, std::string>& tokens)
{
    CString tmpInput = input;
    for(const auto& token : tokens)
    {
        int pos = 0;
        while(pos != -1)
        {
            pos = tmpInput.Find(token.first.c_str(), pos);
            if(pos != -1)
            {
                int tokenLength = token.first.size();
                tmpInput.Delete(pos, tokenLength);
                tmpInput.Insert(pos, token.second.c_str());
                pos += 1;
            }
        }
    }

    return tmpInput;
}

result:
MFC naive search and replace result:3421?=3421
100000 cycles took 516ms.
How come this clumsy code outperforms modern C++??? why other implementations were so slow? Am I missing something fundamental?

EDIT001: I've invested in this question, the code been profiled and triple checked. You can be unsatisfied with this and that, but std::string::replace is not what taking time. In any STL implementation search is what taking most of the time, boost spirit wastes time on allocation of tst (test node in evaluation tree I guess). I dont expect anyone pointing on a line in a function with "this is your problem" and poof, the problem is gone. The question is how did MFC manages to do the same work 10 times faster.

EDIT002: Just drilled down into MFC implementation of Find and wrote a function which mimics the MFC implementation

namespace mfc
{
std::string::size_type Find(const std::string& input, const std::string& subString, std::string::size_type start)
{
    if(subString.empty())
    {
        return std::string::npos;
    }

    if(start < 0 || start > input.size())
    {
        return std::string::npos;
    }

    auto found = strstr(input.c_str() + start, subString.c_str());
    return ((found == nullptr) ? std::string::npos : std::string::size_type(found - input.c_str()));
}
}

std::string MFCMimicking(const std::string& input, const std::unordered_map<std::string, std::string>& tokens)
{
    auto tmpInput = input;
    for(const auto& token : tokens)
    {
        auto pos = 0;
        while(pos != std::string::npos)
        {
            pos = mfc::Find(tmpInput, token.first, pos);
            if(pos != std::string::npos)
            {
                auto tokenLength = token.first.size();
                tmpInput.replace(pos, tokenLength, token.second.c_str());
                pos += 1;
            }
        }
    }

    return tmpInput;
}

Results:
MFC mimicking expand result:3421?=3421
100000 cycles took 411ms.
Meaning 4us. per call, go beat that C strstr

EDIT003: Compiling and running with -Ox

MFC mimicking expand result:3421?=3421
100000 cycles took 660ms.
MFC naive search and replace result:3421?=3421
100000 cycles took 856ms.
Manual expand result:3421?=3421
100000 cycles took 1995ms.
Boyer-Moore results:3421?=3421
100000 cycles took 6911ms.
Boyer Moore Hospool result:3421?=3421
100000 cycles took 5670ms.
Knuth Morris Pratt result: 3421?=3421
100000 cycles took 13825ms.
Naive STL search and replace result: 3421?=3421
100000 cycles took 9531ms.
Boost replace_all result:3421?=3421
100000 cycles took 8996ms.

running with -O2 (as in original measurements) but 10k cycles

MFC mimicking expand result:3421?=3421
10000 cycles took 104ms.
MFC naive search and replace result:3421?=3421
10000 cycles took 105ms.
Manual expand result:3421?=3421
10000 cycles took 356ms.
Boyer-Moore results:3421?=3421
10000 cycles took 1355ms.
Boyer Moore Hospool result:3421?=3421
10000 cycles took 1101ms.
Knuth Morris Pratt result: 3421?=3421
10000 cycles took 1973ms.
Naive STL search and replace result: 3421?=3421
10000 cycles took 923ms.
Boost replace_all result:3421?=3421
10000 cycles took 880ms.

Parve answered 21/12, 2015 at 18:55 Comment(23)

What's the definition for CString? Is it your class? It is not a standard C++ class. – Endpaper 21/12, 2015 at 19:1

Please clarify "modern C++". Are you talking C++14? Are you talking Microsoft's CLI? – Endpaper 21/12, 2015 at 19:2

Please show the code performing (measuring) the benchmarking. Put the code into your question. – Endpaper 21/12, 2015 at 19:3

CString - MFC, Microsoft Foundation Classes, an ancient knowledge, lost centuries ago :) By modern I mean nowadays, and not something that was implemented two decades ago, be it C++11,14,17 and yes, C++03 can be modern too :) – Parve 21/12, 2015 at 19:29

@ThomasMatthews As of to put code into question, not a good idea for 200 lines of code, right? That what Coliru for – Parve 21/12, 2015 at 19:31

@ThomasMatthews How to "show the code performing"? Profiling session screenshots? – Parve 21/12, 2015 at 19:32

503 Service Temporarily Unavailable :( – Parve 21/12, 2015 at 20:11

Into the spirit part? I have a particular question with symbols. In case I know token names in compile time, but dont know the values which will be calculated in runtime, can I somehow employ the static lexer model? – Parve 21/12, 2015 at 20:14

All parts. You asked a question ;) BTW were you the one asking on the mailing list? – Belonging 21/12, 2015 at 20:16

boost mailing list? hm... nope... not recently... – Parve 21/12, 2015 at 20:24

Let us continue this discussion in chat. – Belonging 21/12, 2015 at 20:31

Post the section of code that you are profiling, such as the statements between start of reading the clock and after reading the clock. – Endpaper 21/12, 2015 at 20:32

I take it that you're comparing MFC with Microsoft's implementation of the STL. That makes the most sense (keep the platform the same for each test), but means that other implementations may do a better job. – Masquerade 21/12, 2015 at 22:16

@MaxLybbert first of all you are right regarding Microsoft implementation. and honestly, I dont see any reason not to trust Stephan T. Lavavej and his team, no STLPort in foreseeable future – Parve 22/12, 2015 at 4:52

solved it. see latest answer - simpler than we thought! :) – Reliance 22/12, 2015 at 12:47

I suggest adding your own answer with the MFCMimicking code, because it does appear to be the winner - for the payloads shown; code /cc @RichardHodges – Belonging 22/12, 2015 at 15:4

@Belonging didn't see the latest update. I am merely answering the question as to why the op saw an order of magnitude improvement in mfc. The same holds for strstr - it's an optimised (possibly even intrinsic) function in the standard library. – Reliance 22/12, 2015 at 15:8

@sehe, will do, but later, I'm waiting for Boost.Spirit mailing list to answer my question with running X3 on VS2015 Update1. and I want to use that cool Nonius framework for my answer :) – Parve 22/12, 2015 at 17:25

@Belonging after an extended discussion with kreuzerkrieg it seems to me that it comes down to the implementation of strstr in the CRT. I am working on apple clang linked against libc++ (rather than libstdc++). I see no 10x improvement when strstr gets involve - more like a 20% improvement. – Reliance 22/12, 2015 at 20:3

All, if you would like to build and run this code coliru.stacked-crooked.com/a/a056016b8e75825f it will be really appreciated. Please report your platform, compiler and results. dont forget -O3 :) – Parve 22/12, 2015 at 20:15

@Belonging how do you build these cool graphs, I mean PNGs not interactive one, didnt find it in nonius framework – Parve 22/12, 2015 at 20:17

I linked to the github.com/rmartinho/nonius framework :) Try ./test -r html -o stats.html. The pngs are just screen grabs. The results of those timings are already here – Belonging 22/12, 2015 at 20:37

That's what I mean, I already using this framework, thought there is a plugin or something to generate PNGs. Will try this -r html -o stats.html stuff – Parve 22/12, 2015 at 20:41

Ok, it will be a long story. Just to remind you questions asked.

Why the search and replace using C++ (various approaches) so slow?
Why the MFC search and replace so fast?

Surprisingly, both questions have the same answer. Because of C++ overhead. Yep. Our shiny modern C++ has an overhead which mostly we dismiss for the flexibility and elegance.

However, when it comes to sub-microsecond resolutions (not that C++ is not capable of doing things in nanosecond resolutions) the overhead becomes more prominent.

Let me showcase with the same code I've posted in the question, but it is more coherent with things done in each function.

Live On Coliru

It uses aforementioned Nonius (thanks to @sehe) and interactive results are here.

You can click the legend to show/hide particular series.

Conclusions

There are two outstanding results

MFC mimicking function and
my own manual replace

These functions at least one order of magnitude are faster than the rest, so what is the difference?

All these "slow" functions are written in C++ when the fast is written in C (not pure C, I was too lazy to deal with malloc/realloc of output buffers when the output grows in size). Well, I guess it is clear that sometimes there is no choice but resort to pure C. I personally against using C for security reasons and lack of type safety. In addition it simply requires more expertise and attention to write high quality C code.

I will not mark it as an answer for a while, waiting for comments on this conclusion.

I want to thank all those who actively participated in the discussion, raised ideas and pointed out inconsistencies in my example.

2019 update:
Just to preserve the code: https://github.com/kreuzerkrieg/string_search_replace
Nonius results: https://kreuzerkrieg.github.io/string_search_replace/

Ran with gcc-9 on Ubuntu 18.04

Parve answered 23/12, 2015 at 16:21 Comment(6)

I think you forgot to test your kreuzerkriegManual function. It results in this. Also, mine is not "reverse" and you handicapped it a little by manipulating the key each time. Can you update the comparison with your bug fixed? – Belonging 23/12, 2015 at 21:17

I've tested it, then I've changed it a little, and most likely introduced some bug there. will fix it later, dont have an access to my machine today – Parve 24/12, 2015 at 5:4

@sehe, just tested it, works as expected, also checked the version of code that I've posted, it is the same one I'm running locally. Odd story... – Parve 27/12, 2015 at 5:20

Not odd at all. Turns out you just used the wrong token map: side-by-side.cpp (morale: assert is useless if you mandate NDEBUG). Fortunately, the performance difference favours the correct version: interactive graph – Belonging 27/12, 2015 at 23:40

ah, darn! The morale is dont run #define dependent code and expect same results! And thanks for assistance! – Parve 28/12, 2015 at 5:23

Could you please accept this answer so the ultimate conclusions get more airing? – Belonging 29/12, 2015 at 21:35

So, I have some observations about the Qi version.

Also created an X3 version.

Finally, wrote a manual expansion function that beats all the other candidates (I expect it to be faster than MFC, because it doesn't bother with repeated deletes/inserts).

Skip to the benchmark charts if you want.

About the Qi version

yes, symbol tables suffer from locality issues of node-based containers. They might not be the best match you can use here.
There's no need to rebuild the symbols each loop:
Instead of character-wise skipping of non-symbols, scan until the next:
```
+(bsq::char_ - symbols)
```

inline std::string spirit_qi(const std::string& input, bsq::symbols<char, std::string> const& symbols)
{
    std::string retVal;
    retVal.reserve(input.size() * 2);

    auto beg = input.cbegin();
    auto end = input.cend();

    if(!bsq::parse(beg, end, *(symbols | +(bsq::char_ - symbols)), retVal))
        retVal = input;

    return retVal;
}

This is considerably faster already. But:

Manual Loops

In this trivial example, why don't you just do the parsing manually?

inline std::string manual_expand(const std::string& input, TokenMap const& tokens)
{
    std::ostringstream builder;
    auto expand = [&](auto const& key) {
        auto match = tokens.find(key);
        if (match == tokens.end())
            builder << "$(" << key << ")";
        else
            builder << match->second;
    };

    builder.str().reserve(input.size()*2);

    builder.str("");
    std::ostreambuf_iterator<char> out(builder);

    for(auto f(input.begin()), l(input.end()); f != l;) {
        switch(*f) {
            case '$' : {
                    if (++f==l || *f!='(') {
                        *out++ = '$';
                        break;
                    }
                    else {
                        auto s = ++f;
                        size_t n = 0;

                        while (f!=l && *f != ')')
                            ++f, ++n;

                        // key is [s,f] now
                        expand(std::string(&*s, &*s+n));

                        if (f!=l)
                            ++f; // skip '}'
                    }
                }
            default:
                *out++ = *f++;
        }
    }
    return builder.str();
}

This is vastly superior in performance on my machine.

Other ideas

You could look at Boost Spirit Lex, potentially with statically generated token tables: http://www.boost.org/doc/libs/1_60_0/libs/spirit/doc/html/spirit/lex/abstracts/lexer_static_model.html. I'm not particularly fond of Lex though.

COMPARISONS:

See the Interactive Chart

That's using Nonius for the benchmarking statistics.

Full Benchmark code: http://paste.ubuntu.com/14133072/

#include <boost/container/flat_map.hpp>

#define USE_X3
#ifdef USE_X3
#   include <boost/spirit/home/x3.hpp>
#else
#   include <boost/spirit/include/qi.hpp>
#endif

#include <boost/algorithm/string.hpp>
#include <boost/algorithm/searching/boyer_moore.hpp>
#include <boost/algorithm/searching/boyer_moore_horspool.hpp>
#include <boost/algorithm/searching/knuth_morris_pratt.hpp>
#include <string>
#include <unordered_map>
#include <iostream>
#include <fstream>
#include <nonius/benchmark.h++>
#include <nonius/main.h++>

using TokenMap = boost::container::flat_map<std::string, std::string>;

#ifdef USE_X3
    namespace x3  = boost::spirit::x3;

    struct append {
        std::string& out;
        void do_append(char const ch) const                       { out += ch;                      } 
        void do_append(std::string const& s)  const               { out += s;                       } 
        template<typename It>
        void do_append(boost::iterator_range<It> const& r)  const { out.append(r.begin(), r.end()); } 
        template<typename Ctx>
        void operator()(Ctx& ctx) const                           { do_append(_attr(ctx));          } 
    };

    inline std::string spirit_x3(const std::string& input, x3::symbols<char const*> const& symbols)
    {
        std::string retVal;
        retVal.reserve(input.size() * 2);
        append appender { retVal };

        auto beg = input.cbegin();
        auto end = input.cend();

        auto rule = *(symbols[appender] | x3::char_ [appender]);

        if(!x3::parse(beg, end, rule))
            retVal = input;

        return retVal;
    }
#else
    namespace bsq = boost::spirit::qi;

    inline std::string spirit_qi_old(const std::string& input, TokenMap const& tokens)
    {
        std::string retVal;
        retVal.reserve(input.size() * 2);
        bsq::symbols<char const, char const*> symbols;
        for(const auto& token : tokens) {
            symbols.add(token.first.c_str(), token.second.c_str());
        }

        auto beg = input.cbegin();
        auto end = input.cend();

        if(!bsq::parse(beg, end, *(symbols | bsq::char_), retVal))
            retVal = input;

        return retVal;
    }

    inline std::string spirit_qi(const std::string& input, bsq::symbols<char, std::string> const& symbols)
    {
        std::string retVal;
        retVal.reserve(input.size() * 2);

        auto beg = input.cbegin();
        auto end = input.cend();

        if(!bsq::parse(beg, end, *(symbols | +(bsq::char_ - symbols)), retVal))
            retVal = input;

        return retVal;
    }
#endif

inline std::string manual_expand(const std::string& input, TokenMap const& tokens) {
    std::ostringstream builder;
    auto expand = [&](auto const& key) {
        auto match = tokens.find(key);

        if (match == tokens.end())
            builder << "$(" << key << ")";
        else
            builder << match->second;
    };

    builder.str().reserve(input.size()*2);
    std::ostreambuf_iterator<char> out(builder);

    for(auto f(input.begin()), l(input.end()); f != l;) {
        switch(*f) {
            case '$' : {
                    if (++f==l || *f!='(') {
                        *out++ = '$';
                        break;
                    }
                    else {
                        auto s = ++f;
                        size_t n = 0;

                        while (f!=l && *f != ')')
                            ++f, ++n;

                        // key is [s,f] now
                        expand(std::string(&*s, &*s+n));

                        if (f!=l)
                            ++f; // skip '}'
                    }
                }
            default:
                *out++ = *f++;
        }
    }
    return builder.str();
}

inline std::string boost_replace_all(const std::string& input, TokenMap const& tokens)
{
    std::string retVal(input);
    retVal.reserve(input.size() * 2);

    for(const auto& token : tokens)
    {
        boost::replace_all(retVal, token.first, token.second);
    }
    return retVal;
}

inline void naive_stl(std::string& input, TokenMap const& tokens)
{
    input.reserve(input.size() * 2);
    for(const auto& token : tokens)
    {
        auto next = std::search(input.cbegin(), input.cend(), token.first.begin(), token.first.end());
        while(next != input.cend())
        {
            input.replace(next, next + token.first.size(), token.second);
            next = std::search(input.cbegin(), input.cend(), token.first.begin(), token.first.end());
        }
    }
}

inline void boyer_more(std::string& input, TokenMap const& tokens)
{
    input.reserve(input.size() * 2);
    for(const auto& token : tokens)
    {
        auto next =
            boost::algorithm::boyer_moore_search(input.cbegin(), input.cend(), token.first.begin(), token.first.end());
        while(next != input.cend())
        {
            input.replace(next, next + token.first.size(), token.second);
            next = boost::algorithm::boyer_moore_search(input.cbegin(), input.cend(), token.first.begin(),
                                                        token.first.end());
        }
    }
}

inline void bmh_search(std::string& input, TokenMap const& tokens)
{
    input.reserve(input.size() * 2);
    for(const auto& token : tokens)
    {
        auto next = boost::algorithm::boyer_moore_horspool_search(input.cbegin(), input.cend(), token.first.begin(),
                                                                  token.first.end());
        while(next != input.cend())
        {
            input.replace(next, next + token.first.size(), token.second);
            next = boost::algorithm::boyer_moore_search(input.cbegin(), input.cend(), token.first.begin(),
                                                        token.first.end());
        }
    }
}

inline void kmp_search(std::string& input, TokenMap const& tokens)
{
    input.reserve(input.size() * 2);
    for(const auto& token : tokens)
    {
        auto next = boost::algorithm::knuth_morris_pratt_search(input.cbegin(), input.cend(), token.first.begin(),
                                                                token.first.end());
        while(next != input.cend())
        {
            input.replace(next, next + token.first.size(), token.second);
            next = boost::algorithm::boyer_moore_search(input.cbegin(), input.cend(), token.first.begin(),
                                                        token.first.end());
        }
    }
}

namespace testdata {
    std::string const expected =
        "Five and Seven said nothing, but looked at Two. Two began in a low voice, 'Why the fact is, you see, Miss, "
        "this here ought to have been a red rose-tree, and we put a white one in by mistake; and if the Queen was to "
        "find it out, we should all have our heads cut off, you know. So you see, Miss, we're doing our best, afore "
        "she comes, to—' At this moment Five, who had been anxiously looking across the garden, called out 'The Queen! "
        "The Queen!' and the three gardeners instantly threw themselves flat upon their faces. There was a sound of "
        "many footsteps, and Alice looked round, eager to see the Queen.First came ten soldiers carrying clubs; these "
        "were all shaped like the three gardeners, oblong and flat, with their hands and feet at the corners: next the "
        "ten courtiers; these were ornamented all over with diamonds, and walked two and two, as the soldiers did. "
        "After these came the royal children; there were ten of them, and the little dears came jumping merrily along "
        "hand in hand, in couples: they were all ornamented with hearts. Next came the guests, mostly Kings and "
        "Queens, and among them Alice recognised the White Rabbit: it was talking in a hurried nervous manner, smiling "
        "at everything that was said, and went by without noticing her. Then followed the Knave of Hearts, carrying "
        "the King's crown on a crimson velvet cushion; and, last of all this grand procession, came THE KING AND QUEEN "
        "OF HEARTS.Alice was rather doubtful whether she ought not to lie down on her face like the three gardeners, "
        "but she could not remember ever having heard of such a rule at processions; 'and besides, what would be the "
        "use of a procession,' thought she, 'if people had all to lie down upon their faces, so that they couldn't see "
        "it?' So she stood still where she was, and waited.When the procession came opposite to Alice, they all "
        "stopped and looked at her, and the Queen said severely 'Who is this?' She said it to the Knave of Hearts, who "
        "only bowed and smiled in reply.'Idiot!' said the Queen, tossing her head impatiently; and, turning to Alice, "
        "she went on, 'What's your name, child?''My name is Alice, so please your Majesty,' said Alice very politely; "
        "but she added, to herself, 'Why, they're only a pack of cards, after all. I needn't be afraid of them!''And "
        "who are these?' said the Queen, pointing to the three gardeners who were lying round the rosetree; for, you "
        "see, as they were lying on their faces, and the pattern on their backs was the same as the rest of the pack, "
        "she could not tell whether they were gardeners, or soldiers, or courtiers, or three of her own children.'How "
        "should I know?' said Alice, surprised at her own courage. 'It's no business of mine.'The Queen turned crimson "
        "with fury, and, after glaring at her for a moment like a wild beast, screamed 'Off with her head! "
        "Off—''Nonsense!' said Alice, very loudly and decidedly, and the Queen was silent.The King laid his hand upon "
        "her arm, and timidly said 'Consider, my dear: she is only a child!'The Queen turned angrily away from him, "
        "and said to the Knave 'Turn them over!'The Knave did so, very carefully, with one foot.'Get up!' said the "
        "Queen, in a shrill, loud voice, and the three gardeners instantly jumped up, and began bowing to the King, "
        "the Queen, the royal children, and everybody else.'Leave off that!' screamed the Queen. 'You make me giddy.' "
        "And then, turning to the rose-tree, she went on, 'What have you been doing here?'";
    std::string const inputWithtokens =
        "Five and Seven said nothing, but looked at $(Two). $(Two) began in a low voice, 'Why the fact is, you see, "
        "Miss, "
        "this here ought to have been a red rose-tree, and we put a white one in by mistake; and if the Queen was to "
        "find it out, we should all have our $(heads) cut off, you know. So you see, Miss, we're doing our best, afore "
        "she comes, to—' At this moment Five, who had been anxiously looking across the garden, called out 'The Queen! "
        "The Queen!' and the three gardeners instantly threw themselves flat upon their faces. There was a sound of "
        "many footsteps, and Alice looked round, eager to see the $(Queen).First came ten soldiers carrying clubs; "
        "these "
        "were all shaped like the three gardeners, oblong and flat, with their hands and feet at the corners: next the "
        "ten courtiers; these were ornamented all over with $(diamonds), and walked two and two, as the soldiers did. "
        "After these came the royal children; there were ten of them, and the little dears came jumping merrily along "
        "hand in hand, in couples: they were all ornamented with hearts. Next came the guests, mostly Kings and "
        "Queens, and among them Alice recognised the White Rabbit: it was talking in a hurried nervous manner, smiling "
        "at everything that was said, and went by without noticing her. Then followed the Knave of Hearts, carrying "
        "the King's crown on a crimson velvet cushion; and, last of all this grand procession, came THE KING AND QUEEN "
        "OF HEARTS.Alice was rather doubtful whether she ought not to lie down on her face like the three gardeners, "
        "but she could not remember ever having heard of such a rule at processions; 'and besides, what would be the "
        "use of a procession,' thought she, 'if people had all to lie down upon their faces, so that they couldn't see "
        "it?' So she stood still where she was, and waited.When the procession came opposite to Alice, they all "
        "stopped and looked at her, and the $(Queen) said severely 'Who is this?' She said it to the Knave of Hearts, "
        "who "
        "only bowed and smiled in reply.'Idiot!' said the Queen, tossing her head impatiently; and, turning to Alice, "
        "she went on, 'What's your name, child?''My name is Alice, so please your Majesty,' said Alice very politely; "
        "but she added, to herself, 'Why, they're only a pack of cards, after all. I needn't be afraid of them!''And "
        "who are these?' said the $(Queen), pointing to the three gardeners who were lying round the rosetree; for, "
        "you "
        "see, as they were lying on their faces, and the $(pattern) on their backs was the same as the rest of the "
        "pack, "
        "she could not tell whether they were gardeners, or soldiers, or courtiers, or three of her own children.'How "
        "should I know?' said Alice, surprised at her own courage. 'It's no business of mine.'The Queen turned crimson "
        "with fury, and, after glaring at her for a moment like a wild beast, screamed 'Off with her head! "
        "Off—''Nonsense!' said $(Alice), very loudly and decidedly, and the Queen was silent.The $(King) laid his hand "
        "upon "
        "her arm, and timidly said 'Consider, my dear: she is only a child!'The $(Queen) turned angrily away from him, "
        "and said to the $(Knave) 'Turn them over!'The $(Knave) did so, very carefully, with one foot.'Get up!' said "
        "the "
        "Queen, in a shrill, loud voice, and the three gardeners instantly jumped up, and began bowing to the King, "
        "the Queen, the royal children, and everybody else.'Leave off that!' screamed the Queen. 'You make me giddy.' "
        "And then, turning to the rose-tree, she went on, 'What have you been doing here?'";

    static TokenMap const raw_tokens {
        {"Two", "Two"},           {"heads", "heads"},
        {"diamonds", "diamonds"}, {"Queen", "Queen"},
        {"pattern", "pattern"},   {"Alice", "Alice"},
        {"King", "King"},         {"Knave", "Knave"},
        {"Why", "Why"},           {"glaring", "glaring"},
        {"name", "name"},         {"know", "know"},
        {"Idiot", "Idiot"},       {"children", "children"},
        {"Nonsense", "Nonsense"}, {"procession", "procession"},
    };

    static TokenMap const tokens {
        {"$(Two)", "Two"},           {"$(heads)", "heads"},
        {"$(diamonds)", "diamonds"}, {"$(Queen)", "Queen"},
        {"$(pattern)", "pattern"},   {"$(Alice)", "Alice"},
        {"$(King)", "King"},         {"$(Knave)", "Knave"},
        {"$(Why)", "Why"},           {"$(glaring)", "glaring"},
        {"$(name)", "name"},         {"$(know)", "know"},
        {"$(Idiot)", "Idiot"},       {"$(children)", "children"},
        {"$(Nonsense)", "Nonsense"}, {"$(procession)", "procession"},
    };

}

NONIUS_BENCHMARK("manual_expand", [](nonius::chronometer cm)     {
    std::string const tmp = testdata::inputWithtokens;
    auto& tokens = testdata::raw_tokens;

    std::string result;
    cm.measure([&](int) {
        result = manual_expand(tmp, tokens);
    });
    assert(result == testdata::expected);
})

#ifdef USE_X3
NONIUS_BENCHMARK("spirit_x3", [](nonius::chronometer cm) {
    auto const symbols = [&] {
        x3::symbols<char const*> symbols;
        for(const auto& token : testdata::tokens) {
            symbols.add(token.first.c_str(), token.second.c_str());
        }
        return symbols;
    }();

    std::string result;
    cm.measure([&](int) {
            result = spirit_x3(testdata::inputWithtokens, symbols);
        });
    //std::cout << "====\n" << result << "\n====\n";
    assert(testdata::expected == result);
})
#else
NONIUS_BENCHMARK("spirit_qi", [](nonius::chronometer cm) {
    auto const symbols = [&] {
        bsq::symbols<char, std::string> symbols;
        for(const auto& token : testdata::tokens) {
            symbols.add(token.first.c_str(), token.second.c_str());
        }
        return symbols;
    }();

    std::string result;
    cm.measure([&](int) {
            result = spirit_qi(testdata::inputWithtokens, symbols);
        });
    assert(testdata::expected == result);
})

NONIUS_BENCHMARK("spirit_qi_old", [](nonius::chronometer cm) {
    std::string result;
    cm.measure([&](int) {
            result = spirit_qi_old(testdata::inputWithtokens, testdata::tokens);
        });
    assert(testdata::expected == result);
})
#endif

NONIUS_BENCHMARK("boyer_more", [](nonius::chronometer cm) {
    cm.measure([&](int) {
        std::string tmp = testdata::inputWithtokens;
        boyer_more(tmp, testdata::tokens);
        assert(tmp == testdata::expected);
    });
})

NONIUS_BENCHMARK("bmh_search", [](nonius::chronometer cm) {
    cm.measure([&](int) {
        std::string tmp = testdata::inputWithtokens;
        bmh_search(tmp, testdata::tokens);
        assert(tmp == testdata::expected);
    });
})

NONIUS_BENCHMARK("kmp_search", [](nonius::chronometer cm) {
    cm.measure([&](int) {
        std::string tmp = testdata::inputWithtokens;
        kmp_search(tmp, testdata::tokens);
        assert(tmp == testdata::expected);
    });
})

NONIUS_BENCHMARK("naive_stl", [](nonius::chronometer cm) {
    cm.measure([&](int) {
            std::string tmp = testdata::inputWithtokens;
            naive_stl(tmp, testdata::tokens);
            assert(tmp == testdata::expected);
        });
})

NONIUS_BENCHMARK("boost_replace_all", [](nonius::chronometer cm)     {
    std::string const tmp = testdata::inputWithtokens;

    std::string result;
    cm.measure([&](int) {
        result = boost_replace_all(testdata::inputWithtokens, testdata::tokens);
    });
    assert(result == testdata::expected);
})

Belonging answered 21/12, 2015 at 22:10 Comment(10)

Added full comparative benchmarks. You can see that nothing beats my manual approach (8µs), and spirit X3 comes in second (22µs). As expected, the replacing methods are slower. Recorded live streams: part 1 and part 2 – Belonging 21/12, 2015 at 22:55

I doubt that it is faster, but x3::raw[x3::seek[&symbols|x3::eoi]][appender] % symbols[appender]; also works. – Penuchle 21/12, 2015 at 23:15

@cv_and_he Oh I still forgot, lemme compare. Wow! It's as fast as the manually coded one (code). I was soooo close - thanks for fixing it using the eoi :) – Belonging 21/12, 2015 at 23:23

@sehe, item 2 - do you mean the for loop in the function for building symbols? it should be there, in real life the token map changes each time – Parve 22/12, 2015 at 4:58

@sehe, what is the X3 status? has it reached full operational capability in 1.60? As for lexer, have to check it, completely unfamiliar with this part od spirit, but not sure (from boost examples) it fits the purpose in this particular case. Ok, going to check your solutions and post results here – Parve 22/12, 2015 at 5:16

@sehe, whats the point of boost::container::flat_map? better locality? did you see any gains in this particular case? – Parve 22/12, 2015 at 5:32

ummm... is X3 supposed to compile with VS2015? – Parve 22/12, 2015 at 6:5

@Belonging meanwhile checked the manual expand. Of course it was the fastest one, but, it was 3 times slower than MFC (which I guess you cant and most likely, dont want to check :) ) The stream was consuming too much CPU, so I rewrote it to work with std::string, now it just 1.9 times slower than MFC implementation. – Parve 22/12, 2015 at 7:29

@Parve In case you are interested in a Spirit Qi version of one of the two X3 approaches you can try this. I think they should be similar, but it may (or may not) have different performance characteristics(due to the internal machinery of both libraries). – Penuchle 22/12, 2015 at 10:6

@cv_and_he, meanwhile I cant compile sehe example with VS2015, checking it with Boost.Spirit mail list guys, but looks like it doesn't have a support for VS – Parve 22/12, 2015 at 15:54

EDIT2 for MFCMimicking: Well from your code it's obvious why the MFC version is faster: It doesn't search the entire string each replacement like some of your other versions do (I still can't explain boost::spirit). As soon as it does a replace it starts searching from the replacement point, not from the beginning of the string, so it's completely obvious that will be faster.

EDIT: After doing some more research and seeing (Algorithm to find multiple string matches) it seems that using good single-string-match algorithms for finding multiple search terms is the actual problem here. Probably your best bet is to use an appropriate algorithm (a few are mentioned in that question).

As for why MFC is faster? I would suggest distilling that down to a different question "Why are delete and insert in CString so much faster than std::string" or something like that, and make sure you tag it C++ and MFC so people with the right expertise can help (I have experience with standard C++ but can't help on what optimizations VC++ put into CString).

Original answer: OK due to the huge volume of code I only looked at expandTokens3 but I assume all versions have the same problem. Your code has two potentially significant performance issues:

You search the entire string each time you do a replacement. If you have ten variables to replace in a string it will take on the order of ten times longer than needed.
You execute each replacement in-place in the input string rather than building up the result string from each piece. This can result in memory allocation and copy for each replacement you do, again possibly significantly increasing runtime.

Yellowweed answered 21/12, 2015 at 19:9 Comment(5)

Agree with the first one, you can try to find a token in input string and then match it in token map. Strongly disagree with second. Have both versions, no significant change. If you dont know the size of result you cannot avoid allocations, in this particular case, no allocation will occur since the result string will be always smaller than input. And, of course I've profiled it, wouldn't say that most time was spent on allocations. The point is, how come the ancient, dreaded, disgusting MFC beats everything? – Parve 21/12, 2015 at 19:20

@Parve With a basic heuristic estimation of the overall replacement increase you can minimize reallocations. – Yellowweed 21/12, 2015 at 19:24

you can, I've left it out of the equation (mentioned it in initial question) for simplicity – Parve 21/12, 2015 at 19:26

On your edit: you are missing very important point, replacement doesnt take the lion's share of the time, search does. So if distilling to one short question it should be something like "What the heck they did to CString::Find to work that fast". As for tags: this is the inconvenient part of SO, have to remove boost tag to add the MFC. And thanks for the link, will look into the grep stuff. – Parve 22/12, 2015 at 5:23

Re.: EDIT2 my findings expressly show that the bottleneck is strstr vs. std::search or even std::find. That's completely irrespective of whether the "re"-search starts at the known offset. – Belonging 28/12, 2015 at 11:43

So the question, what takes so long in such a simple task? One can say, ok, simple task, go ahead and implement it better. But the reality is that 15 years old MFC naive implementation does the task order of magnitude faster

The answer is actually simple.

First i compiled up your code on my macbook pro using apple clang 7.0:

$ cc --version
Apple LLVM version 7.0.0 (clang-700.1.76)
Target: x86_64-apple-darwin15.2.0
Thread model: posix

The results seem to match the OP's...

Boost.Spirit symbols result: 3425?=3425
10000 cycles took 8906ms.
Boyer-Moore results:3425?=3425
10000 cycles took 2891ms.
Boyer Moore Hospool result:3425?=3425
10000 cycles took 2392ms.
Knuth Morris Pratt result: 3425?=3425
10000 cycles took 4363ms.
Naive STL search and replace result: 3425?=3425
10000 cycles took 4333ms.
Boost replace_all result:3425?=3425
10000 cycles took 23284ms.
MFCMimicking result:3425?=3425
10000 cycles took 426ms.    <-- seemingly outstanding, no?

Then I added the -O3 flag:

Boost.Spirit symbols result: 3425?=3425
10000 cycles took 675ms.
Boyer-Moore results:3425?=3425
10000 cycles took 788ms.
Boyer Moore Hospool result:3425?=3425
10000 cycles took 623ms.
Knuth Morris Pratt result: 3425?=3425
10000 cycles took 1623ms.

Naive STL search and replace result: 3425?=3425
10000 cycles took 562ms.                    <-- pretty good!!!

Boost replace_all result:3425?=3425
10000 cycles took 748ms.
MFCMimicking result:3425?=3425
10000 cycles took 431ms.                    <-- awesome but not as outstanding as it was!

And now the results are in the same order of magnitude as the MFC CString results.

Why?

Because when you compile against BOOST and/or STL, you are expanding templates and the library code takes on the same optimisation settings as your compilation unit.

When you link against MFC, you're linking against a shared library that was compiled with optimisations turned on.

When you use strstr you're calling into the c library which is precompiled, optimised and in some parts, hand-written. Of course it's going to be fast!

solved :)

10000 cycles not 100000, different machine...

for reference, here are the results for the 100,000 cycles version running on the laptop on battery power. Full optimisation (-O3):

Boost.Spirit symbols result: 3425?=3425
100000 cycles took 6712ms.
Boyer-Moore results:3425?=3425
100000 cycles took 7923ms.
Boyer Moore Hospool result:3425?=3425
100000 cycles took 6091ms.
Knuth Morris Pratt result: 3425?=3425
100000 cycles took 16330ms.

Naive STL search and replace result: 3425?=3425
100000 cycles took 6719ms.

Boost replace_all result:3425?=3425
100000 cycles took 7353ms.

MFCMimicking result:3425?=3425
100000 cycles took 4076ms.

Reliance answered 22/12, 2015 at 12:47 Comment(10)

Clang 7.0? Also, this completely does not correspond to my findings. Clang 3.7 and GCC 5.2 did not make Spirit Qi among the faster. Also, that ignores the fact that it's still a lot faster to do a manual approach as in my answer or the MFCMimicking version the OP added in the end. – Belonging 22/12, 2015 at 15:0

@Belonging Apple have branched away from the main clang repo (idiots!) and have different version numbers. – Reliance 22/12, 2015 at 15:2

@RichardHodges It clearly shows superiority of Clang, Apple fine tuned highend hardware and engineering marvel of OSX. Just kidding of course. You ran it 10k cycles when I did it 100k cycles, a zero which easy to miss hence the tenfold "improvement" ;) Put aside you didn't run (of course) the MFC version, so you are not comparing apples to Apple since these results from different reference machines – Parve 22/12, 2015 at 16:38

@Parve added the mfc mimicking for comparison. I think you'll find it proves the point. – Reliance 22/12, 2015 at 16:47

of course it is not, it just proves that your compiler not that good at optimizing stuff. Check my edit, compare BM and MFC for example, 13:1 with -O2 10k iterations, 10:1 with -O3, 100k iterations. everything as expected. try to run GCC, try to run it on Coliru, AFAIR it runs with -O3 – Parve 22/12, 2015 at 17:7

@Parve the code you have linked to on Coliru has this line: size_t cycles = 10'000; which if I am not mistaken, is ten thousand. Are you sure that you tried it on MFC with 100,000? The difference in orders of magnitude would match a mistake made here. – Reliance 22/12, 2015 at 17:13

@RichardHodges, Coliru does not allow the code to run for too long, so I have to cut it down for 10k. In my EDIT003 I ran code on my machine and each result states the number of cycles it has ran, as you can see these result correspond to number of cycles, 10k yield 10 times (almost) faster results than 100k, which is kinda obvious – Parve 22/12, 2015 at 17:30

@Parve right... but the initial claim was that the MFC (native) version was 10x as fast as the stl. I believe this is a result of human error, not better algorithms in MFC. Suggest you check the code running on the windows machine to ensure that the number of cycles there is correct, and that it's building in release mode (optimisations on). I think you'll find that MFC is not as quick as it first appeared. – Reliance 22/12, 2015 at 18:2

Let us continue this discussion in chat. – Parve 22/12, 2015 at 18:21

@RichardHodges any insights on this odd issue you encountered with CLang on OSX? – Parve 28/12, 2015 at 13:43

Ok, it will be a long story. Just to remind you questions asked.

Why the search and replace using C++ (various approaches) so slow?
Why the MFC search and replace so fast?

Surprisingly, both questions have the same answer. Because of C++ overhead. Yep. Our shiny modern C++ has an overhead which mostly we dismiss for the flexibility and elegance.

However, when it comes to sub-microsecond resolutions (not that C++ is not capable of doing things in nanosecond resolutions) the overhead becomes more prominent.

Let me showcase with the same code I've posted in the question, but it is more coherent with things done in each function.

Live On Coliru

It uses aforementioned Nonius (thanks to @sehe) and interactive results are here.

You can click the legend to show/hide particular series.

Conclusions

There are two outstanding results

MFC mimicking function and
my own manual replace

These functions at least one order of magnitude are faster than the rest, so what is the difference?

I will not mark it as an answer for a while, waiting for comments on this conclusion.

I want to thank all those who actively participated in the discussion, raised ideas and pointed out inconsistencies in my example.

2019 update:
Just to preserve the code: https://github.com/kreuzerkrieg/string_search_replace
Nonius results: https://kreuzerkrieg.github.io/string_search_replace/

Ran with gcc-9 on Ubuntu 18.04

Parve answered 23/12, 2015 at 16:21 Comment(6)

I've tested it, then I've changed it a little, and most likely introduced some bug there. will fix it later, dont have an access to my machine today – Parve 24/12, 2015 at 5:4

@sehe, just tested it, works as expected, also checked the version of code that I've posted, it is the same one I'm running locally. Odd story... – Parve 27/12, 2015 at 5:20

ah, darn! The morale is dont run #define dependent code and expect same results! And thanks for assistance! – Parve 28/12, 2015 at 5:23

Could you please accept this answer so the ultimate conclusions get more airing? – Belonging 29/12, 2015 at 21:35

Just some update. I ran the original STL code (with search) vs MFC-inspired one and I got that with optimizations (-O2) stl-base gives 228ms while MFC-like gives 285ms. Without optimizations I get something like 7284ms vs 310ms. I do it on macbook2016Pro with i7-6700HQ CPU @ 2.60GHz. So basically the code which uses strstr could not be optimized while STL code was heavily optimized.

Then I ran the final version of naiveSTL code which uses find instead of search and it gave me 28ms. So definitely it is a winner. I added the code below in case if the link by @kreuzerkrieg will be invalid one day.

inline void naiveSTL(std::string& input, const TokenMap& tokens)
{
    input.reserve(input.size() * 2);
    for(const auto& token : tokens)
    {
        auto next = input.find(token.first);
        while(next != std::string::npos)
        {
            input.replace(next, token.first.size(), token.second);
            next = input.find(token.first, next + 1);
        }
    }
}

Upshot answered 19/7, 2019 at 12:15 Comment(1)

Indeed, stuff changed.I put together all code and dropped it to the github repo. Nonius results are available online. The answer is updated with links. However, bottom line not changed :) – Parve 20/7, 2019 at 16:53

Your questionable use of std::string:replace is so pointlessly slow that nothing else in the code matters.

Gallagher answered 21/12, 2015 at 19:9 Comment(4)

this pointless replace beats the cool kid on the block - boost.spirit – Parve 21/12, 2015 at 19:25

I'm not aware of Boost.Spirit making any performance claims. It's purpose is ease-of-use, not blindingly fast performance for string matching. – Masquerade 21/12, 2015 at 20:45

@MaxLybbert: Boost.Spirit does occasionally claim to be extremely fast, and I don't think ease of use is really its forte. Maybe "ease of use, for those who are already experts" :) boost-spirit.com/home/2014/09/03/… I think its pretty hard to compete on low level algorithms for string matching anyways, there are only so many things you can do. – Juice 21/12, 2015 at 21:53

@MaxLybbert In case you want to see, added comparative benchmarks with charts to my answer – Belonging 21/12, 2015 at 22:59

Hot tags

Godot Unity Godot Help Programming Godot 4.X GUI GDScript 3D 2D Physics CSharp Godot 3.X VR XR Projects C++

Conclusions

About the Qi version

Manual Loops

Other ideas

COMPARISONS:

Conclusions

Recommended topics

Hot tags