Most optimized way of concatenation in strings

Asked 19/9, 2013 at 10:31 Answered 18/2, 2022 at 10:18

We always came across many situation on daily basis wherein we have to do tedious and very many string operations in our code. We all know that string manipulations are expensive operations. I would like to know which is the least expensive among the available versions.

The most common operations is concatenation(This is something that we can control to some extent). What is the best way to concatenate std::strings in C++ and various workarounds to speed up concatenation?

I mean,

std::string l_czTempStr;

1).l_czTempStr = "Test data1" + "Test data2" + "Test data3";

2). l_czTempStr =  "Test data1"; 
    l_czTempStr += "Test data2";
    l_czTempStr += "Test data3";

3). using << operator

4). using append()

Also, do we get any advantage of using CString over std::string?

Eelpout answered 19/9, 2013 at 10:31 Comment(11)

Why can't you measure? Anyway, stringstream is built for this use case, string is not. So it is probably a good bet to start out with stringstream. – Bicapsular 19/9, 2013 at 10:34

1. is not legal, ITYM l_czTempStr = std::string("Test data1") + "Test data2" + "Test data3";. Other than that the answer is to time the different techniques. There are so many variables that it's is impossible to answer the question. The answer depends on the number and length of strings you are working with, plus the platform you are compiling on, and the platform you are compiling for. – Consolute 19/9, 2013 at 10:35

is it actually a bottleneck? Then benchmark it. In general, the fastest method is to pre-allocate enough space for all the data before appending any of it, and avoid using temporaries (+ creates a new object, with some special cases in C++11). But don't optimise this unless you need to or your code will be unreadable. – Witty 19/9, 2013 at 10:36

@MagnusHoff You've got it backwards. std::ostringstream is designed for formatting, and should normally only be used when you need formatting. All of his data are strings, so std::string and concatenation are the preferred solution. – Petua 19/9, 2013 at 10:42

As a side note: For very long strings, using a Rope instead of a string might be a good idea. – Coumas 19/9, 2013 at 10:49

@JamesKanze I see. I have, however, had really poor experience with the performance of string concatenation. Do you have a link that can elaborate on this "preferred solution" claim, that I might educate myself? – Bicapsular 19/9, 2013 at 10:57

@MagnusHoff Do you need more than the definition of std::ostream? Or "formatted output" (the term used in the standard to describe << on an std::ostream)? – Petua 19/9, 2013 at 11:0

@JamesKanze That is definitely enough to make me reconsider. But as I said, I have contradictory experience. Experience tends to trump a two word argument ;) But if you don't have a link on hand, let's not pollute this comment thread further :) – Bicapsular 19/9, 2013 at 11:7

@MagnusHoff: This is true in Java, C# or Python because they use String Interning, meaning that any string is immutable. The C++ Standard specific guarantees preclude interning (or Copy On Write) so normally direct appending to a string is as efficient as you can get and using a ostream can only add overhead to the process (multiple virtual calls). – Lycurgus 19/9, 2013 at 13:15

Do you mean to reuse the same buffer (l_czTempStr) over and over, or do you use a new buffer each time ? Do you have the ability to reserve the memory when creating the buffer or do you append blind (causing reallocations) ? – Lycurgus 19/9, 2013 at 13:17

Possible duplicate of Efficient string concatenation in C++ – Ursala 25/6, 2017 at 8:32

Here is a small test suite:

#include <iostream>
#include <string>
#include <chrono>
#include <sstream>

int main ()
{
    typedef std::chrono::high_resolution_clock clock;
    typedef std::chrono::duration<float, std::milli> mil;
    std::string l_czTempStr;
    std::string s1="Test data1";
    auto t0 = clock::now();
    #if VER==1
    for (int i = 0; i < 100000; ++i)
    {
        l_czTempStr = s1 + "Test data2" + "Test data3";
    }
    #elif VER==2
    for (int i = 0; i < 100000; ++i)
    {
        l_czTempStr =  "Test data1"; 
        l_czTempStr += "Test data2";
        l_czTempStr += "Test data3";
    }
    #elif VER==3
    for (int i = 0; i < 100000; ++i)
    {
        l_czTempStr =  "Test data1"; 
        l_czTempStr.append("Test data2");
        l_czTempStr.append("Test data3");
    }
    #elif VER==4
    for (int i = 0; i < 100000; ++i)
    {
        std::ostringstream oss;
        oss << "Test data1";
        oss << "Test data2";
        oss << "Test data3";
        l_czTempStr = oss.str();
    }
    #endif
    auto t1 = clock::now();
    std::cout << l_czTempStr << '\n';
    std::cout << mil(t1-t0).count() << "ms\n";
}

On coliru:

Compile with the following:

clang++ -std=c++11 -O3 -DVER=1 -Wall -pedantic -pthread main.cpp

21.6463ms

-DVER=2

6.61773ms

-DVER=3

6.7855ms

-DVER=4

102.015ms

It looks like 2), += is the winner.

(Also compiling with and without -pthread seems to affect the timings)

Coaptation answered 19/9, 2013 at 10:34 Comment(13)

Nice! You numbering has 3) and 4) swapped wrt the question. Given the not too big differences it looks like the only firm conclusion is to avoid streams. This of course not only depends on the compiler (revision), but also on the stdlib implementation (I think one of GCC's on coliru). – Naumachia 19/9, 2013 at 11:12

The test may not be representative, unfortunately. The issue is that by not including l_czTempStr declaration within the loop, versions 2 and 3 reuse the same buffer over and over, whilst version 1 creates a new buffer std::string{""} each time. Your benchmark demonstrated that reusing the same buffer instead of allocating/deallocating provided a 5x speed-up (the speed-up was obvious, the factor depends on how long the pieces are and how many re-allocations occur if you do not reserve everything up-front). I am unsure whether the OP meant to be re-using the same buffer or not, though. – Lycurgus 19/9, 2013 at 13:13

@MatthieuM.: +1 good point. I updated the code so the initial string is created beforehand in version 1, however operator+ still suffers from a lot of allocating/deallocating internally. – Coaptation 19/9, 2013 at 21:51

Your stringstream benchmark includes the (incredibly) slow timing of the construction of the stream. – Overboard 16/2, 2014 at 3:1

@Rapptz: Yes, but if you move it out of the loop you have still have to clear the contents each loop. – Coaptation 20/2, 2014 at 3:46

What if you use ostringstream::clear instead of construction? – Multiparous 29/9, 2015 at 12:46

@oferei: It should be a little faster than construction. It still does not compare to the other ways though. Also, note you need to reset the string to empty. clear only resets the error flags. – Coaptation 29/9, 2015 at 13:3

yes this is flawed because the oss object is constructed/destructed every loop. If instead you construct it ahead of the loop and instead do a oss.str("") at the beginning of the loop, it is significantly faster (3x in my testing) – Seamstress 10/1, 2017 at 14:59

Also, for kicks, I changed one of the strings to a number that had to be converted to a string. For the first couple of cases I used std::to_string which slowed things down almost 10x. However in the case of ostringstream, it didn't make a difference (didn't need to use std::to_string, which apparently is really slow) – Seamstress 10/1, 2017 at 15:3

"It looks like 2), += is the winner." I'm not sure your results show any statistical difference from .append(). And nor should they: one of these is implemented in terms of the other. I don't see why they would be significantly different. – Ursala 25/6, 2017 at 8:31

People are right about ostringstream construction inside the loop. It's not a fair comparison and the test should be redone. Use oss.str("") to clear on each iteration. – Docila 17/10, 2018 at 15:35

what about push_back? can that be 5th version? – Largent 21/11, 2018 at 4:4

Indeed Matthieu M.'s point is valid. I tried the following variant and the speed diff became negligible: void f(std::string* s, int i) { *s = std::to_string(i); } .. f(&s1, i); l_czTempStr = s1; VER1: 53.986ms VER2: 53.2521ms – Loincloth 28/12, 2018 at 6:40

In addition to other answers...

I made extensive benchmarks about this problem some time ago, and came to the conclusion that the most efficient solution (GCC 4.7 & 4.8 on Linux x86 / x64 / ARM) in all use cases is first to reserve() the result string with enough space to hold all the concatenated strings, and then only append() them (or use operator +=(), that makes no difference).

Unfortunately it seems I deleted that benchmark so you only have my word (but you can easily adapt Mats Petersson's benchmark to verify this by yourself, if my word isn't enough).

In a nutshell:

const string space = " ";
string result;
result.reserve(5 + space.size() + 5);
result += "hello";
result += space;
result += "world";

Depending on the exact use case (number, types and sizes of the concatenated strings), sometimes this method is by far the most efficient, and other times it is on par with other methods, but it is never worse.

Problem is, this is really painful to compute the total required size in advance, especially when mixing string literals and std::string (the example above is clear enough on that matter, I believe). The maintainability of such code is absolutely horrible as soon as you modify one of the literals or add another string to be concatenated.

One approach would be to use sizeof to compute the size of the literals, but IMHO it creates as much mess than it solves, the maintainability is still terrible:

#define STR_HELLO "hello"
#define STR_WORLD "world"

const string space = " ";
string result;
result.reserve(sizeof(STR_HELLO)-1 + space.size() + sizeof(STR_WORLD)-1);
result += STR_HELLO;
result += space;
result += STR_WORLD;

A usable solution (C++11, variadic templates)

I finally settled for a set of variadic templates that efficiently take care of calculating the string sizes (eg. the size of string literals is determined at compile time), reserve() as needed, and then concatenate everything.

Here it is, hope this is useful:

namespace detail {

  template<typename>
  struct string_size_impl;

  template<size_t N>
  struct string_size_impl<const char[N]> {
    static constexpr size_t size(const char (&) [N]) { return N - 1; }
  };

  template<size_t N>
  struct string_size_impl<char[N]> {
    static size_t size(char (&s) [N]) { return N ? strlen(s) : 0; }
  };

  template<>
  struct string_size_impl<const char*> {
    static size_t size(const char* s) { return s ? strlen(s) : 0; }
  };

  template<>
  struct string_size_impl<char*> {
    static size_t size(char* s) { return s ? strlen(s) : 0; }
  };

  template<>
  struct string_size_impl<std::string> {
    static size_t size(const std::string& s) { return s.size(); }
  };

  template<typename String> size_t string_size(String&& s) {
    using noref_t = typename std::remove_reference<String>::type;
    using string_t = typename std::conditional<std::is_array<noref_t>::value,
                                              noref_t,
                                              typename std::remove_cv<noref_t>::type
                                              >::type;
    return string_size_impl<string_t>::size(s);
  }

  template<typename...>
  struct concatenate_impl;

  template<typename String>
  struct concatenate_impl<String> {
    static size_t size(String&& s) { return string_size(s); }
    static void concatenate(std::string& result, String&& s) { result += s; }
  };

  template<typename String, typename... Rest>
  struct concatenate_impl<String, Rest...> {
    static size_t size(String&& s, Rest&&... rest) {
      return string_size(s)
           + concatenate_impl<Rest...>::size(std::forward<Rest>(rest)...);
    }
    static void concatenate(std::string& result, String&& s, Rest&&... rest) {
      result += s;
      concatenate_impl<Rest...>::concatenate(result, std::forward<Rest>(rest)...);
    }
  };

} // namespace detail

template<typename... Strings>
std::string concatenate(Strings&&... strings) {
  std::string result;
  result.reserve(detail::concatenate_impl<Strings...>::size(std::forward<Strings>(strings)...));
  detail::concatenate_impl<Strings...>::concatenate(result, std::forward<Strings>(strings)...);
  return result;
}

The only interesting part, as far as the public interface is concerned, is the very last template<typename... Strings> std::string concatenate(Strings&&... strings) template. Usage is straightforward:

int main() {
  const string space = " ";
  std::string result = concatenate("hello", space, "world");
  std::cout << result << std::endl;
}

With optimizations turned on, any decent compiler should be able to expand the concatenate call to the same code as my first example where I manually wrote everything. As far as GCC 4.7 & 4.8 are concerned, the generated code is pretty much identical as well as the performance.

Roede answered 19/9, 2013 at 15:34 Comment(4)

I don't understand the reason for using universal references here. Could you explain what advantage they could provide over normal (lvalue) references to const? – Tired 22/9, 2015 at 2:39

Have you done a wstring implementation of this? It looks like it would be straightforward. – Katlaps 12/7, 2017 at 23:57

with or without "reserve" function shows no diff in my test , sometimes "reserve" function worse. – Implication 16/11, 2017 at 3:5

@Hao, perhaps "hello world" fits in a std::string that uses the small string optimization, thus the reserve operation effectively does nothing. – Spoil 27/1, 2020 at 17:7

The WORST possible scenario is using plain old strcat (or sprintf), since strcat takes a C string, and that has to be "counted" to find the end. For long strings, that's a real performance sufferer. C++ style strings are much better, and the performance problems are likely to be with the memory allocation, rather than counting lengths. But then again, the string grows geometrically (doubles each time it needs to grow), so it's not that terrible.

I'd very much suspect that all of the above methods end up with the same, or at least very similar, performance. If anything, I'd expect that stringstream is slower, because of the overhead in supporting formatting - but I also suspect it's marginal.

As this sort of thing is "fun", I will get back with a benchmark...

Edit:

Note that these result apply to MY machine, running x86-64 Linux, compiled with g++ 4.6.3. Other OS's, compilers and C++ runtime library implementations may vary. If performance is important to your application, then benchmark on the system(s) that are critical for you, using the compiler(s) that you use.

Here's the code I wrote to test this. It may not be the perfect representation of a real scenario, but I think it's a representative scenario:

#include <iostream>
#include <iomanip>
#include <string>
#include <sstream>
#include <cstring>

using namespace std;

static __inline__ unsigned long long rdtsc(void)
{
    unsigned hi, lo;
    __asm__ __volatile__ ("rdtsc" : "=a"(lo), "=d"(hi));
    return ( (unsigned long long)lo)|( ((unsigned long long)hi)<<32 );
}

string build_string_1(const string &a, const string &b, const string &c)
{
    string out = a + b + c;
    return out;
}

string build_string_1a(const string &a, const string &b, const string &c)
{
    string out;
    out.resize(a.length()*3);
    out = a + b + c;
    return out;
}

string build_string_2(const string &a, const string &b, const string &c)
{
    string out = a;
    out += b;
    out += c;
    return out;
}

string build_string_3(const string &a, const string &b, const string &c)
{
    string out;
    out = a;
    out.append(b);
    out.append(c);
    return out;
}


string build_string_4(const string &a, const string &b, const string &c)
{
    stringstream ss;

    ss << a << b << c;
    return ss.str();
}


char *build_string_5(const char *a, const char *b, const char *c)
{
    char* out = new char[strlen(a) * 3+1];
    strcpy(out, a);
    strcat(out, b);
    strcat(out, c);
    return out;
}



template<typename T>
size_t len(T s)
{
    return s.length();
}

template<>
size_t len(char *s)
{
    return strlen(s);
}

template<>
size_t len(const char *s)
{
    return strlen(s);
}



void result(const char *name, unsigned long long t, const string& out)
{
    cout << left << setw(22) << name << " time:" << right << setw(10) <<  t;
    cout << "   (per character: " 
         << fixed << right << setw(8) << setprecision(2) << (double)t / len(out) << ")" << endl;
}

template<typename T>
void benchmark(const char name[], T (Func)(const T& a, const T& b, const T& c), const char *strings[])
{
    unsigned long long t;

    const T s1 = strings[0];
    const T s2 = strings[1];
    const T s3 = strings[2];
    t = rdtsc();
    T out = Func(s1, s2, s3);
    t = rdtsc() - t; 

    if (len(out) != len(s1) + len(s2) + len(s3))
    {
        cout << "Error: out is different length from inputs" << endl;
        cout << "Got `" << out << "` from `" << s1 << "` + `" << s2 << "` + `" << s3 << "`";
    }
    result(name, t, out);
}


void benchmark(const char name[], char* (Func)(const char* a, const char* b, const char* c), 
               const char *strings[])
{
    unsigned long long t;

    const char* s1 = strings[0];
    const char* s2 = strings[1];
    const char* s3 = strings[2];
    t = rdtsc();
    char *out = Func(s1, s2, s3);
    t = rdtsc() - t; 

    if (len(out) != len(s1) + len(s2) + len(s3))
    {
        cout << "Error: out is different length from inputs" << endl;
        cout << "Got `" << out << "` from `" << s1 << "` + `" << s2 << "` + `" << s3 << "`";
    }
    result(name, t, out);
    delete [] out;
}


#define BM(func, size) benchmark(#func " " #size, func, strings ## _ ## size)


#define BM_LOT(size) BM(build_string_1, size); \
    BM(build_string_1a, size); \
    BM(build_string_2, size); \
    BM(build_string_3, size); \
    BM(build_string_4, size); \
    BM(build_string_5, size);

int main()
{
    const char *strings_small[]  = { "Abc", "Def", "Ghi" };
    const char *strings_medium[] = { "abcdefghijklmnopqrstuvwxyz", 
                                     "defghijklmnopqrstuvwxyzabc", 
                                     "ghijklmnopqrstuvwxyzabcdef" };
    const char *strings_large[]   = 
        { "abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyz"
          "abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyz"
          "abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyz"
          "abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyz"
          "abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyz"
          "abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyz"
          "abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyz"
          "abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyz"
          "abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyz"
          "abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyz", 

          "defghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabc" 
          "defghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabc" 
          "defghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabc" 
          "defghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabc" 
          "defghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabc"

          "defghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabc" 
          "defghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabc" 
          "defghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabc" 
          "defghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabc" 
          "defghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabc", 

          "ghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdef"
          "ghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdef"
          "ghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdef"
          "ghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdef"
          "ghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdef"
          "ghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdef"
          "ghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdef"
          "ghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdef"
          "ghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdef"
          "ghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdef"
        };

    for(int i = 0; i < 5; i++)
    {
        BM_LOT(small);
        BM_LOT(medium);
        BM_LOT(large);
        cout << "---------------------------------------------" << endl;
    }
}

Here are some representative results:

build_string_1 small   time:      4075   (per character:   452.78)
build_string_1a small  time:      5384   (per character:   598.22)
build_string_2 small   time:      2669   (per character:   296.56)
build_string_3 small   time:      2427   (per character:   269.67)
build_string_4 small   time:     19380   (per character:  2153.33)
build_string_5 small   time:      6299   (per character:   699.89)
build_string_1 medium  time:      3983   (per character:    51.06)
build_string_1a medium time:      6970   (per character:    89.36)
build_string_2 medium  time:      4072   (per character:    52.21)
build_string_3 medium  time:      4000   (per character:    51.28)
build_string_4 medium  time:     19614   (per character:   251.46)
build_string_5 medium  time:      6304   (per character:    80.82)
build_string_1 large   time:      8491   (per character:     3.63)
build_string_1a large  time:      9563   (per character:     4.09)
build_string_2 large   time:      6154   (per character:     2.63)
build_string_3 large   time:      5992   (per character:     2.56)
build_string_4 large   time:     32450   (per character:    13.87)
build_string_5 large   time:     15768   (per character:     6.74)

Same code, run as 32-bit:

build_string_1 small   time:      4289   (per character:   476.56)
build_string_1a small  time:      5967   (per character:   663.00)
build_string_2 small   time:      3329   (per character:   369.89)
build_string_3 small   time:      3047   (per character:   338.56)
build_string_4 small   time:     22018   (per character:  2446.44)
build_string_5 small   time:      3026   (per character:   336.22)
build_string_1 medium  time:      4089   (per character:    52.42)
build_string_1a medium time:      8075   (per character:   103.53)
build_string_2 medium  time:      4569   (per character:    58.58)
build_string_3 medium  time:      4326   (per character:    55.46)
build_string_4 medium  time:     22751   (per character:   291.68)
build_string_5 medium  time:      2252   (per character:    28.87)
build_string_1 large   time:      8695   (per character:     3.72)
build_string_1a large  time:     12818   (per character:     5.48)
build_string_2 large   time:      8202   (per character:     3.51)
build_string_3 large   time:      8351   (per character:     3.57)
build_string_4 large   time:     38250   (per character:    16.35)
build_string_5 large   time:      8143   (per character:     3.48)

From this, we can conclude:

The best option is appending a bit at a time (out.append() or out +=), with the "chained" approach reasonably close.
Pre-allocating the string is not helpful.
Using stringstream is pretty poor idea (between 2-4x slower).
The char * uses new char[]. Using a local variable in the calling function makes it the fastest - but slightly unfairly to compare that.
There is a fair bit of overhead in combining short string - just copying data should be at most one cycle per byte [unless the data doesn't fit in the cache].

edit2

Added, as per comments:

string build_string_1b(const string &a, const string &b, const string &c)
{
    return a + b + c;
}

and

string build_string_2a(const string &a, const string &b, const string &c)
{
    string out;
    out.reserve(a.length() * 3);
    out += a;
    out += b;
    out += c;
    return out;
}

Which gives these results:

build_string_1 small   time:      3845   (per character:   427.22)
build_string_1b small  time:      3165   (per character:   351.67)
build_string_2 small   time:      3176   (per character:   352.89)
build_string_2a small  time:      1904   (per character:   211.56)

build_string_1 large   time:      9056   (per character:     3.87)
build_string_1b large  time:      6414   (per character:     2.74)
build_string_2 large   time:      6417   (per character:     2.74)
build_string_2a large  time:      4179   (per character:     1.79)

(A 32-bit run, but the 64-bit shows very similar results on these).

Betide answered 19/9, 2013 at 10:44 Comment(4)

Nice benchmark, +1. Concerning 1a (pre-allocating the string) in reality you are throwing away the pre-allocated buffer: the result of operator +() is a temporary that is moved (or RVO'd) into out so the pre-allocation is useless. An interesting benchmark would be to create 2a / 3a cases where you reserve() the result string ahead, and then append() or += all the parameters to your result string. As I explain in my answer, I did such benchmarks some time ago and came to the conclusion that this is indeed the most efficient solution. – Roede 19/9, 2013 at 15:40

I took your code and added a build_string_1b function that just did return a + b + c; which turned out to be the fastest function in some runs (VS2012). – Canarese 19/9, 2013 at 15:46

Just nitpicking concerning 2a: currently you have two memory allocations (copy a then reserve), this could be further improved by using reserve on an empty string, and only then += all your parameters (which gives you a single memory allocation, the reserve). I'd edit that myself but the timings are for your machine so I'll let you do it. ;) – Roede 19/9, 2013 at 15:59

@Syam: I did actually write that eventually, but I must have copied the code wront - results are for the code I've posted now. – Betide 19/9, 2013 at 16:4

As with most micro-optimisations, you will need to measure the effect of each option, having first established through measurement that this is indeed a bottle-neck worth optimising. There is no definitive answer.

append and += should do exactly the same thing.

+ is conceptually less efficient, since you're creating and destroying temporaries. Your compiler may or may not be able to optimise this to be as fast as appending.

Calling reserve with the total size may reduce the number of memory allocations needed - they will probably be the biggest bottleneck.

<< (presumably on a stringstream) may or may not be faster; you'll need to measure that. It's useful if you need to format non-string types, but probably won't be particularly better or worse at dealing with strings.

CString has the disadvantage that it's not portable, and that a Unix hacker like me can't tell you what its advantages may or may not be.

Alejandroalejo answered 19/9, 2013 at 10:45 Comment(0)

I decided to run a test with the code provided by user Jesse Good, slightly modified to take into account the observation of Rapptz, specifically the fact that the ostringstream was constructed in each single iteration of the loop. Therefore I added some cases, a couple of them being the ostringstream cleared with the sequence "oss.str(""); oss.clear()"

Here is the code

#include <iostream>
#include <string>
#include <chrono>
#include <sstream>
#include <functional>


template <typename F> void time_measurement(F f, const std::string& comment)
{
    typedef std::chrono::high_resolution_clock clock;
    typedef std::chrono::duration<float, std::milli> mil;
    std::string r;
    auto t0 = clock::now();
    f(r);
    auto t1 = clock::now();
    std::cout << "\n-------------------------" << comment << "-------------------\n" <<r << '\n';
    std::cout << mil(t1-t0).count() << "ms\n";
    std::cout << "---------------------------------------------------------------------------\n";

}

inline void clear(std::ostringstream& x)
{
    x.str("");
    x.clear();
}

void test()
{
    std:: cout << std::endl << "----------------String Comparison---------------- " << std::endl;
    const int n=100000;
    {
        auto f=[](std::string& l_czTempStr)
        {
            std::string s1="Test data1";
            for (int i = 0; i < n; ++i)
            {
                l_czTempStr = s1 + "Test data2" + "Test data3";
            }
        };
        time_measurement(f, "string, plain addition");
   }

   {
        auto f=[](std::string& l_czTempStr)
        {
            for (int i = 0; i < n; ++i)
            {
                l_czTempStr =  "Test data1";
                l_czTempStr += "Test data2";
                l_czTempStr += "Test data3";
            }
        };
        time_measurement(f, "string, incremental");
    }

    {
         auto f=[](std::string& l_czTempStr)
         {
            for (int i = 0; i < n; ++i)
            {
                l_czTempStr =  "Test data1";
                l_czTempStr.append("Test data2");
                l_czTempStr.append("Test data3");
            }
         };
         time_measurement(f, "string, append");
     }

    {
         auto f=[](std::string& l_czTempStr)
         {
            for (int i = 0; i < n; ++i)
            {
                std::ostringstream oss;
                oss << "Test data1";
                oss << "Test data2";
                oss << "Test data3";
                l_czTempStr = oss.str();
            }
         };
         time_measurement(f, "oss, creation in each loop, incremental");
     }

    {
         auto f=[](std::string& l_czTempStr)
         {
            std::ostringstream oss;
            for (int i = 0; i < n; ++i)
            {
                oss.str("");
                oss.clear();
                oss << "Test data1";
                oss << "Test data2";
                oss << "Test data3";
            }
            l_czTempStr = oss.str();
         };
         time_measurement(f, "oss, 1 creation, incremental");
     }

    {
         auto f=[](std::string& l_czTempStr)
         {
            std::ostringstream oss;
            for (int i = 0; i < n; ++i)
            {
                oss.str("");
                oss.clear();
                oss << "Test data1" << "Test data2" << "Test data3";
            }
            l_czTempStr = oss.str();
         };
         time_measurement(f, "oss, 1 creation, plain addition");
     }

    {
         auto f=[](std::string& l_czTempStr)
         {
            std::ostringstream oss;
            for (int i = 0; i < n; ++i)
            {
                clear(oss);
                oss << "Test data1" << "Test data2" << "Test data3";
            }
            l_czTempStr = oss.str();
         };
         time_measurement(f, "oss, 1 creation, clearing calling inline function, plain addition");
     }


    {
         auto f=[](std::string& l_czTempStr)
         {
            for (int i = 0; i < n; ++i)
            {
                std::string x;
                x =  "Test data1";
                x.append("Test data2");
                x.append("Test data3");
                l_czTempStr=x;
            }
         };
         time_measurement(f, "string, creation in each loop");
     }

}

Here are the results:

/*

g++ "qtcreator debug mode"
----------------String Comparison---------------- 

-------------------------string, plain addition-------------------
Test data1Test data2Test data3
11.8496ms
---------------------------------------------------------------------------

-------------------------string, incremental-------------------
Test data1Test data2Test data3
3.55597ms
---------------------------------------------------------------------------

-------------------------string, append-------------------
Test data1Test data2Test data3
3.53099ms
---------------------------------------------------------------------------

-------------------------oss, creation in each loop, incremental-------------------
Test data1Test data2Test data3
58.1577ms
---------------------------------------------------------------------------

-------------------------oss, 1 creation, incremental-------------------
Test data1Test data2Test data3
11.1069ms
---------------------------------------------------------------------------

-------------------------oss, 1 creation, plain addition-------------------
Test data1Test data2Test data3
10.9946ms
---------------------------------------------------------------------------

-------------------------oss, 1 creation, clearing calling inline function, plain addition-------------------
Test data1Test data2Test data3
10.9502ms
---------------------------------------------------------------------------

-------------------------string, creation in each loop-------------------
Test data1Test data2Test data3
9.97495ms
---------------------------------------------------------------------------


g++ "qtcreator release mode" (optimized)
----------------String Comparison----------------

-------------------------string, plain addition-------------------
Test data1Test data2Test data3
8.41622ms
---------------------------------------------------------------------------

-------------------------string, incremental-------------------
Test data1Test data2Test data3
2.55462ms
---------------------------------------------------------------------------

-------------------------string, append-------------------
Test data1Test data2Test data3
2.5154ms
---------------------------------------------------------------------------

-------------------------oss, creation in each loop, incremental-------------------
Test data1Test data2Test data3
54.3232ms
---------------------------------------------------------------------------

-------------------------oss, 1 creation, incremental-------------------
Test data1Test data2Test data3
8.71854ms
---------------------------------------------------------------------------

-------------------------oss, 1 creation, plain addition-------------------
Test data1Test data2Test data3
8.80526ms
---------------------------------------------------------------------------

-------------------------oss, 1 creation, clearing calling inline function, plain addition-------------------
Test data1Test data2Test data3
8.78186ms
---------------------------------------------------------------------------

-------------------------string, creation in each loop-------------------
Test data1Test data2Test data3
8.4034ms
---------------------------------------------------------------------------
*/

Now using std::string is still faster, and the append is still the fastest way of concatenation, but ostringstream is no more so incredibly terrible like it was before.

Rusticate answered 14/3, 2018 at 0:2 Comment(0)

So as this question's accepted answer is quite old I've decided to update it's benchmarks with modern compiler and compare both solutions by @jesse-good and template version from @syam

Here is the combined code:

#include <iostream>
#include <string>
#include <chrono>
#include <sstream>
#include <vector>
#include <cstring>


#if VER==TEMPLATE
namespace detail {

  template<typename>
  struct string_size_impl;

  template<size_t N>
  struct string_size_impl<const char[N]> {
    static constexpr size_t size(const char (&) [N]) { return N - 1; }
  };

  template<size_t N>
  struct string_size_impl<char[N]> {
    static size_t size(char (&s) [N]) { return N ? strlen(s) : 0; }
  };

  template<>
  struct string_size_impl<const char*> {
    static size_t size(const char* s) { return s ? strlen(s) : 0; }
  };

  template<>
  struct string_size_impl<char*> {
    static size_t size(char* s) { return s ? strlen(s) : 0; }
  };

  template<>
  struct string_size_impl<std::string> {
    static size_t size(const std::string& s) { return s.size(); }
  };

  template<typename String> size_t string_size(String&& s) {
    using noref_t = typename std::remove_reference<String>::type;
    using string_t = typename std::conditional<std::is_array<noref_t>::value,
                                              noref_t,
                                              typename std::remove_cv<noref_t>::type
                                              >::type;
    return string_size_impl<string_t>::size(s);
  }

  template<typename...>
  struct concatenate_impl;

  template<typename String>
  struct concatenate_impl<String> {
    static size_t size(String&& s) { return string_size(s); }
    static void concatenate(std::string& result, String&& s) { result += s; }
  };

  template<typename String, typename... Rest>
  struct concatenate_impl<String, Rest...> {
    static size_t size(String&& s, Rest&&... rest) {
      return string_size(s)
           + concatenate_impl<Rest...>::size(std::forward<Rest>(rest)...);
    }
    static void concatenate(std::string& result, String&& s, Rest&&... rest) {
      result += s;
      concatenate_impl<Rest...>::concatenate(result, std::forward<Rest>(rest)...);
    }
  };

} // namespace detail

template<typename... Strings>
std::string concatenate(Strings&&... strings) {
  std::string result;
  result.reserve(detail::concatenate_impl<Strings...>::size(std::forward<Strings>(strings)...));
  detail::concatenate_impl<Strings...>::concatenate(result, std::forward<Strings>(strings)...);
  return result;
}

#endif

int main ()
{
typedef std::chrono::high_resolution_clock clock;
typedef std::chrono::duration<float, std::milli> ms;
std::string l_czTempStr;


std::string s1="Test data1";


auto t0 = clock::now();
#if VER==PLUS
for (int i = 0; i < 100000; ++i)
{
    l_czTempStr = s1 + "Test data2" + "Test data3";
}
#elif VER==PLUS_EQ
for (int i = 0; i < 100000; ++i)
{
    l_czTempStr =  "Test data1"; 
    l_czTempStr += "Test data2";
    l_czTempStr += "Test data3";
}
#elif VER==APPEND
for (int i = 0; i < 100000; ++i)
{
    l_czTempStr =  "Test data1"; 
    l_czTempStr.append("Test data2");
    l_czTempStr.append("Test data3");
}
#elif VER==STRSTREAM
for (int i = 0; i < 100000; ++i)
{
    std::ostringstream oss;
    oss << "Test data1";
    oss << "Test data2";
    oss << "Test data3";
    l_czTempStr = oss.str();
}
#elif VER=TEMPLATE
for (int i = 0; i < 100000; ++i)
{
    l_czTempStr = concatenate(s1, "Test data2", "Test data3");
}
#endif

#define STR_(x) #x
#define STR(x) STR_(x)

auto t1 = clock::now();
//std::cout << l_czTempStr << '\n';
std::cout << STR(VER) ": " << ms(t1-t0).count() << "ms\n";
}

The test instruction:

for ARGTYPE in PLUS PLUS_EQ APPEND STRSTREAM TEMPLATE; do for i in `seq 4` ; do clang++ -std=c++11 -O3 -DVER=$ARGTYPE -Wall -pthread -pedantic main.cpp && ./a.out ; rm ./a.out ; done; done

And results (processed through spreadsheet to show average time):

PLUS       23.5792   
PLUS       23.3812   
PLUS       35.1806   
PLUS       15.9394   24.5201
PLUS_EQ    15.737    
PLUS_EQ    15.3353   
PLUS_EQ    10.7764   
PLUS_EQ    25.245    16.773425
APPEND     22.954    
APPEND     16.9031   
APPEND     10.336    
APPEND     19.1348   17.331975
STRSTREAM  10.2063   
STRSTREAM  10.7765   
STRSTREAM  13.262    
STRSTREAM  22.3557   14.150125
TEMPLATE   16.6531   
TEMPLATE   16.629    
TEMPLATE   22.1885   
TEMPLATE   16.9288   18.09985

The surprise is strstream which seems to have a lot of benefit from C++11 and later improvements. Probably removal of necessary allocation due to introduction of move semantics has some influence.

You can test it on your own on coliru

Edit: I've updated test on coliru to use g++-4.8: http://coliru.stacked-crooked.com/a/593dcfe54e70e409. Results in graph here:

(explanation - "stat. average" means average over all values except two extreme ones - one minimal and one maximal value)

Woolpack answered 11/10, 2019 at 11:10 Comment(4)

It's still worth noting that YMMV and you should perform your own benchmarks, as my compile against g++ 4.8.5 with various other flags enabled resulted in stream performance roughly double to that of += and append as well as a 50% increase compared to +. – Feltie 29/1, 2020 at 18:17

@Feltie - 100% true - every aspects matters. However 2x is still much better than ~20x originally reported by Jesse Good. Here are my (combined - sum of 4 runs) results with g++-7 BTW: PLUS: 18.49352, PLUS_EQ: 19.03214, APPEND: 18.70595, STRSTREAM: 19.17043, TEMPLATE: 21.98324 – Woolpack 30/1, 2020 at 10:50

tried to repeat tests with g++-4.8 on coliru: coliru.stacked-crooked.com/a/593dcfe54e70e409, graph here: i.imgur.com/w4TXPO3.png (stat. avg. is average of all minus min. and max. values) – Woolpack 30/1, 2020 at 12:50

I tried your code on my system as well and was surprised that the values differ so much in contrast to what I found with the old code. Looks like the ifdefs do not work properly. And that is why the results are so close. changing it back to numbers and adding appropriate prints under each ifdef gives me: PLUS: 1: 2.20834ms, PLUS_EQ: 2: 0.340278ms, APPEND: 3: 0.310837ms, STRSTREAM: 4: 10.5987ms, TEMPLATE: 5: 2.12325ms gcc-13.2/Ryzen5950X I'd be happy if the stringstream was catching up.... but seams like it didn't – Execution 9/11, 2023 at 11:44

Using C++17 this simple solution should have very good performance, in most cases comparable to @syam's template-heavy solution. In some conditions it will be even faster, avoiding unnecessary strlen calls.

#include <string>
#include <string_view>

template <typename... T>
std::string concat(T ...args) {
    std::string result;
    std::string_view views[] { args... };
    std::string::size_type full_size = 0;
    for (auto sub_view : views)
        full_size += sub_view.size();
    result.reserve(full_size);
    for (auto sub_view : views)
        result.append(sub_view);
    return result;
}

There is a little bit of redundancy here - we don't really need to store string_views, just the length of the arguments. However, the overhead is negligible, and it makes the code clean and clear.

std::string_views store the length of the arguments. Because of that, appending them to a std::string can be faster than appending by char*s. Also, std::string_view uses std::char_traits for length calculation, that in some implementations can be calculated in compile-time for arguments known at compile-time. This optimization usually can't be performed for C calls like strlen.

Synaeresis answered 18/4, 2021 at 1:20 Comment(1)

Consider using const T& ...args instead of T ...args to avoid unnecessary copies of the argument in case an lvalue is passed. – Knotweed 22/12, 2021 at 16:19

#include <concepts>
#include <string>

template<class T>
concept string_like_t = requires(const T & str)
{
    {std::size(str)} -> std::same_as<size_t>;
    {*std::data(str)} -> std::convertible_to<std::remove_cvref_t<decltype(str[0])>>;
};
template<string_like_t T>
using char_t = std::remove_cvref_t<decltype(std::declval<T>()[0])>;

template<class Alloc, string_like_t First, string_like_t... Rest>
    requires (!string_like_t<Alloc>)
auto concat(const Alloc& alloc, const First& first, const Rest&... rest)
{
    std::basic_string<char_t<First>, std::char_traits<char_t<First>>, Alloc> result{ alloc };
    result.reserve(std::size(first) + (std::size(rest) + ...));
    result.append(std::data(first), std::size(first));
    (result.append(std::data(rest), std::size(rest)), ...);
    return result;
}

template<string_like_t First, string_like_t... Rest>
auto concat(const First& first, const Rest&... rest)
{
    typename std::basic_string<char_t<First>>::allocator_type alloc{};
    return concat(alloc, first, rest...);
}

#include <string_view>
#include <iostream>
#include <memory_resource>
int main()
{
    std::pmr::monotonic_buffer_resource mr { 1000 };
    std::pmr::polymorphic_allocator<char> alloc {&mr};
    std::string xxx = "xxxxxx";
    std::string_view yyy = "TEST";
    std::pmr::string zzz {", zzz", &mr};
    std::cout << concat(yyy, "123: ", "test", xxx, zzz) << std::endl;
    std::cout << concat(alloc, yyy, "123: ", "test", xxx, zzz) << std::endl;

    return 0;
}

Seems to be the most optimized C++20 version. Supports polymorphic allocators

Merrel answered 18/2, 2022 at 10:18 Comment(0)

There are some significant parameters, which has potential impact on deciding the "most optimized way". Some of these are - string/content size, number of operations, compiler optimization, etc.

In most of the cases, string::operator+= seems to be working best. However at times, on some compilers, it is also observed that ostringstream::operator<< works best [like - MingW g++ 3.2.3, 1.8 GHz single processor Dell PC]. When compiler context comes, then it is majorly the optimizations at compiler which would impact. Also to mention, that stringstreams are complex objects as compared to simple strings, and therefore adds to the overhead.

For more info - discussion, article.

Douse answered 19/5, 2017 at 3:41 Comment(0)

Hot tags

Godot Unity Godot Help Programming Godot 4.X GUI GDScript 3D 2D Physics CSharp Godot 3.X VR XR Projects C++

A usable solution (C++11, variadic templates)

Recommended topics

Hot tags