C++ - String capacity pattern
Asked Answered
D

1

5

I noticed that the string capacities in C++ follow this pattern:

  • initial string size is 15
  • for any string that is larger than a particular size 'block', the capacity is doubled.

Here are the string capacities for strings up to length 500:

15
30
60
120
240
480
960

Capacities were found with the following C++ program:

#include <iostream>
#include <vector>

using namespace std;

string getstr(int len) { 
    string s = "";
    for (int i=0; i<len; i++) {
        s.append("1");
    }
    return s;
}

int main() {
    vector<int> capacities;
    int prevcap;
    for (int i=0; i<500; i++) {
        int cap = getstr(i).capacity();
        if (cap > prevcap) {
            capacities.push_back(cap);
            prevcap = cap;
        }
    }
    for (int i : capacities) {
        cout << i << endl;
    }
}

What is the logic behind choosing this algorithm? Do the numbers (here 15 and 2) have any significance, or have they been randomly chosen? Also, does this algorithm vary from compiler to compiler? (This was compiled and tested with g++ 5.4.0 on Ubuntu 16.04) Any insights are appreciated.

Decile answered 29/9, 2020 at 5:8 Comment(7)
This is implementation dependant. On my compiler (clang 12.0.0) I am getting : 22 47 95 191 383 767. I don't think there is any specific pattern here. gcc seems to be just doubling the capacity each time from your observations.Wyndham
Side note: There should be a #include <string> in there.Alphitomancy
My guesses would be that initial 15 comes from Small String Optimization (SSO) and doubling is used since it is cheap operation (just bitshift). I am not library designer thoughBernardabernardi
akrzemi1.wordpress.com/2014/04/14/common-optimizationsIleus
It ensures amortized O(1) push_backDareen
There isn't any magic.. Compilers have leeway in how they implement any growth algorithm, but if you look it is basically allocate for some amount of characters and double when you need more. It is a fair balance between growth and the number of allocations required.Pierian
The basic idea is to minimize calls to malloc/new for each new element, but to grow the string with increasing chunks of memory to lower the allocation cost. When you know the final size of the string in advance, you can call reserve() to limit the allocation callsDendrite
A
10

Doubling is a well known method. It amortizes the cost of reallocation making push_back a constant time operation (as it is required to be). The 'obvious' alternative of adding a fixed size would make push_back a linear time operation. Other patterns are possible though, any multiplicative increase would work theoretically, and I once read an article advocating that each increased capacity should be taken from the next term in a Fibonacci sequence.

I imagine the initial size of 15 is chosen with short string optimization (SSO) in mind. With SSO the string data is stored in the string object itself instead of in separately allocated memory. I imagine 15 is the largest short string that can be accommodated in this particular implementation. Knowing what sizeof(std::string) is might shed some light on this.

Alembic answered 29/9, 2020 at 5:18 Comment(2)
"I once read an article advocating that each increased capacity should be taken from the next term in a Fibonacci sequence" - alas no C++ standard library vendor has listened to this excellent advice.Josephson
@Josephson I don't recall the reasons for the advice exactly, something to do with the 'holes' left in the heap when memory is freed, but I do remember thinking that the authors had a rather idealised view of how memory is used in typical programs. I'm not aware of any real world testing.Alembic

© 2022 - 2024 — McMap. All rights reserved.