Does Iterator::collect allocate the same amount of memory as String::with_capacity?
Asked Answered
I

1

6

In C++ when joining a bunch of strings (where each element's size is known roughly), it's common to pre-allocate memory to avoid multiple re-allocations and moves:

std::vector<std::string> words;
constexpr size_t APPROX_SIZE = 20;

std::string phrase;
phrase.reserve((words.size() + 5) * APPROX_SIZE);  // <-- avoid multiple allocations
for (const auto &w : words)
  phrase.append(w);

Similarly, I did this in Rust (this chunk needs the unicode-segmentation crate)

fn reverse(input: &str) -> String {
    let mut result = String::with_capacity(input.len());
    for gc in input.graphemes(true /*extended*/).rev() {
        result.push_str(gc)
    }
    result
}

I was told that the idiomatic way of doing it is a single expression

fn reverse(input: &str) -> String {
  input
      .graphemes(true /*extended*/)
      .rev()
      .collect::<Vec<&str>>()
      .concat()
}

While I really like it and want to use it, from a memory allocation point of view, would the former allocate less chunks than the latter?

I disassembled this with cargo rustc --release -- --emit asm -C "llvm-args=-x86-asm-syntax=intel" but it doesn't have source code interspersed, so I'm at a loss.

Incident answered 29/10, 2019 at 16:6 Comment(8)
the "single expression" form should probably be a fold and not use a collectOutlive
Iterator implementation for Graphemes has size_hint(), which is being used by String for buffer size estimation in its FromIterator implementation, so I don't think there will be huge overhead due to use of collect().Anarchism
@DenysSéguret You mean like .fold(String::with_capacity(input.len()), |result, gc| result + gc) instead of .collect::<Vec<&str>>().concat()?Incident
@DanilaKiver Thanks for commenting about size_hint; didn't know about it. Would the number of memory allocation requests/calls be one like in the first approach? I think for every grapheme-cluster there'll be one allocation due to the corresponding Vec::push and then a final allocation for concat. The reason I ask isn't specific to this toy example, I'm trying to understand how the second approach works. Knowing it will be helpful in a larger project.Incident
@Incident yes, in order to avoid collecting into an intermediate vector. But at this point I'd need to do some benchmarks before I'd go as far as recommending it.Outlive
@legends2k, after re-reading size_hint() implementation I realized that it uses 1 as the lower bound, and the code which reserves the space based on the hint relies on the lower bound too (both for String and Vec), so it feels like there actually will be problems with excessive allocations with this particular type (Graphemes).Anarchism
Or the single expression form could have .collect::<String>() instead of .collect::<Vec<&str>>().concat()Fbi
@SomeGuy Or simply .collect() like in trentcl's answer.Incident
A
9

Your original code is fine and I do not recommend changing it.

The original version allocates once: inside String::with_capacity.

The second version allocates at least twice: first, it creates a Vec<&str> and grows it by pushing &strs onto it. Then, it counts the total size of all the &strs and creates a new String with the correct size. (The code for this is in the join_generic_copy method in str.rs.) This is bad for several reasons:

  1. It allocates unnecessarily, obviously.
  2. Grapheme clusters can be arbitrarily large, so the intermediate Vec can't be usefully sized in advance -- it just starts at size 1 and grows from there.
  3. For typical strings, it allocates way more space than would actually be needed just to store the end result, because &str is usually 16 bytes in size while a UTF-8 grapheme cluster is typically much less than that.
  4. It wastes time iterating over the intermediate Vec to get the final size where you could just take it from the original &str.

On top of all this, I wouldn't even consider this version idiomatic, because it collects into a temporary Vec in order to iterate over it, instead of just collecting the original iterator, as you had in an earlier version of your answer. This version fixes problem #3 and makes #4 irrelevant but doesn't satisfactorily address #2:

input.graphemes(true).rev().collect()

collect uses FromIterator for String, which will try to use the lower bound of the size_hint from the Iterator implementation for Graphemes. However, as I mentioned earlier, extended grapheme clusters can be arbitrarily long, so the lower bound can't be any greater than 1. Worse, &strs may be empty, so FromIterator<&str> for String doesn't know anything about the size of the result in bytes. This code just creates an empty String and calls push_str on it repeatedly.

Which, to be clear, is not bad! String has a growth strategy that guarantees amortized O(1) insertion, so if you have mostly tiny strings that won't need to be reallocated often, or you don't believe the cost of allocation is a bottleneck, using collect::<String>() here may be justified if you find it more readable and easier to reason about.

Let's go back to your original code.

let mut result = String::with_capacity(input.len());
for gc in input.graphemes(true).rev() {
    result.push_str(gc);
}

This is idiomatic. collect is also idiomatic, but all collect does is basically the above, with a less accurate initial capacity. Since collect doesn't do what you want, it's not unidiomatic to write the code yourself.

There is a slightly more concise, iterator-y version that still makes only one allocation. Use the extend method, which is part of Extend<&str> for String:

fn reverse(input: &str) -> String {
    let mut result = String::with_capacity(input.len());
    result.extend(input.graphemes(true).rev());
    result
}

I have a vague feeling that extend is nicer, but both of these are perfectly idiomatic ways of writing the same code. You should not rewrite it to use collect, unless you feel that expresses the intent better and you don't care about the extra allocation.

Related

Ataractic answered 29/10, 2019 at 17:4 Comment(3)
I'm a big fan of the extend version, and that's what I'd pick.Diminutive
I think what I like about extend is that it fronts both the verb and the object being affected. You don't have to understand the whole line to see what side effect it has: it extends result. Using collect or fold fronts input, so you see what's being iterated over first, then the form of iteration, and the "most important" bit -- allocating and filling the string -- is "buried" at the very end of the expression.Ataractic
I would note that if the function is used multiple times, such as in a loop, it may be beneficial to provide a version which takes a &mut String out parameter. This may, in certain circumstances, allow reusing the buffer over and over.Cannabin

© 2022 - 2024 — McMap. All rights reserved.