Implementing a "string pool" that is guaranteed not to move
Asked Answered
G

3

9

I need a "string pool" object into which I can repeatedly insert a "sequence of chars" (I use this phrase to mean "string" without confusing it with std::string or a C string), obtain a pointer to the sequence, and be guaranteed that the pointer will not become invalidated if/when the pool needs to grow. Using a simple std::string as the pool won't work, because of the possibility for the string to be reallocated when it outgrows its initial capacity, thus invalidating all previous pointers into it.

The pool will not grow without bound -- there are well-defined points at which I will call a clear() method on it -- but I don't want to reserve any maximum capacity on it, either. It should be able to grow, without moving.

One possibility I'm considering is inserting each new sequence of chars into a forward_list<string> and obtaining begin()->c_str(). Another is inserting into an unordered_set<string>, but I'm having a hard time finding out what happens when an unordered_set has to grow. The third possibility I'm considering (less enthusiastically) is rolling my own chain of 1K buffers into which I concatenate the sequence of chars. That has the advantage (I guess) of having the highest performance, which is a requirement for this project.

I'd be interested in hearing how others would recommend approaching this.

UPDATE 1: edited to clarify my use of the phrase "sequence of chars" to be equivalent to the general notion of a "string" without implying either std::string or null-terminated char array.

Guffey answered 5/1, 2014 at 20:0 Comment(14)
Add one extra level of indirection, like node-based containers do?Scrivings
@KerrekSB: To an unordered_set, I assume you mean. (I believe that, once allocated, a forward_list node won't move.)Guffey
Any node-based container has the property that container elements don't move, so any of those should satisfy your requirements. The price is the extra indirection or cost of traversal.Scrivings
I'm wondering how the following applies to your statement that "container elements don't move" .. kera.name/articles/2011/06/iterator-invalidation-rules-c0x .. would, e.g., my_unordered_set.insert("foo") return an iterator that would remain valid even after "rehashing"?Guffey
Read more carefully: Iterators are invalidated, but references aren't.Scrivings
Doesn't a simple std::deque satisfy the requirements? (If you only add to the beginning or end)Azazel
@KerrekSB, I'm not going to be keeping references. I'm going to be keeping a pointer, or an iterator, to the sequence of chars. So I'm questioning your assertion that "any node-based container has the property that container elements don't move, so any of those should satisfy your requirements", where my requirement is to keep pointers. (I can see that I'd be safe storing pointers in the container, at the cost of the indirection, but I'm not convinced I can use my_unordered_set.insert("foo") and expect the returned iterator to remain valid over time.) Maybe I'm missing your point.Guffey
Evaluating a reference gives an lvalue, so you can take its address. The validity of the reference implies the validity of the address thus obtained. Something like T * p = &*it, etc.Scrivings
@DyP - I believe you're correct - I didn't realize that deque guaranteed never to invalidate iterators.Guffey
@KerrekSB: forgive my confusion, but suppose it points to an entry in a vector: are you saying that p remains valid even after appending thousands more entries onto the vector? (Presuming that at some point the vector needs to be grown)Guffey
No, a vector isn't a node-based container. Node-based containers are std::list and all the associative standard library containers. Incidentally, std::deque also preserves references, but not iterators, for insertions at the ends.Scrivings
@Guffey A deque is not guaranteed to never invalidate iterators. It preserves pointers and references, but invalidates iterators.Whitehurst
@jalf: And only for operations at the ends, that is. Insertions or deletions in the middle invalidate everything.Scrivings
I will return to this question as soon as I understand the distinction between pointers, iterators, and references in this context. At the moment I don't understand how some can remain valid while others become invalid.Guffey
M
12

I've used this approach in the past:

using Atom = const char*;

Atom make_atom(string const& value)
{
    static set<string> interned;
    return interned.insert(value).first->c_str();
}

Obviously, if you want/need to clear the set, you'd make it available in some wider scope.

For even more efficiency move/emplace the strings into the set.

Update I've added this approach for completeness. See it Live on Coliru

#include <string>
#include <set>
using namespace std;

using Atom = const char*;

template <typename... Args>
typename enable_if<
    is_constructible<string, Args...>::value, Atom
>::type emplace_atom(Args&&... args)
{
    static set<string> interned;
    return interned.emplace(forward<Args>(args)...).first->c_str();
}

#include <iostream>

int main() {
    cout << emplace_atom("Hello World\n");
    cout << emplace_atom(80, '=');
}
Miry answered 6/1, 2014 at 23:24 Comment(8)
This may be the simplest way to go: as I read The C++ Standard Library by Josuttis, insert (or emplace) won't invalidate pointers. I think I could also use unordered_set -- although since the only lookups are done once, at insertion time, I don't imagine the difference between O(1) and O(log(n)) would matter much. And, not that it matters much either, I like that a set consolidates equal values. Could you confirm that unordered_set would also work here?Guffey
It does. In fact it can be faster for larger n depending on the payload and the hash algorithm usedMiry
You can't move a const& string - the caller still needs it. The current setup is good enough: it will make a copy if and only if a actual insertion is done.Plutonic
@Plutonic Erm. That's rather obvious. I don't believe my code shows otherwise? I just mentioned it. (You could make the make_atom function variadic and accepting by universal reference. In which case I'd think about renaming it emplace_atom(Args&&...))Miry
Oh, I interpreted that as meaning interned.emplace(value) or interned.insert(std::move(value)), i.e. not affecting the signature. Sorry.Plutonic
@Chap: Associative containers have the property that "The insert and emplace members shall not affect the validity of iterators and references to the container, and the erase members shall invalidate only iterators and references to the erased elements" (from the C++11 standard). unordered_set is an associative container.Faints
Great stuff. I've no idea what the enable_if is about, but I'll work with it.Guffey
I'd suggest the first version unless you have a need for the other (it will be fine. Consider paste.ubuntu.com/6719541 to magically trade the unneeded copy for an extra move if you call make_atom on a temporary. (By the way, the enable_if is somewhat redundant in that it checks that the initializer arguments can be used to construct a string from. If you leave it out, the compiler will still bark trying to compile the emplace call, but the error is potentially more confusing)Miry
P
1

Yes, you're going to have to write a list of buffers. No, don't do all the hard work yourself.

The underlying datastructure should be a std::vector<std::string>. Using a (forward) list doesn't buy you a whole lot. When the vector is resized, the strings are moved efficiently. std::forward_list<std::string>. Even if the list is resized, the strings themselves remain in place. Iterating the list is only needed for a .clear so the list performance is not critical.

The wrapper class should abstract away the addition of new strings. A new string should be added when the capacity of the last string isn't enough to add the new string. When you add a new string, reserve all the memory a chunk will need - this ensures the capacity will be large enough to prevent reallocations later on.

This setup may waste some space when a large new allocation forces the use of a new chunk, leaving part of an older chunk unused. You could of course remember the size remaining in the last N blocks, for a small value of N such that those chunks might still be in cache. But it's quite possible that in your app N=5 would already be too big.

Plutonic answered 6/1, 2014 at 23:15 Comment(3)
You say "when the vector is resized, the strings are moved efficiently." What happens to the char*'s into those strings, that are required to remain valid?Guffey
Good question. Formally, no. I believe they will remain valid on efficient implementations (copying the strings takes time).Plutonic
As long as we're selecting containers for stability of reference/iterators, you could look at Boost's stable_vector as well.Miry
A
0

Recapping, your requirements are:

  • Be able to push elements
  • Be able to obtain an iterator to the beginning of the sequence
  • Iterators should not get invalidated if the sequence grows
  • Be able to clear the sequence
  • Don't reserve maximum capacity

It seems that std::list<char> fits perfectly into this list of requirements. Of course you might need a wrapper around the class to make it behave exactly like std::string, but that really depends on how you manipulate the data.

And here's how well it fits the requirements:

  • To push elements, you can use the push_back and emplace_back member functions.

  • std::begin(container) or the member function begin will retrieve the iterator to the first element of the sequence.

  • Addition, removal and moving the elements within the list or across several lists does not invalidate the iterators. An iterator is invalidated only when the corresponding element is deleted.

  • To clear the sequence you can use the member function clear.

  • Most of the time it is implemented as a doubly-linked list, therefore no capacity is reserved.

Since std::list seems memory inefficient (even though the standard doesn't specify the size of it nor its implementation), it's correct to add that you can also use std::deque<char> with almost the same interface as above. The only difference being that std::deque might reserve unused memory.

Alphonse answered 5/1, 2014 at 20:53 Comment(7)
Isn't a std::list<char> rather memory-inefficient? One/two pointers per byte of data?Azazel
The "elements" are const sequences of chars - c-strings, without the null terminator (the client who adds this "element" knows its length). At the time the element is added to the pool, we want to obtain an iterator to the first char of the element. It is the pool that will grow, not the individual elements. I'm not sure we're using the words "element" and "sequence" in the same way... Otherwise, I think you've recapped the requirements correctly.Guffey
@Chap: it sounds that you want to store a char[N] then. Is N the same for all strings? (How else would the client know the length of a random string?)Plutonic
@Plutonic yes, char[N]. N can vary: the client tells the object to store a string, whose length he knows, and gets back a char*. From then on it's the client's responsibility to remember both the pointer and the length.Guffey
@DyP You can avoid one pointer if you use a std::forward_list (it's not guaranteed though), but then I'm not sure you can efficiently find the first char and push to the end of the string. If you have those kind of requirements, I'm not sure you can avoid this space cost. Also pre-optimization is evil, have you actually profiled the OP's application and found out that it is consuming too much memory?Alphonse
std::list<char> won't do it, and I think there's confusion about the requirements. If I add "Foo" to the pool, I want to receive a pointer to the contiguous characters 'F', 'o', and 'o'. A simple const char[].Guffey
@Chap, Then when you add the new characters just register the iterator to the first character. Otherwise use std::list<std::string> and get the c-string via the c_str member function.Alphonse

© 2022 - 2024 — McMap. All rights reserved.