set vs unordered_set for fastest iteration

In my application I have the following requirements:

  1. The data structure will be populated just once with some values (not key/value pairs). The values may be repeated but I want the data structure to store them once only.

  2. I will be iterating 100s of times through all the elements of the data structure created above. The order in which the elements appear in the iteration is immaterial.

Constraint 1 suggests that I will have to use either set or unordered_set, as the data is not in the form of key/value pairs.

Now, set insertion is costlier than unordered_set insertion, but the data structure is populated only once, at the beginning of my program.

I believe the deciding factor will be how fast I can iterate through all the elements of the data structure. I am not sure whether set or unordered_set will be faster for this purpose. I believe the standard makes no mention of this, as the operation is O(n) for either data structure. But I wonder for which data structure advancing the iterator (operator++) will be faster.
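
For what it's worth, a minimal sketch of how one could measure this directly (the element type, count, and output format are arbitrary choices of mine, not part of the question):

#include <chrono>
#include <iostream>
#include <set>
#include <unordered_set>

// Time one full pass over a container, in microseconds.
template <class Container>
long long time_pass(const Container& c)
{
    auto t0 = std::chrono::steady_clock::now();
    long long sum = 0;
    for (int v : c)
        sum += v;
    auto t1 = std::chrono::steady_clock::now();
    std::cout << "(sum " << sum << ") "; // consume the result
    return std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0).count();
}

int main()
{
    std::set<int> s;
    std::unordered_set<int> u;
    for (int i = 0; i < 1000000; ++i)
    {
        s.insert(i);
        u.insert(i);
    }
    std::cout << "set: " << time_pass(s) << " us, "
              << "unordered_set: " << time_pass(u) << " us\n";
}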

Cedar answered 1/7, 2014 at 9:27 Comment(19)
So, measure it and see. – Marsipobranch
How about a vector? – Gutenberg
Of course I can do it, but I am wondering if someone has some insight on this which will be generally applicable regardless of GCC version. – Cedar
You can use different structures for different purposes. For example, build a set, then once finished copy/move everything into a vector for fast iteration. Thus the question is: do you need anything other than iteration once it is frozen (like fast look-up)? – Edee
@MatthieuM.: Then you can just as well create the data in the vector right away... – Gutenberg
@KerrekSB I initially want to make sure that multiple copies of the data values do not enter the data structure. In the case of an array, testing this for each insertion will make the data-structure creation process costly. – Cedar
Insert into a vector, sort it and remove duplicates? – Dibucaine
@AviralGoel: You think the set magically doesn't have to pay that cost? – Gutenberg
@JoachimPileborg: std::lower_bound :-S – Gutenberg
@KerrekSB: then you either risk significantly higher peak memory usage if repeats are very common and you eliminate them after all insertions, or slow things down to O(N²) during insertion. Of course memory usage could actually be less if duplicates are uncommon and the memory overheads for the tree are dominant... – Derma
Actually, this question is unclear: it depends on whether the data type 1) defines a strict ordering, 2) defines equality, 3) ordering and equality are compatible. Not all options may be available depending on those conditions. – Gutenberg
@KerrekSB: I've edited my comment above to clarify that. – Derma
@KerrekSB the data values are pointers. I believe pointers define a strict ordering and equality. – Cedar
@AviralGoel: Pointers are in fact not comparable in general, unless you use std::less<T*>... – Gutenberg
@TonyD: The theoretical cost is a massive red herring here, since you get dramatic advantages from the vector's memory locality. So you really have to measure. Inserting into the middle of a vector of pointers is actually not terribly slow. – Gutenberg
@KerrekSB: I would point out that the OP was not concerned about insertion performance. – Edee
@KerrekSB: for a few hundred or thousand elements, true, but the question asks about big-O efficiency, which implies far, far more. – Derma
If populated just once, a sorted vector will be much faster than set in both iterating and finding. – Elery
@TNA: Yes, it might be slower than unordered_set in finding, though (depending on the number of elements), because it requires log(N) cache lines to be loaded, whereas a hash table loads a fixed number of cache lines (for a given load factor). – Edee

There are several approaches.

  1. The comments to your question suggest keeping a std::unordered_set, which has the fastest O(1) lookup/insertion and O(N) iteration (as does every container). If your data changes a lot, or requires a lot of random lookups, this is probably the fastest. But test.
  2. If you need to iterate 100s of times without intermediate insertions, you can do a single O(N) copy to a std::vector and benefit from its contiguous memory layout on every one of those 100s of passes (see the sketch after this list). Test whether this is faster than a plain std::unordered_set.
  3. If you have a small number of intermediate insertions between iterations, it could pay to use a dedicated vector. If you can use Boost.Container, try boost::flat_set, which offers a std::set interface with a std::vector storage back-end (i.e. a contiguous memory layout that is very cache- and prefetch-friendly). Again, test whether this gives a speed-up over the other two solutions.
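
A minimal sketch of option 2 (the int elements and the concrete values are illustrative, not from the question):

#include <iostream>
#include <unordered_set>
#include <vector>

int main()
{
    // Populated just once; duplicates are discarded automatically.
    std::unordered_set<int> unique_values{3, 1, 4, 1, 5, 9, 2, 6, 5};

    // A single O(N) copy into contiguous storage.
    std::vector<int> values(unique_values.begin(), unique_values.end());

    // The 100s of read-only passes then run over cache-friendly memory.
    long long sum = 0;
    for (int pass = 0; pass < 100; ++pass)
        for (int v : values)
            sum += v;

    std::cout << sum << '\n';
}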

For the last solution, see the Boost documentation for some of the tradeoffs (it's good to be aware of all the other issues like iterator invalidation, move semantics and exception safety as well):

Boost.Container flat_[multi]map/set containers are ordered-vector based associative containers based on Austern's and Alexandrescu's guidelines. These ordered vector containers have also benefited recently with the addition of move semantics to C++, speeding up insertion and erasure times considerably. Flat associative containers have the following attributes:

  • Faster lookup than standard associative containers
  • Much faster iteration than standard associative containers
  • Less memory consumption for small objects (and for big objects if shrink_to_fit is used)
  • Improved cache performance (data is stored in contiguous memory)
  • Non-stable iterators (iterators are invalidated when inserting and erasing elements)
  • Non-copyable and non-movable value types can't be stored
  • Weaker exception safety than standard associative containers (copy/move constructors can throw when shifting values in erasures and insertions)
  • Slower insertion and erasure than standard associative containers (especially for non-movable types)

NOTE: by "faster lookup" the documentation means that a flat_set does an O(log N) binary search on contiguous memory rather than the O(log N) pointer chasing of a regular std::set. Of course, a std::unordered_set does O(1) lookup, which will be faster for large N.
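
For illustration, a minimal boost::flat_set usage sketch (assuming Boost.Container is available; the values are arbitrary):

#include <boost/container/flat_set.hpp>
#include <iostream>

int main()
{
    boost::container::flat_set<int> s;
    s.reserve(3);  // possible because the backing storage is a vector
    s.insert(7);
    s.insert(2);
    s.insert(7);   // duplicate: ignored

    // Iteration walks sorted, contiguous memory; prints: 2 7
    for (int v : s)
        std::cout << v << ' ';
}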

Undersheriff answered 1/7, 2014 at 9:38 Comment(7)
Faster lookup than hash containers? I doubt that. – Hawthorne
@Hawthorne they mean faster than std::set, which does O(log N) pointer chasing rather than an O(log N) binary search on contiguous memory. – Undersheriff
Well, then you should add a clarifying comment to that effect. – Hawthorne
Seems a bit of overkill, when a simple std::vector is sufficient. – Unruh
@Unruh could be, but it is a "drop-in" replacement for std::set, so that you don't have to change your calling code. – Undersheriff
This is only suitable for relatively small N; otherwise Matthieu M.'s suggestion to copy from a set or unordered_set to a vector is sensible. – Derma
@TonyD thanks, updated to reflect the different use cases. – Undersheriff

I'd suggest using either set or unordered_set for "filtration", and when you are done, moving the data to a vector of fixed size.
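
A brief sketch of that idea (the element type and values are illustrative); note that filtering through a std::set leaves the frozen vector sorted as a bonus:

#include <set>
#include <vector>

int main()
{
    std::set<int> filter{5, 3, 5, 1};  // duplicates dropped on insertion

    // Freeze into a fixed-size contiguous container for the iteration phase.
    const std::vector<int> frozen(filter.begin(), filter.end());  // holds 1 3 5
}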

Lavelle answered 1/7, 2014 at 9:30 Comment(2)
unordered_set is a bad option for "filtration" compared to set: testing whether a single element is already there is linear time in the worst case. – Kletter
@Kletter on average, unordered_set has constant lookup complexity, compared to the logarithmic complexity of set. – Unruh

If building the data structure does not factor into your performance concerns (or only marginally), consider saving your data into a std::vector: nothing beats it for iteration.

To speed up the initial building of the data structure, you might first insert into a std::unordered_set, or at least use one to check existence before insertion.

In the second case, the set need not contain the elements themselves, but could contain e.g. indices into the vector.

std::vector<T> v;

// Hash and compare indices by looking at the vector element they refer to,
// so the set can reject duplicates without storing the elements themselves.
auto h = [&v](size_t i){ return std::hash<T>()(v[i]); };
auto c = [&v](size_t a, size_t b){ return v[a] == v[b]; };
std::unordered_set<size_t, decltype(h), decltype(c)> tester(0, h, c);
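
A hedged sketch of how that tester could be used during population (the append-then-roll-back pattern and the std::string element type are my illustration, not from the answer):

#include <cstddef>
#include <string>
#include <unordered_set>
#include <vector>

int main()
{
    using T = std::string;

    std::vector<T> v;
    auto h = [&v](std::size_t i){ return std::hash<T>()(v[i]); };
    auto c = [&v](std::size_t a, std::size_t b){ return v[a] == v[b]; };
    std::unordered_set<std::size_t, decltype(h), decltype(c)> tester(0, h, c);

    for (const char* s : {"a", "b", "a"})
    {
        v.push_back(s);                          // tentatively append
        if (!tester.insert(v.size() - 1).second) // index of the new element
            v.pop_back();                        // duplicate: roll back
    }
    // v now holds "a", "b": insertion order kept, duplicates gone.
}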
Hawthorne answered 1/7, 2014 at 9:34 Comment(0)

An unordered_set uses a hash table to provide near-O(1) search time. It does this by hashing the key to compute the offset of the element you are seeking from the beginning of the data store. Unless your key space is tiny (e.g. chars), different keys may end up with the same hash (a collision).

To minimize collisions, the hash table has to keep its data store fairly sparsely populated. This means that finding a key will usually be O(1) (unless there is a collision).

However, when iterating through the hash table, the iterator will encounter a lot of unused space in the data store, which slows down finding the next element. Adjacent elements could be linked with extra pointers, but I do not think an unordered_set does so.

In light of the above, I suggest you use a sorted vector for your "set". Using bisection you can search the store in O(log n) time, and iterating through it is trivial. A vector has the added advantage that its memory is contiguous, so you are less likely to experience cache misses.
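
A small sketch of that suggestion (the contents are made up):

#include <algorithm>
#include <vector>

int main()
{
    // Sorted, duplicate-free "set" in contiguous storage.
    const std::vector<int> store{1, 3, 5, 9};

    // O(log n) membership test by bisection.
    bool found = std::binary_search(store.begin(), store.end(), 5);

    // Iteration is a plain linear walk over contiguous memory.
    long long sum = 0;
    for (int v : store)
        sum += v;

    return found && sum == 18 ? 0 : 1;
}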

Bushy answered 1/7, 2014 at 10:46 Comment(0)

I highly recommend that you not use set or unordered_set in this case. set is a binary tree and unordered_set is a hash table; they use a lot of memory, have slow iteration speed, and have bad locality of reference. If you had to insert/remove/find data frequently, set or unordered_set would be a good choice, but here you just need to read, store, and sort the data once, and then use it many times.

In this case, a sorted vector can be a good choice. vector is a dynamic array, so it has low overhead.

See the code:

#include <algorithm>
#include <iostream>
#include <vector>

std::vector<int> data;

int input;
for (int i = 0; i < 10; i++)
{
    std::cin >> input;
    data.push_back(input); // store data
}

std::sort(data.begin(), data.end()); // sort data

That's all. All your data is ready.

If you need to remove duplicates (as a set would), just use the unique/erase idiom after sorting.

data.erase(std::unique(data.begin(), data.end()), data.end());

Note that you should use lower_bound, upper_bound and equal_range rather than find or find_if, to take advantage of the sorted data.
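
For instance, a membership test on the sorted data vector from the code above could look like this (the value 42 is arbitrary):

auto it = std::lower_bound(data.begin(), data.end(), 42); // O(log n) bisection
bool present = (it != data.end() && *it == 42);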

Lindgren answered 1/7, 2014 at 13:17 Comment(1)
Note that this implementation has worse space complexity due to the storage of duplicates, so it might not be viable in the case of a relatively large number of duplicates. – Carburize
