Storing in std::map/std::set vs sorting a vector after storing all data
  • Language: C++
  • One thing I can do is allocate a vector of size n, store all the data, and then sort it with std::sort(v.begin(), v.end()). Alternatively, I can keep putting the data into a map or set, which are ordered themselves, so I don't have to sort afterwards. But in that case inserting an element may be more costly due to rearrangements (I guess).

    So which is the optimal choice for minimal time over a wide range of n (number of objects)?

Revitalize answered 6/5, 2018 at 10:49 Comment(3)
This is a data structures question. An std::set or std::map is a balanced tree, which is sorted all the time; each insertion/deletion costs O(log n), so n insertions cost O(n log n) in total. In a vector, each insertion costs amortized O(1) (the buffer occasionally has to be reallocated), each deletion costs O(n), and sorting costs O(n log n) each time. If you need it sorted all the time, I'd say you should clearly use a tree. Otherwise, it may be just the same as using a vector and sorting it at the endCoagulant
You need to measure it. For example: which is faster to insert random ints into while keeping the container sorted: a map, a list, or a vector? Answer in 2 mins when I find the presentation.Astylar
Watch channel9.msdn.com/Events/Build/2014/2-661 from 45:50Astylar

It depends on the situation.

map and set are usually implemented as red-black trees; they have to do a fair amount of work to stay balanced, otherwise operations on them would become very slow. They also don't support random access. So if you only want to sort once, you shouldn't use them.

However, if you want to keep inserting elements into the container while maintaining order, map and set take O(log N) time per insertion, while a sorted vector takes O(N). The latter is much slower, so if you insert and delete frequently, you should use map or set.
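
To make this concrete, here is a minimal sketch of the two insertion strategies (the helper names insert_sorted and insert_tree are just made up for illustration):

    #include <algorithm>
    #include <set>
    #include <vector>

    // Sorted vector: O(log N) to find the position, O(N) to shift elements.
    void insert_sorted(std::vector<int>& v, int value) {
        auto pos = std::lower_bound(v.begin(), v.end(), value);
        v.insert(pos, value);  // shifts everything after pos
    }

    // Tree: O(log N) total, just rebalancing and pointer swaps.
    void insert_tree(std::set<int>& s, int value) {
        s.insert(value);
    }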

Crandale answered 6/5, 2018 at 11:0 Comment(4)
"The latter is much slower" not on my machine.Bubble
@Bubble How big is the data? Well, "the latter is much slower" only means the complexity function grows much faster.Crandale
True, but that doesn't mean one should use a set by default instead of a vector. And to the question "how big is the data?", the answer usually is "small enough". But of course, as for many other things, it depends.Bubble
@Bubble Yes, you're right, but I also pointed out that you should only use map or set when you operate on the data "frequently". :)Crandale

The difference between the two is noticeable!

Using a set, you get O(log(N)) complexity for each element you insert, so in total you get O(N log(N)), the complexity of a comparison-based sort such as heapsort.

Appending everything to a vector is amortized O(1) per element, and sorting it is O(N log(N)), guaranteed since C++11 (before that, std::sort only had O(N log(N)) complexity on average). Once sorted, you can use std::binary_search to get the same lookup complexity as a set.
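
The bulk-load pattern looks roughly like this (a sketch; the sizes and values are arbitrary):

    #include <algorithm>
    #include <cstdlib>
    #include <vector>

    int main() {
        std::vector<int> v;
        v.reserve(1000);                       // avoid reallocations up front
        for (int i = 0; i < 1000; ++i)
            v.push_back(std::rand() % 10000);  // amortized O(1) per element

        std::sort(v.begin(), v.end());         // O(N log N), paid once

        // O(log N) membership test on the sorted data.
        bool found = std::binary_search(v.begin(), v.end(), 42);
        return found ? 0 : 1;
    }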

The API of using a vector as a set isn't the friendliest, although it does give nice performance benefits. This of course is only useful when you can do a bulk insert of the data, or when the number of lookups is much larger than the number of modifications. There are also algorithms that can sort a partially sorted vector efficiently, for when you have to extend it later on. Finally, one has to remark that you don't get the same iterator-invalidation guarantees.

So, why are vectors better? Cache locality! A vector keeps all its data in a single memory block, so the processor can prefetch it, while for a set the memory is scattered around, requiring a pointer chase to find the next node. This makes a sorted vector a better set implementation than std::set for large data, when you can live with the limitations.

To give you an idea: in the codebase I'm working on, we have several set and map implementations based on vectors, each with its own restrictions on how it may be used. (For example: no erase, or no operator[].)

Chops answered 6/5, 2018 at 11:54 Comment(3)
Honestly, some of that went like a tangent goes past a circle :P But I think the main point is: although both give O(N log N) complexity on paper, the vector is better due to more organised data in memoryRevitalize
Sorry, anything I can improve? Though, it's a nice summary!Chops
No, an insertion sort is O(n^2).Tetrapod

Rules of thumb

As always it depends, both on the average size of your container and on what you do with it.

When it comes to iteration, nothing beats contiguous memory. So the less your data changes and the more you iterate over it, the better the vector-based solution will perform. Even look-ups should be fast if done with binary search, thanks to random-access iterators. (Note: std::binary_search() only tests whether a value exists; to access the value you need std::lower_bound() or std::upper_bound().)
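
A small sketch of that distinction, with arbitrary example values:

    #include <algorithm>
    #include <vector>

    int main() {
        const std::vector<int> v = {1, 3, 5, 7, 9};

        // std::binary_search() only answers "is the value there?".
        bool exists = std::binary_search(v.begin(), v.end(), 5);  // true

        // std::lower_bound() returns an iterator, so the element itself
        // (or the insertion point for a missing value) is accessible.
        auto it = std::lower_bound(v.begin(), v.end(), 5);
        bool found = (it != v.end() && *it == 5);                 // true

        return (exists && found) ? 0 : 1;
    }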

However, the more often you change your data, the better the associative containers in the std::lib will perform. On insertion or removal they need to do some work to figure out how to rebalance their internal tree structure, but that amounts to swapping a few pointers. A vector, on the other hand, needs to move (or copy) on average half of its data in such a situation.

If in doubt, profile!

Flat associative containers

If you tend towards the vector-based solution, you could give flat associative containers a try. They offer a more natural and convenient interface than a raw vector. The most notable implementations are:

  • Boost offers boost::container::flat_set and its siblings. Internally they use std::vector (or a similar data structure) and wrap it with most of the interface known from std::set etc. For obvious reasons they do not support splicing; instead, they offer members known from std::vector such as reserve(), capacity(), shrink_to_fit() and max_size(). And there are functions to extract the internal sequence or to adopt both ordered and unordered sequences (the former allows populating a vector, sorting it once, adopting it, and then accessing the data through a convenient interface; a sketch of this pattern follows below).
  • Similarly, C++23 brings us std::flat_set and the like, which are very similar to Boost's flat containers. A major difference is that they lack reserve(), which seems like a big oversight to me. But one can probably work around this by extracting the underlying sequence, calling reserve() on it, and then adopting it again.
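
A minimal sketch of the populate-sort-adopt pattern with Boost's flat containers (assuming Boost.Container is available; the values are arbitrary):

    #include <algorithm>
    #include <boost/container/flat_set.hpp>
    #include <boost/container/vector.hpp>

    int main() {
        // Populate a plain vector first (cheap, unordered appends)...
        boost::container::vector<int> data = {5, 1, 9, 3, 7};
        std::sort(data.begin(), data.end());  // ...sort it once...

        // ...then let the flat_set adopt it: no copy, no re-sorting.
        boost::container::flat_set<int> s;
        s.adopt_sequence(boost::container::ordered_unique_range,
                         std::move(data));

        // Set-like lookup on contiguous storage.
        bool found = (s.find(7) != s.end());
        return found ? 0 : 1;
    }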

Contiguous pointers

One last contestant I'd like to throw into the ring is std::vector<std::unique_ptr<T>> (or, equivalently, boost::container::stable_vector). This has less cache locality than std::vector<T>, which makes iteration not as fast (the pointers are contiguous and the pointees should get prefetched, so it is still better than std::set). But if T is rather big, then insertion and deletion become faster, as only pointers need to be moved instead of entire instances.
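
A sketch of this layout (Big is a hypothetical large type made up for the example):

    #include <algorithm>
    #include <memory>
    #include <vector>

    // Hypothetical large element type; moving whole instances would be costly.
    struct Big {
        int key = 0;
        char payload[256] = {};
    };

    int main() {
        std::vector<std::unique_ptr<Big>> v;
        for (int k : {3, 1, 2}) {
            auto p = std::make_unique<Big>();
            p->key = k;
            v.push_back(std::move(p));  // only a pointer moves into the vector
        }

        // Sorting shuffles pointers, not the big objects themselves.
        std::sort(v.begin(), v.end(),
                  [](const auto& a, const auto& b) { return a->key < b->key; });
        return 0;
    }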

Heavyhanded answered 25/4 at 20:31 Comment(0)
