vector vs map performance confusion

edit: I am specifically comparing std::vector's linear search operations to std::map's binary search operations, because that is what Herb's claim seemed to relate to. I know that using a binary search would move the performance from O(N) to O(log N), but that would not be testing Herb's claim.

Bjarne Stroustrup and Herb Sutter have both recently talked about how awesome std::vector is in situations where one would expect std::list to be used, owing to the cost of cache misses during linked-list traversal (see http://channel9.msdn.com/Events/Build/2014/2-661 at the 48 minute mark).

Herb made a further statement, however, that operations on an ordered vector are even faster than on std::map (see https://i.sstatic.net/FDjpi.png, taken from the 51:30 mark of the above channel9 video), which I found difficult to fathom. So I created a small test to demonstrate this and had difficulty reproducing these results: https://ideone.com/MN7DYK

This is the test code:

#include <iostream>
#include <string>
#include <vector>

// insert(), remove(), and the RAII 'timer' class are defined elsewhere in the
// full program (see the ideone link above); 'timer' presumably prints the
// elapsed time when it goes out of scope, producing the output below.
template <typename C>
void test(std::string name, std::vector<int> shuffledInputValues, C & c)
{
   // fill container 'c' with values from 'shuffledInputValues', then erase them all
   {
      std::cout << "testing " << name << "..." << std::endl;
      timer t;

      for (auto val : shuffledInputValues) insert(c, val);
      for (auto val : shuffledInputValues) remove(c, val);
   }
}
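
For context, a plausible shape for those helpers, given the constraint that the vector must use only linear operations (these are my guesses, not the actual definitions from the ideone link):

#include <algorithm>
#include <set>
#include <vector>

// keep the vector sorted, finding the insertion point with a linear scan
void insert(std::vector<int> & v, int val)
{
   auto pos = std::find_if(v.begin(), v.end(),
                           [val](int x) { return x >= val; });
   v.insert(pos, val);
}

// linear search for the value, then erase it
void remove(std::vector<int> & v, int val)
{
   v.erase(std::find(v.begin(), v.end(), val));
}

// the set just uses its own O(log N) tree operations
void insert(std::set<int> & s, int val) { s.insert(val); }
void remove(std::set<int> & s, int val) { s.erase(val); }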

// output:
// testing vector...99.189ms
// testing deque...120.848ms
// testing set...4.077ms

Notice how the std::vector performs an order of magnitude slower than std::set. Of course this is the result I expected, but I am confused about the claim that Herb was trying to make.

What am I doing wrong? Or am I misunderstanding Herb's claim?

Notes on my test app:

  • it must use linear operations - the whole point of the exercise is to demonstrate CPU cache magic, these are the constraints Herb and Bjarne put on the exercise
  • I didn't try any tricky loop unrolling for my vector iteration, but I believe the iteration isn't affecting performance much anyway
  • I limited the loop to 10K items because ideone times out on larger sets, but increasing the size doesn't change the results

edit: see https://ideone.com/916fVd for a modified example that only compares the performance of lookups. Linear searching exhibits the same relative performance.

Bernetta answered 2/7, 2014 at 23:58 Comment(23)
It's not clear from your description what types of operations they were discussing being faster; I would not expect insertion and removal to be faster, but for an ordered vector, searching that data structure (with std::find) would conceivably be faster than a list. The way you paraphrase it, it sounds like he's talking about "if you start with an ordered container" vs. the act of creating an ordered container. – Pose
Insert & remove on a vector are never going to outscale a map; that's impossible due to the packing behavior. But find (which you don't seem to time) is going to be a lot faster. You can't use std::find, however, since that will do a linear search; you have to do a binary search. If you have Boost, profile boost::container::flat_map against std::map and you'll see the results. Loki also has Loki::AssocVector, which is pretty much the same, if you have that available. – Clostridium
In the video linked, Herb specifically mentions searching through the structures. His exact words are (almost): "notice on the graph the map is just above the vector; that is a vector that is doing a linear search each time". I modified the code to just measure the lookup times and the results are still in line with the previous ones: ideone.com/916fVd – Bernetta
Hi @Ylisar, see ideone.com/916fVd for an example that shows the performance of linear searches vs. the binary lookup. – Bernetta
@Bernetta try using std::binary_search instead. – Question
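For reference, a minimal sketch of the two lookup styles being debated, assuming a sorted std::vector<int> (illustrative only, not the benchmark code):

#include <algorithm>
#include <vector>

bool linear_lookup(const std::vector<int> & v, int val)
{
   // O(N), but walks memory sequentially, which the prefetcher rewards
   return std::find(v.begin(), v.end(), val) != v.end();
}

bool binary_lookup(const std::vector<int> & v, int val)
{
   // O(log N) on a sorted vector, as the comment above suggests
   return std::binary_search(v.begin(), v.end(), val);
}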
Hi @user657267, Herb specifically mentions linear lookups - he was demonstrating the efficiency of the CPU cache! My expectations are the same as yours, I just want to find out what Herb was talking about... – Bernetta
@Bernetta to be accurate, he was specifically noting the efficiency gains due to the prefetcher, in line with the linear access pattern of a vector. – Stockstill
@Stockstill yes, I agree. But he does specifically say linear lookup operations... he makes passing mention that in the real world you'd use binary search on a sorted vector, but his claims are about the performance of linear iteration through a vector (unless someone can indicate otherwise? Like I said, I might very well be wrong!) – Bernetta
@Cechner: If this is still alive tomorrow I'll make an example to try and disprove your test, as I'm at home without access to a decent dev setup :). As it is now, from glancing at your code I would propose using std::find for the vector when trying the linear search comparison. For one, you're computing end() every iteration through the loop even though it can't change (see the sketch below). Continued due to length restrictions... – Clostridium
The upper hand which vector has against std::map is cache locality. It's reasonable to assume that your map in this simple example will have terrific locality, since the whole map is filled straight up with no allocations in between. A more plausible case would perhaps be that there's "stuff" happening between every insertion into the map. In a test case such as this I would suggest filling the map with, let's say, 20 times as many elements as you're going to use, then randomly removing elements until you're at the same size as the vector, to more closely simulate the expected "real life" usage. – Clostridium
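A sketch of the end()-hoisting point raised above (purely illustrative; std::find does the same thing internally):

#include <vector>

// hand-rolled linear search with end() hoisted out of the loop, so it is
// not re-evaluated on every iteration
std::vector<int>::const_iterator find_linear(const std::vector<int> & v, int val)
{
   auto first = v.begin();
   const auto last = v.end(); // computed once
   while (first != last && *first != val) ++first;
   return first;
}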
@Ylisar: good point regarding the find - I hacked the lookup together in response to Joe's comment above. I have modified it, and interestingly the deque comes down to std::vector performance (a massive improvement) but std::vector performance actually degrades! ideone.com/916fVd – Bernetta
@Ylisar: on my local machine I have tested with very large amounts of data (billions) but ideone will terminate before executing more than 10K or so due to timeouts. The performance characteristics remain the same for large data sets. – Bernetta
@Cechner: That's very interesting; I'm intrigued, and unless anything happens during the night I will put together a simple test! I actually did a benchmark a while back of boost::container::flat_map against std::map out of curiosity, and as expected found the flat_map was faster, and of course it used less memory. As there's a constant penalty paid for every cache miss, however, I will say that for large sets std::map will always beat std::vector when the vector uses a linear search, which might be what you're experiencing. – Clostridium
@Clostridium yes, I agree, std::map should certainly out-perform std::vector. I think perhaps Herb discovered a flaw in his methodology or something, because Bjarne's original slides do not contain the one Herb was showing, and I haven't found any other reference to it on the web. – Bernetta
My experience measuring this about 5 years ago is that trees break even with arrays in the application of a compiler symbol table at about 200 entries (with block allocation of tree nodes to reduce malloc overhead). I was programming in C and the tree implementation was mine. These were small records, ~16 bytes plus the identifier string, which was also the comparison key. Symbol table operations are interspersed with lots of code and data activity that ought to challenge the cache. – Tachymetry
Hehe, there's always hyperbole in these sorts of talks, always trying to strike a point home. But I do think linked lists such as std::list should never be used; in practice I've found absolutely zero cases where its characteristics have been preferable. If you care about order and don't update the collection often, go for boost::container::flat_map/flat_set. If you don't care about order, go for std::vector and use the swap trick on the last element (sketched below). – Clostridium
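For the record, the "swap trick" mentioned above is the classic O(1) unordered erase (a minimal sketch; element order is not preserved):

#include <cstddef>
#include <utility>
#include <vector>

// overwrite the doomed element with the last one, then pop the back
void unordered_erase(std::vector<int> & v, std::size_t i)
{
   v[i] = std::move(v.back());
   v.pop_back();
}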
@Clostridium well, std::list and std::map don't invalidate iterators, so if keeping long-lived iterators trumps performance concerns, I imagine std::vector and boost::flat_map might not be the best fit. – Bernetta
@Cechner: That's true, and that's why boost::container::flat_map and its ilk aren't always a drop-in replacement. In my experience it's rather seldom that you rely on iterator validity in the presence of insert & erase. Often you fill your collection, and after construction it stays fairly stable; in such cases vector-like storage is often a big net win. If iterator validity is important, std::map/set might be the best fit. – Clostridium
@Ylisar: I've needed list one time in my life, so I guess it has its uses, haha. – Exurb
Probably the most surprising (and illuminating) SO question I've seen today... I tried to see if the function call (and vector.resize) overheads had anything to do with it, and whether there is an inflection point in memory with the number of elements, and I got these results (reserved and unreserved vectors, list, and set): ... contd... – Possess
... contd... (TL;DR: set = waaay faster than reserved/unreserved vec, list = ~0.8x). Times in ms:

Size     RVector   UVector   List    Set
5000     344       335       278     2
10000    1333      1294      1103    3
15000    3009      2934      2548    5
20000    5342      5226      4538    6
25000    8375      8191      7132    8

– Possess
@Cechner: You're right that std::list doesn't invalidate iterators; that's why boost::stable_vector was made! All the pros of std::list, with few of the cons! – Lovesick
I've modified your test to be closer to the problem as stated in one of the answers, and my results show the vector beating the set by a landslide. ideone.com/sSh2vg – Rives
Answer (score 8)

I found the slides for easier reference (I can't see the graphs, but I guess that might be because of a proprietary file format). The relevant slide is number 39, which describes the problem being solved:

§ Generate N random integers and insert them into a sequence so that each is inserted in its proper position in the numerical order.

§ Remove elements one at a time by picking a random position in the sequence and removing the element there.
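
A direct reading of that problem for a vector might look like the following sketch (my interpretation of the slide, not Sutter's actual benchmark code):

#include <cstddef>
#include <cstdlib>
#include <vector>

void run_problem(std::size_t N)
{
   std::vector<int> seq;
   // insert N random integers, each at its proper position in numerical order
   for (std::size_t i = 0; i < N; ++i)
   {
      int val = std::rand();
      auto pos = seq.begin();
      while (pos != seq.end() && *pos < val) ++pos; // linear scan for the spot
      seq.insert(pos, val);
   }
   // remove elements one at a time by picking a random *position*;
   // note that no search by value is involved here
   while (!seq.empty())
      seq.erase(seq.begin() + std::rand() % seq.size());
}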

Now, it should be rather obvious that a linked list is not a good choice for this problem. Even though a list is much better than a vector for inserting/removing at the beginning or in the middle, it's not good for inserting/removing at a random position because of the need for a linear search. And linear search is much faster with vectors because of their better cache efficiency.

Sutter suggests that a map (or a tree in general) would seem a natural choice for this algorithm because you get O(log n) search. And indeed, it does beat the vector quite easily for large N values in the insertion part.

Here comes the "but". You need to remove the nth element (for random n). This is where I believe your code is cheating: you remove the elements in the order they were inserted, effectively using the input vector as a lookup table for finding the value of an element at a "random" position, so that you can search for it in O(log n). So you're really using a combination of a set and a vector to solve the problem.
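To make the distinction concrete, here is the difference for std::set (my paraphrase of the two behaviours, not code from the question):

#include <cstddef>
#include <iterator>
#include <set>

// what the question's benchmark effectively does: erase by *value*, using
// the shuffled input vector as a free position->value lookup table
void erase_by_value(std::set<int> & s, int val)
{
   s.erase(val); // O(log n)
}

// what the slide's problem actually requires: erase the element at a random
// *position* n, which for std::set means an in-order walk to reach it
void erase_by_position(std::set<int> & s, std::size_t n)
{
   auto it = s.begin();
   std::advance(it, n); // O(n) -- the linear search in disguise
   s.erase(it);
}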

A regular binary search tree, such as the one used for std::map or std::set (which I assume Sutter used), doesn't have a fast algorithm for finding the nth element. Here's one that is claimed to be O(log n) on average and O(n) in the worst case. But std::map and std::set don't provide access to the underlying tree structure, so for those you're stuck with in-order traversal (correct me if I'm wrong), which is a linear search again! I'm actually surprised that the map version is competitive with the vector one in Sutter's results.

For O(log n) complexity, you need a structure such as an order statistic tree, which is unfortunately not provided by the standard library. There's a GNU policy-based STL map, as shown here.
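For reference, a minimal example of that policy-based tree with order statistics (GCC-specific; a sketch based on the pb_ds interface, not the code behind the link):

#include <functional>
#include <ext/pb_ds/assoc_container.hpp>
#include <ext/pb_ds/tree_policy.hpp>

using ordered_set = __gnu_pbds::tree<
   int, __gnu_pbds::null_type, std::less<int>,
   __gnu_pbds::rb_tree_tag,
   __gnu_pbds::tree_order_statistics_node_update>;

int main()
{
   ordered_set t;
   for (int v : {5, 1, 9, 3}) t.insert(v);
   auto it = t.find_by_order(2); // iterator to the element at position 2 (value 5)
   t.erase(it);                  // erasing at a random *position* is O(log n)
}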

Here is some quick test code I made for vector vs. set vs. OST (plus vector with binary search for good measure): https://ideone.com/DoqL4H. The set is much slower, while the other tree-based structure is faster than the vector, which is not in line with Sutter's results.

order statistics tree: 15.958ms
vector binary search: 99.661ms
vector linear search: 282.032ms
set: 2004.04ms

(N = 20000; the difference is only going to grow in favor of the OST with bigger values)

In short, I came to the same conclusion that Sutter's original results seem odd, but for a slightly different reason: it seems to me that better asymptotic complexity beats the lower constant factors this time.

Note that the problem description doesn't exclude the possibility of duplicate random values, so using map / set instead of multimap / multiset is cheating a bit in favor of the map / set, but I assume that has only small significance when the value domain is much larger than N. Also, pre-reserving the vector doesn't improve the performance significantly (around 1% when N = 20000).

Rudderhead answered 3/7, 2014 at 13:10 Comment(4)
Oh, I once wrote an order statistic tree, though I've never heard that name before. – Lovesick
This is the correct answer - I didn't realise the weird (and unrealistic!) 'removal' clause (who removes random elements???). I think it was a little bit disingenuous for Herb to include this test in a presentation about cache locality, as the 'removal' clause is really just showing that std::map doesn't have a random access iterator. Bjarne's original slides are more general and less deceptive in this manner. Never heard of an OST before - thanks for that! – Bernetta
By the way, reddit user nullsucks also came to the same conclusion: reddit.com/r/cpp/comments/27hoi4/… – Bernetta
@Bernetta Heh, I didn't know about the OST either. I figured there must be a tree structure that fares well for this problem and came across it while researching my answer, so thanks for asking this question! – Rudderhead
Answer (score -1)

It's of course difficult to give a precise answer in the absence of source code plus information about compiler options, hardware, etc.

A couple of possible differences:

  • I think the speaker is talking about inserts/removals in the middle of the vector each time (whereas you always add to the end and remove from the beginning in your example);
  • The speaker makes no mention of how to determine which items are added/removed, but in that case we may as well assume that minimum effort is made to determine this: you make a vector access each time, whereas the inserted values may well simply be calculated (e.g. using a low-cost PRNG to determine the next value to insert) or always be the same, and for removal the middle element could be removed each time, so no value need be looked up/calculated (see the sketch after this list).
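
A minimal sketch of that minimum-effort scheme (my reading of the bullet above, purely illustrative):

#include <cstdlib>
#include <vector>

void minimum_effort_step(std::vector<int> & v)
{
   // the inserted value comes straight from a cheap PRNG -- no container lookup
   v.insert(v.begin() + v.size() / 2, std::rand());
   // removal always targets the middle, so no value needs to be looked up either
   v.erase(v.begin() + v.size() / 2);
}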

However, as another commenter has mentioned, I would take away the general principle rather than the specific numbers/timings. Essentially, the take-away message is: what you thought you knew about "counting operations" for the sake of assessing algorithm performance/scalability is no longer true on modern systems.

Alsatia answered 3/7, 2014 at 2:4 Comment(2)
Thanks for the comment, but my example inserts/removes random values at random locations... the input vector is shuffled before the timed operations are performed. – Bernetta
Ah OK, sorry, my bad -- I should have looked more closely. Nonetheless, have you tried timings with inserting/removing from the middle each time? (Although on average you'd expect the same number of "operations", again the effect we're testing is that operation-counting doesn't necessarily work as expected...) – Alsatia
