Cache-friendliness std::list vs std::vector
With CPU caches becoming better and better, std::vector usually outperforms std::list even in scenarios that play to std::list's strengths. For this reason, even in situations where I need to delete/insert in the middle of the container, I usually pick std::vector, but I realized I had never actually tested whether that assumption holds. So I set up some test code:

#include <iostream>
#include <chrono>
#include <list>
#include <vector>
#include <random>
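
// Erase 10,000 elements one by one, starting from the middle of each container and walking forward.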

void TraversedDeletion()
{
    std::random_device dv;
    std::mt19937 mt{ dv() };
    std::uniform_int_distribution<> dis(0, 100000000);

    std::vector<int> vec;
    for (int i = 0; i < 100000; ++i)
    {
        vec.emplace_back(dis(mt));
    }

    std::list<int> lis;
    for (int i = 0; i < 100000; ++i)
    {
        lis.emplace_back(dis(mt));
    }

    {
        std::cout << "Traversed deletion...\n";
        std::cout << "Starting vector measurement...\n";

        auto now = std::chrono::system_clock::now();
        auto index = vec.size() / 2;
        auto itr = vec.begin() + index;
        for (int i = 0; i < 10000; ++i)
        {
            itr = vec.erase(itr);
        }

        std::cout << "Took " << std::chrono::duration_cast<std::chrono::microseconds>(std::chrono::system_clock::now() - now).count() << " μs\n";
    }

    {
        std::cout << "Starting list measurement...\n";

        auto now = std::chrono::system_clock::now();
        auto index = lis.size() / 2;
        auto itr = lis.begin();
        std::advance(itr, index);
        for (int i = 0; i < 10000; ++i)
        {
            auto it = itr;
            std::advance(itr, 1);
            lis.erase(it);
        }

        std::cout << "Took " << std::chrono::duration_cast<std::chrono::microseconds>(std::chrono::system_clock::now() - now).count() << " μs\n";
    }

}
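
// Erase 10,000 elements, each at a freshly drawn random index, from both containers.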

void RandomAccessDeletion()
{
    std::random_device dv;
    std::mt19937 mt{ dv() };
    std::uniform_int_distribution<> dis(0, 100000000);

    std::vector<int> vec;
    for (int i = 0; i < 100000; ++i)
    {
        vec.emplace_back(dis(mt));
    }

    std::list<int> lis;
    for (int i = 0; i < 100000; ++i)
    {
        lis.emplace_back(dis(mt));
    }

    std::cout << "Random access deletion...\n";
    std::cout << "Starting vector measurement...\n";
    std::uniform_int_distribution<> vect_dist(0, vec.size() - 10000);

    auto now = std::chrono::system_clock::now();

    for (int i = 0; i < 10000; ++i)
    {
        auto rand_index = vect_dist(mt);
        auto itr = vec.begin();
        std::advance(itr, rand_index);
        vec.erase(itr);
    }

    std::cout << "Took " << std::chrono::duration_cast<std::chrono::microseconds>(std::chrono::system_clock::now() - now).count() << " μs\n";

    std::cout << "Starting list measurement...\n";

    now = std::chrono::system_clock::now();

    for (int i = 0; i < 10000; ++i)
    {
        auto rand_index = vect_dist(mt);
        auto itr = lis.begin();
        std::advance(itr, rand_index);
        lis.erase(itr);
    }

    std::cout << "Took " << std::chrono::duration_cast<std::chrono::microseconds>(std::chrono::system_clock::now() - now).count() << " μs\n";
}

int main()
{
    RandomAccessDeletion();
    TraversedDeletion();
    std::cin.get();
}

All results are compiled with /O2 (Maximize Speed).

The first test, RandomAccessDeletion(), generates a random index and erases the element at that index, 10,000 times in a row. My assumptions were right and the vector is indeed a lot faster than the list:

Random access deletion...

Starting vector measurement...

Took 240299 μs

Starting list measurement...

Took 1368205 μs

The vector is about 5.6x faster than the list. We can most likely thank our cache overlords for this performance benefit: even though we need to shift the elements in the vector on every deletion, its impact is smaller than the lookup time of the list, as we can see in the benchmark.


So then I added another test, seen in TraversedDeletion(). It doesn't delete at randomized positions; instead it picks an index in the middle of the container, uses that as the base iterator, and then traverses the container, erasing 10,000 times.

My assumption was that the list would outperform the vector only slightly, or be roughly as fast.

The results for the same execution:

Traversed deletion...

Starting vector measurement...

Took 195477 μs

Starting list measurement...

Took 581 μs

Wow. The list is about 336x faster. This is really far from what I expected. So a few cache misses in the list don't seem to matter at all here, since cutting out the lookup time for the list weighs in far more.


So does the list still hold a really strong position performance-wise for corner/unusual cases like this, or are my test cases flawed in some way?

Does this mean that std::list nowadays is only a reasonable option for lots of insertions/deletions in the middle of a container while traversing it, or are there other cases?

Is there a way I could change the vector access & erasure in TraversedDeletion() to make it at least a bit more competitive with the list?


In response to @BoPersson's comment:

vec.erase(it, it+10000) would perform a lot better than doing 10000 separate deletes.

Changing:

for (int i = 0; i < 10000; ++i)
{
    itr = vec.erase(itr);
}

To:

vec.erase(itr, itr + 10000);

Gave me:

Starting vector measurement...

Took 19 μs

This is a major improvement already.

Distinguished answered 8/11, 2016 at 14:31 Comment(13)
Vector erase test exhibits undefined behavior. vec.erase(it) invalidates itr. You want itr = vec.erase(itr); – Sinister
Depends on what "erase" does on a vector. It most likely copies the memory into a new memory location (except the removed element), so you have no cache friendliness on a vector erase. And also lots of "copying". – Distichous
@NathanOliver No he does not. I'm not sure what you are talking about. – Sinister
@IgorTandetnik Oops. I looked at that wrong. vec.erase(it) does invalidate itr and he is using that in each subsequent iteration. I was looking at RandomAccessDeletion, not TraversedDeletion. – Bolduc
@Hayt: all iterators before the erased one are not invalidated - the allocated memory area is the same, the elements are just "shifted" back to take the place of the erased one. – Loanloanda
@IgorTandetnik Whoops, nice catch, let me remeasure and edit. – Distinguished
@GillBates: you can use the return value of vector::erase to get a valid itr. – Loanloanda
@VittorioRomeo ah thx :) – Distichous
@IgorTandetnik Edited and factored out the UB. – Distinguished
vec.erase(it, it+10000) would perform a lot better than doing 10000 separate deletes. – Pigeonhole
@BoPersson Updated the question with your suggestion; the performance benefit seems almost unreal, how come this is so much faster? – Distinguished
Erasing a range means elements only need to move once, so about 40,000 copies. Erasing one by one means the same 40,000-50,000 elements need to move 10,000 times each, for a total of about 450,000,000 copies. – Sinister
@IgorTandetnik Ah, of course, the moving only happens once with a single erasure, thanks. – Distinguished
B
5

In TraversedDeletion you are essentially doing a pop_front, but instead of being at the front you are doing it in the middle. For a linked list this is not an issue: deleting the node is an O(1) operation. Unfortunately, when you do this in the vector it is an O(N) operation, where N is vec.end() - itr. This is because it has to copy every element from the deletion point forward by one position. That is why it is so much more expensive in the vector case.

On the other hand, in RandomAccessDeletion you are constantly changing the deletion point. This means you have an O(N) operation to traverse the list to get to the node to delete and an O(1) operation to delete the node, versus an O(1) traversal to find the element and an O(N) operation to copy the elements of the vector forward. The reason these do not come out the same is that the cost of traversing from node to node has a higher constant than the cost of copying elements in the vector.
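
To put rough numbers on that asymmetry, here is a small counting sketch; the per-erase costs are simplified estimates based on the question's sizes, not measurements:

#include <cstddef>
#include <iostream>

int main()
{
    const std::size_t n = 100000;       // container size, as in the question
    const std::size_t erasures = 10000;

    // TraversedDeletion: erase forward from the middle, one element at a time.
    std::size_t vec_moves = 0;          // vector: shifts the remaining tail on every erase
    std::size_t list_hops = 0;          // list: unlinks a node and steps to the next
    for (std::size_t i = 0; i < erasures; ++i)
    {
        vec_moves += n / 2 - i;
        list_hops += 1;
    }

    // RandomAccessDeletion: the target is on average roughly n / 2 elements in.
    const std::size_t vec_moves_rand = erasures * (n / 2); // elements shifted per erase
    const std::size_t list_hops_rand = erasures * (n / 2); // nodes walked per erase

    std::cout << "Traversed: vector ~" << vec_moves << " moves, list ~" << list_hops << " hops\n";
    std::cout << "Random:    vector ~" << vec_moves_rand << " moves, list ~" << list_hops_rand << " hops\n";
}

In the random case the two containers end up doing a comparable number of operations, which is why the cheaper per-element cost of a contiguous shift lets the vector win there, while in the traversed case the list barely does any work at all.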

Bolduc answered 8/11, 2016 at 14:52 Comment(0)
D
4

The long duration for the list in RandomAccessDeletion is due to the time it takes to advance from the beginning of the list to the randomly selected element, an O(N) operation.

TraversedDeletion just increments an iterator, an O(1) operation.
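
For a list iterator there is no way to jump ahead, so std::advance has to step node by node. A simplified sketch of what it effectively does for a non-random-access iterator (the name advance_by is made up):

template <class BidirIt>
void advance_by(BidirIt& it, long n)   // rough equivalent of std::advance for a list iterator
{
    for (long i = 0; i < n; ++i)
        ++it;                          // one pointer chase per step
}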

Drops answered 8/11, 2016 at 14:38 Comment(8)
TraversedDeletion also includes std::advance(itr, index); in the measurement. I would be curious which portion of the time goes into this, vs the actual deletion. – Sinister
@IgorTandetnik Traversed deletion advances by 1. Random deletion advances by N, where N is between 0 and 10,000 (average 5,000). To advance in a list you need to walk the list elements N times, so the random deletion needs to do lots more work to get to the element to delete. – Drops
It advances by 1 most of the time - but there's a one-time setup call to move the iterator to the middle of the list, and it's included in the measurement. – Sinister
@IgorTandetnik But that's one time, not 10,000 times. – Drops
But it requires traversing 50,000 nodes, not 10,000 nodes. There's a loop built into std::advance. – Sinister
@IgorTandetnik 50,000 + N is much less than 5,000 * 10,000. – Drops
You are comparing RandomAccessDeletion with TraversedDeletion. I'm not - there's no doubt that the former is much more inefficient on a list. All I'm saying is that TraversedDeletion performs two operations, one of which has nothing to do with erase proper, but it's nevertheless included in the test's measurement. I'd be curious to know what percentage of the time measured for TraversedDeletion is taken by that single expensive advance call, as opposed to the erase loop. I'm not at all arguing against your statement - I suspect we are in violent agreement. – Sinister
@IgorTandetnik In the random access loop, you have 10,000 loop iterations, with an average advance of 5,000 every loop. Therefore an average of 50,000,000 elements to traverse vs 50,000 + 10,000 == 60,000 for the traversed deletion. – Drops
D
2

The "fast" part about a vector is "reaching" the element which needs to be accessed (traversing). You don't actually traverse much on the vector in the deletion but only access the first element. ( I would say the adavance-by-one does not make much measurement wise)

The deletion then takes quite a lot of time ( O(n) so when deleting each one by itself it's O(n²) ) due to changing the elements in the memory. Because the deletion changes the memory on the locations after the deleted element you also cannot benefit from prefetching which also is a thing which makes the vector that fast.

I am not sure how much the deletion also would invalidate the caches because the memory beyond the iterator has changed but this can also have a very big impact on the performance.

Distichous answered 8/11, 2016 at 14:50 Comment(0)
A
2

In the first test, the list had to traverse to the point of deletion, then delete the entry. The time the list took was in traversing for each deletion.

In the second test, the list traversed once, then repeatedly deleted. The time taken was still in the traversal; the deletion was cheap. Except now we don't repeatedly traverse.

For the vector, traversal is free. Deletion takes time. Randomly deleting an element takes less time than it takes the list to traverse to that random element, so the vector wins in the first case.

In the second case, the vector does its hard work many, many more times than the list does its hard work.

But the problem is that this isn't how you should traverse-and-delete from a vector. It is an acceptable way to do it for a list.

The way you'd write this for a vector is std::remove_if, followed by erase. Or just one erase:

  auto index = vec.size() / 2;
  auto itr = vec.begin() + index;
  vec.erase(itr, itr+10000);

Or, to emulate a more complex decision making process involving erasing elements:

  auto index = vec.size() / 2;
  auto itr = vec.begin() + index;
  int count = 10000;
  auto last = std::remove_if( itr, vec.end(),
    [&count](auto&&){
      if (count <= 0) return false;
      --count;
      return true;
    }
  );
  vec.erase(last, vec.end());

Almost the only case where list is way faster than vector is when you store an iterator into the list, and you periodically erase at or near that iterator while still traversing the list between such erase actions.
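
A minimal sketch of that pattern (erase_at_cursor and should_erase are made-up names for illustration):

#include <list>

template <class Pred>
void erase_at_cursor(std::list<int>& lis, std::list<int>::iterator& cursor, Pred should_erase)
{
    // Erasing other nodes never invalidates this cursor, so it can be kept
    // across many calls while the rest of the list is traversed elsewhere.
    if (cursor == lis.end())
        return;
    if (should_erase(*cursor))
        cursor = lis.erase(cursor);   // O(1) unlink; cursor now points at the next node
    else
        ++cursor;
}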

Almost every other use case has a vector use-pattern that matches or exceeds list performance in my experience.

The code cannot always be translated line-for-line, as you have demonstrated.


Every time you erase a single element from a vector, it moves the "tail" of the vector over by 1.

If you erase 10,000 elements with one range erase, it moves the "tail" of the vector over by 10,000 in a single step.

If you use remove_if, it moves the tail over efficiently, leaves the "wasted" elements at the end, and you can then erase that waste from the vector.
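
With the question's numbers that is roughly 50,000 + 49,999 + ... summed over 10,000 erasures, on the order of 450,000,000 element moves for the one-by-one loop, versus a single shift of the surviving ~40,000 elements for the range erase (a back-of-the-envelope estimate, not a measurement).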

Analisaanalise answered 8/11, 2016 at 18:10 Comment(0)
A
0

I want to point out something not yet mentioned in this question:

In a std::vector, when you delete an element in the middle, the elements after it are moved, and thanks to move semantics they are moved rather than copied. That is one of the reasons the first test is as fast as it is: you are not even copying the elements after the deleted iterator. You could reproduce the experiment with a vector and a list of a non-copyable (move-only) type and see how the performance of the list is better in comparison.
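
For such an experiment, a minimal move-only wrapper could stand in for int; an illustrative sketch (the name MoveOnly is made up):

#include <memory>

struct MoveOnly
{
    std::unique_ptr<int> value;                       // owning pointer makes the type move-only
    explicit MoveOnly(int v) : value(std::make_unique<int>(v)) {}
    MoveOnly(MoveOnly&&) noexcept = default;
    MoveOnly& operator=(MoveOnly&&) noexcept = default;
};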

Ambrosine answered 10/11, 2016 at 21:33 Comment(1)
1. It's a vector<int>; move and copy operations are exactly the same for int (or any primitive data type, for that matter), so there is no speed-up through move semantics. 2. A vector is contiguous: if you delete an element in the middle, all elements after it MUST be moved over to fill the gap, so the type must at least be movable; making it non-copyable doesn't change the amount of work. – Devonian
F
0

I would suggest running the same tests using a more complex data type in the std::vector: instead of an int, use a structure.

Even better, use a fixed-size C array as the vector element, and then take measurements with different array sizes.

So, you could swap this line of your code:

std::vector<int> vec;

with something like:

const size_t size = 256;
struct TestType { int a[size]; };
std::vector<TestType> vec;

and test with different values of size. The behavior may depend on this parameter.
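
The fill loops would also need a small tweak, since TestType cannot be constructed directly from an int; one possible adaptation:

for (int i = 0; i < 100000; ++i)
{
    TestType t{};            // zero-initialize the array
    t.a[0] = dis(mt);        // keep a random value so the work isn't optimized away
    vec.emplace_back(t);
}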

Forgive answered 20/4, 2022 at 10:24 Comment(0)
