How is vector<vector<int>> "heavier" than vector<pair<int,int>>?
Asked Answered
S

7

69

During a recent interview, I suggested using vector<pair<int,int>> over vector<vector<int>> since we only wanted to store two values for every entry in the vector. I said something along the lines of "we should use vector<pair<int,int>> over vector<vector<int>> since the latter is heavier than the former".

After the coding session was over, they said it was a good idea to use a pair over a vector and asked me to elaborate on what I meant by "heavier" earlier. I wasn't able to elaborate, unfortunately. Yes, I know we can store only two values in a pair but many more in a vector, and that a vector gets resized automatically when its size reaches its capacity, etc. But how should I have answered their question: why specifically was using vector<pair<int,int>> better than vector<vector<int>>? What extra work is done in the latter case?

Sawfish answered 24/7, 2022 at 1:39 Comment(8)
A vector has to deal with variable size, and the data goes on the heap. A pair has zero overhead since the size is fixed.Penetration
Fewer indirections, better cache locality.Paramagnetism
@Paramagnetism could you please elaborate?Sawfish
If one vector<int> is heavier than one std::pair<int,int>, then a vector of the former would be heavier than a vector of the latter.Shapely
With vector<vector<int>> you need a dynamic memory allocation for each pair, in addition to the allocation for the outer vector. Dynamic memory allocation is not generally fast, and the result can have poor memory locality (consecutive elements may not be near each other in memory). Modern computer architecture likes to access objects which are near other objects it has accessed recently, and can run orders of magnitude faster when that is the case. With vector<pair<int,int>> all of the elements are consecutive, which will help when you have to work on the whole container.Arrest
Also, usually even sizeof(std::pair<int, int>) < sizeof(std::vector<int>) as well, but this is not so important compared to the time overhead of dynamic allocation and the memory locality issues mentioned in the comments above.Expediential
If you can't say what "heavier" means, why did you describe it that way? By not being able to elaborate, the interviewers might reasonably have concluded that you talk a reasonable game but don't necessarily have the understanding to back it up. In any event, reasonable responses might have been to compare sizeof(std::vector<int>) with sizeof(std::pair<int, int>), or to talk about how std::vector has a number of members it uses to manage memory while std::pair is essentially a simple data structure that contains two members and does no memory management.Companionway
S
78

Each vector is a single contiguous area of memory, dynamically allocated.

Let's say that you have 1000 values you'll be working with.

std::vector<std::pair<int, int>>

This gets you a single, contiguous block of memory, for 2000 integers.

std::vector<std::vector<int>>

This gets you a single contiguous block of memory for 1000 vectors.

Each one of those 1000 std::vectors gets you another contiguous block of memory for just two integers.

So, instead of one single contiguous block of memory, for this data structure, it will consist of 1001 blocks of memory scattered all over. You have no guarantees, whatsoever, that all those blocks of memory will be contiguous, one after another.

Each dynamic memory allocation comes at a cost. The cost is fairly small but it adds up very, very quickly. A single penny is easily ignored. A thousand pennies should be enough to get you a cup of coffee at Starbucks.

Furthermore, modern CPUs are very, very good at accessing contiguous blocks of memory. Iterating over a single contiguous block of memory to add up two thousand ints will be much, much faster than doing the same over a thousand disjointed sections of memory.

Strung answered 24/7, 2022 at 1:59 Comment(2)
One note on the costs. If you are already doing it in 99 places, adding one more costs relatively little. (Else nobody would program in Python!) But getting from 3 to 2 performance mistakes has a big impact, and going from 2 to 1 is bigger. And that last, well...Errantry
On top of that, you have 1000 control-blocks for vector<int> objects, each of those being the size of 3 pointers (on normal implementations), pointing to those 1000 scattered allocations. On a typical 64-bit system, that's 24 bytes of overhead for every 8 bytes of data, on top of the dynamic allocation bookkeeping data, which is probably at least 8 bytes per allocation. Probably more, especially on systems where alignof(max_align_t) is 16 so each allocation is 16-byte aligned. And yeah, all this indirection is bad for compile-time SIMD optimization, and for CPUs if actually scattered.Gennie
P
30

You can answer this without reference to any particular language. The problem called for storing a sequence of 2-tuples. Your chosen type should be capable of storing 2-tuples, of course, but also be incapable of storing tuples of other sizes. So given two types that are both capable of storing the desired values, prefer the one that is less capable of storing undesired values.

vector<int> would allow you to store 2-element vectors, but also empty vectors, singleton vectors, 3-element vectors, 4-element vectors, etc. pair<int,int> is more precise, since it can only store exactly two values.

(Not to discount the performance benefits mentioned in the accepted answer, only to provide a purely semantic argument for using precise types.)

Partridge answered 24/7, 2022 at 12:23 Comment(2)
Exactly. The performance is better because the types are more precise, so we can use more specific algorithms and data structures instead of more general-purpose ones (for example simply storing two integers instead of storing a pointer to an array of arbitrarily many integers). Using more precise types also expresses intent better.Devolution
std::array<int,2> would also have been totally fine and performed the same, since it holds the values inside the array object itself, not pointers to them. (Same object-representation as std::pair<int,int>). Your argument applies equally to it; the ,2 size is part of the type, not runtime-variable data. So even though std::array in general can hold any number of ints, that instantiation of the template can hold exactly two, same as a pair. (And they have to both be the same type, while a pair supports different types. But pair<int,int> doesn't.)Gennie
L
18

As others mentioned, std::vector<int> carries extra bookkeeping, for example a count of the number of elements it holds.

But an interesting aspect you could have suggested in the interview would be to use std::array<int, 2>. It should have a similar cost to std::pair<int, int>, as it stores the numbers in a fixed-size array. One advantage would be the API, which allows you to use a[0] instead of a.first and is also easier to generalize when you need to store, for example, three values per entry after some new feature is added.

Levan answered 24/7, 2022 at 20:4 Comment(2)
Yup, and a normal implementation of it will have the same object-representation as std::pair<int,int>, and should compile to exactly the same asm. Use whichever is semantically more meaningful for your use-case; foo[i][0] and foo[i][1] is good if the two ints have similar meanings; foo[i].first / .second is maybe good if they're dissimilar. Or you could use an enum or wrapper class to give meaningful names to the array indices. (Or probably just use a custom struct instead of pair<> or array<> if you want to have meaningful member names!)Gennie
The advantage of foo.second and std::get<1>(foo) over foo[1] is that the former is checked for oob at compile time while foo[1] can be oob at runtime. You can also just use std::tuple if you want to extend later.Briarroot
C
14

To simplify the explanation, let's say that

  • A[ a | b ] B[ c ] means: a and b are in chunk A and c in chunk B.
  • Chunks here are contiguous pieces of memory, so a is next to b

With that in mind, let's see an example: the memory usage of { { 1, 1 }, { 2, 2 }, ... }

For std::vector<std::vector<int>>

  • A[ size info | ptr to B ]
  • B[ [ size info | ptr to C ] | [ size info | ptr to D ] | ... ]
  • C[ 1 | 1 ]
  • D[ 2 | 2 ]

For std::vector<std::pair<int, int>>

  • A[ size info | ptr to B ]
  • B[ [ 1 | 1 ] | [ 2 | 2 ] | ... ]

I think the example makes it very clear: there is one less layer of indirection with std::vector<std::pair<int, int>>. Meaning

  1. There is less memory consumption (you don't need an extra size variable and a pointer to a separate chunk for each element).
  2. Getting a desired value takes fewer steps (otherwise you would first have to load and read the pointer, and only then load the desired value from that address).
Codfish answered 25/7, 2022 at 0:30 Comment(9)
Good answer. I'd just add that, strictly speaking, std::vector is not required to consist of just pointer to the data and size (standard doesn't mandate such implementation details) and usually also contains at least its capacity to track whether the vector used all currently allocated space and needs reallocation during, e.g., a push_back. Also, using more dynamic allocation/deallocation in vector of vectors case is in itself a massive additional overhead not just because more memory is used, but also because this process is pretty slow.Psychodiagnostics
@YurkoFlisk: A normal std::vector implementation has 3 pointers: data, end of allocation (.reserve() or automatic growth), and end of "in use" area (.size() = .end() - .begin()). An implementation could use a pointer and two size_t members, but in practice mainstream implementations use pointers. Either way, sizeof(std::vector<T>) is 24 bytes on a typical 64-bit system (or 12 on a 32-bit system), while std::array<int,2> or std::pair<int,int> are each 8 bytes.Gennie
@YurkoFlisk: So std::vector wastes 24 bytes for each 8 of real data, more if you count dynamic allocation bookkeeping. Not to mention potentially scattering your data around, and introducing a level of indirection.Gennie
@PeterCordes Thanks for correction about typical vector implementation. In any case, implementation using only 2 pointer/size fields (pointer to data and size/(pointer to end)) would probably be unable to be standard conformant regarding either correctness or average complexity of operations (it would need to keep "capacity" equal to size in general case and, at best, possibly some "small vector" optimization). About additional level of indirection, it was mentioned in the answer.Psychodiagnostics
@Psychodiagnostics Weirdly, though, I only find the complexity requirements "at most linear" and "amortized constant" for std::vector<>::reserve() and std::vector<>::emplace_back(), respectively. As such, an implementation could indeed ignore calls to std::vector<>::reserve() and derive the current allocation size as a deterministic function of std::vector<>::size(). For instance, it could always round std::vector<>::size() up to the next power of 2, providing exactly the expected complexity, while avoiding explicitly storing an allocated size.Rout
@Psychodiagnostics That said, it would still be nonsense to "optimize away" that explicit allocated size field: Since we are already talking about dynamic memory allocation here, storing a single additional pointer variable is not a relevant cost. However, having that additional variable may easily avoid many reallocation operations that would result from having to shrink the allocation when deleting elements from the vector (and subsequently more reallocations when the vector grows again).Rout
@cmaster-reinstatemonica Interesting idea, but according to vector.overview "A vector is a sequence container that supports (amortized) constant time insert and erase operations at the end", and your implementation seems to fail to conform to this regarding sequence of alternating push_back and pop_back operations applied to power-of-two sized vector. At least if the wording is to be interpreted that they are amortized "jointly".Psychodiagnostics
@cmaster-reinstatemonica Also, according to vector.capacity.4 reserve(n) can't be a no-op, because after this call capacity() has to be greater or equal than n.Psychodiagnostics
@Psychodiagnostics Ah, that detail escaped my notice. Ok, so the capacity field seems to be required. I find it always strange when the wording of a standard is kept implementation agnostic, but puts up such strict bounds that only a single implementation actually fits the bill. I mean, they could just have said that std::vector<> is required to store both its size and capacity, and that it is required to increment its capacity by at least a constant factor. Would make life easier for everyone, imho.Rout
M
7

A vector is a dynamically-resizing array. You sacrifice some performance to get the ability to resize dynamically.

A vector of vectors (vector<vector<int>>) has that performance overhead for both the outer vector and each of its elements. With a vector of pairs (vector<pair<int, int>>), you avoid the latter. A pair is always of a fixed size, so you needn't worry about resizing it (and relocating it to another position in memory).

Meanly answered 25/7, 2022 at 13:28 Comment(0)
I
4

My "simple" / "naive" answer would be:

A vector<pair<int, int>> knows that it will always hold pairs of ints, so it can allocate memory accordingly (e.g. when the vector resizes), possibly in one contiguous chunk. Also, it only needs to keep track of the fact that it stores X pairs of ints, enabling fast access to those ints and keeping overhead to a minimum. Finally, with that information available at compile time, the compiler can (possibly) optimize the code better.

A vector<vector<int>> needs to be able to store X times any number of ints. It is likely that the outer vector only stores the addresses of the inner vectors (to facilitate fast access), which means your data is likely to be scattered all over memory. Also, the inner vectors need to keep track of the number of ints they contain (even though this number should always be two), adding unnecessary overhead to both storing and accessing the ints. Finally, the compiler can make fewer assumptions about the structure of your data, reducing the potential for optimizations.

Intention answered 26/7, 2022 at 7:51 Comment(0)
I
2

Lean is beautiful: a std::pair<int, int> corresponds to exactly two integers. And that's exactly what you wanted: no more, no less.

And it's performant: there is no overhead; the C++ standard defines the pair as a simple struct. So there is no memory-management overhead, and member access is direct, since everything that could be time-consuming is prepared at compile time.

Here is an example, initialising a pair<int,int> and calling a function with a reference to it:

void test1(int a, int b) {
    auto x = std::make_pair(a,b);
    f(x);   
}

And here is the code generated by gcc with optimization enabled:

    sub     rsp, 24
    mov     DWORD PTR [rsp+8], edi
    lea     rdi, [rsp+8]
    mov     DWORD PTR [rsp+12], esi
    call    f(std::pair<int, int>&)
    add     rsp, 24
    ret

In comparison, doing the same with a vector<int> generates 31 lines of assembly, because of the dynamic allocation, but also the need to cope with allocation errors, and of course the more complex destruction when the vector is no longer needed. See here for the full details.

(To complete the picture, some algorithms can take advantage of this simplicity and offer a pair specialization)

Illfounded answered 31/7, 2022 at 18:40 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.