Utilize memory past the end of a std::vector using a custom overallocating allocator

F

2

1

Let's say I have an allocator my_allocator that will always allocate memory for n+x (instead of n) elements when allocate(n) is called.

Can I savely assume that memory in the range [data()+n, data()+n+x) (for a std::vector<T, my_allocator<T>>) is accessible/valid for further use (i.e. placement new or simd loads/stores in case of fundamentals (as long as there is no reallocation)?

Note: I'm aware that everything past data()+n-1 is uninitialized storage. The use case would be a vector of fundamental types (which do not have a constructor anyway) using the custom allocator to avoid having special corner cases when throwing simd intrinsics at the vector. my_allocator shall allocate storage that is 1.) properly aligned and has 2.) a size that is a multiple of the used register size.

To make things a little bit more clear:

Let's say I have two vectors and I want to add them:

std::vector<double, my_allocator<double>> a(n), b(n);
// fill them ...
auto c = a + b;
assert(c.size() == n);

If the storage obtained from my_allocator now allocates aligned storage and if sizeof(double)*(n+x) is always a multiple of the used simd register size (and thus a multiple of the number of values per register) I assume that I can do something like

for(size_t i=0u; i<(n+x); i+=y) 
{ // where y is the number of doubles per register and and divisor of (n+x)
    auto ma = _aligned_load(a.data() + i);
    auto mb = _aligned_load(b.data() + i);
    _aligned_store(c.data() + i,  _simd_add(ma, mb)); 
}

where I don't have to care about any special case like unaligned loads or backlog from some n that is not dividable by y.

But still the vectors only contain n values and can be handled like vectors of size n.

Flivver answered 15/10, 2016 at 2:33 Comment(5)

Why would you want to? If you need n+x space, set the size to n+x. – Detrusion 15/10, 2016 at 7:11

@ChrisDodd: I need vectors of size n. I just want to make some sort of loop unrolling without making sure to not overshoot (or having to check if there is anything left to be processed in a non-unrolled fashion). – Flivver 15/10, 2016 at 12:58

What about just using reserve (prior to your unrolling op, e.g. reserve(n+x)) instead of messing up with the allocator?. And if you're not storing anything useful after size, why i<(n+x) instead of simply i<n? – Claudell 15/10, 2016 at 13:40

@FranMowinckel: Because I do not want the possible reallocation overhead and I would like to hide that as an implementation detail. Yes i<n is exactly equivalent in this case. That code is not actual code but just an illustration. – Flivver 15/10, 2016 at 14:14

@Flivver - if you don't want special handling for the tail of the loop, the way most high-performance implementations deal with this today is to (a) ensure the start of the region is reasonably aligned (e.g., to a 16-byte boundary for SSE, or 32-byte for AVX), and then (b) to read beyond the allocated region in the last iteration, ignoring any data outside the allocated region. This is safe in practice, on particular platforms/compilers, and widely used. I added more details in my answer below. – Pickering 30/10, 2016 at 1:37

C

1

For your use case you shouldn't have any doubts. However, if you decide to store anything useful in the extra space and will allow the size of your vector to change during its lifetime, you will probably run into problems dealing with the possibility of reallocation - how are you going to transfer the extra data from the old allocation to the new allocation given that reallocation happens as a result of separate calls to allocate() and deallocate() with no direct connection between them?

EDIT (addressing the code added to the question)

In my original answer I meant that you shouldn't have any problem accessing the extra bytes allocated by your allocator in excess of what was requested. However, writing data in the memory range, that is outside the range currently utilized by the vector object but belongs to the range that would be spanned by the unmodified allocation, asks for trouble. An implementation of std::vector is free to request from the allocator more memory than would be exposed through its size()/capacity() functions and store auxiliary data in the unused area. Though this is highly theoretical, not accounting for that possibility means opening a door into undefined behavior.

Consider the following possible layout of the vector's allocation:

---====================++++++++++------.........

=== - used capacity of the vector
+++ - unused capacity of the vector
--- - overallocated by the vector (but not shown as part of its capacity)
... - overallocated by your allocator

You MUST NOT write anything in the regions 2 (---) and 3 (+++). All your writes must be constrained to the region 4 (...), otherwise you may corrupt important bits.

Cystic answered 15/10, 2016 at 5:37 Comment(2)

I do not want to actually store something there. If what I assume is correct, I would (and hopefully could) make sure that I can handle or prevent the reallocation case but for the time being that is not the intent. Currently I want to perform loop unrolling. – Flivver 15/10, 2016 at 13:16

@Flivver See updated answer - in fact the way you are going to unroll the loop contains UB threats. – Cystic 15/10, 2016 at 14:12

P

3

Stepping back a moment, if the problem you are trying to solve is to allow the underlying memory to be processed effectively by SIMD intrinsics or unrolled loops, or both, you don't necessarily need to allocate memory beyond the used amount just to "round off" the allocation size to a multiple of vector width.

There are various approaches used to handle this situation, and you mentioned a couple, such as special lead-in and lead-out code to handle the leading and trailing portions. There are actually two distinct problems here - handling the fact the data isn't a multiple of the vector width, and handling (possibly) unaligned starting addresses. Your over-allocation method is tackling the first issue - but there's probably a better way...

Most SIMD code in practice can simply read beyond the end of the processed region. Some might argue that this is technically UB - but when using SIMD intrinsics you are already venturing beyond the walls of Standard C++. In fact, this technique is already widely used in the standard library and so it is implicitly endorsed by compiler and library maintainers. It is also a standard method for handling SIMD codes in general, so you can be pretty sure it's not going to suddenly break.

They key to making it work is the observation that if you can validly read even a single byte at some location N, then any a naturally aligned read of any size¹ won't trigger a fault. Of course, you still need to ignore or otherwise handle the data you read beyond the end of the officially allocated area - but you'll need to do that anyway with your "allocate extra" approach, right? Depending on the algorithm, you may mask away the invalid data, or exclude invalid data after the SIMD portion is done (i.e., if you are searching for a byte, if you find a byte after the allocated area, it's the same as "not found").

To make this work, you need to be reading in an aligned fashion, but that's probably something you already want to do I think. You can either arrange to have your memory allocated aligned in the first place, or do an overlapping read at the start (i.e., one unaligned read first, then all aligned with the first aligned read overlapping the unaligned portion), or use the same trick as the tail to read before the array (with the same reasoning as to why this is safe). Furthermore, there are various tricks to request aligned memory without needing to write your own allocator.

Overall, my recommendation is to try to avoid writing a custom allocator. Unless the code is fairly tightly contained, you may run into various pitfalls, including other code making wrong assumptions about how your memory was allocated and the various other pitfalls Leon mentions in his answer. Furthermore, using a custom allocator disables a bunch of optimizations used by the standard container algorithms, unless you use it everywhere, since many of them apply only to containers using the same allocator.

Furthermore, when I was actually implementing custom allocators² , I found that it was a nice idea in theory, but a bit too obscure to be well-supported in an identical fashion across all the compilers. Now the compilers have become a lot more compliant over time (I'm looking mostly at you, Visual Studio), and template support has also improved, so perhaps that's not an issue, but I feel it still falls into the category of "do it only if you must".

Keep in mind also that custom allocators don't compose well - you only get the one! If someone else on your project wants to use a custom allocator for your container for some other reason, they won't be able to do it (although you could coordinate and create a combined allocator).

This question I asked earlier - also motivated by SIMD - covers a lot of the ground about the safety of reading past the end (and, implicitly, before the beginning), and is probably a good place to start if you are considering this.

¹ Technically, the restriction is any aligned read up to the page size, which at 4K or larger is plenty for any of the current vector-oriented general purpose ISAs.

² In this case, I was doing it not for SIMD, but basically to avoid malloc() and to allow partially on-stack and contiguous fast allocations for containers with many small nodes.

Pickering answered 30/10, 2016 at 1:28 Comment(0)

C

1

For your use case you shouldn't have any doubts. However, if you decide to store anything useful in the extra space and will allow the size of your vector to change during its lifetime, you will probably run into problems dealing with the possibility of reallocation - how are you going to transfer the extra data from the old allocation to the new allocation given that reallocation happens as a result of separate calls to allocate() and deallocate() with no direct connection between them?

EDIT (addressing the code added to the question)

In my original answer I meant that you shouldn't have any problem accessing the extra bytes allocated by your allocator in excess of what was requested. However, writing data in the memory range, that is outside the range currently utilized by the vector object but belongs to the range that would be spanned by the unmodified allocation, asks for trouble. An implementation of std::vector is free to request from the allocator more memory than would be exposed through its size()/capacity() functions and store auxiliary data in the unused area. Though this is highly theoretical, not accounting for that possibility means opening a door into undefined behavior.

Consider the following possible layout of the vector's allocation:

---====================++++++++++------.........

=== - used capacity of the vector
+++ - unused capacity of the vector
--- - overallocated by the vector (but not shown as part of its capacity)
... - overallocated by your allocator

You MUST NOT write anything in the regions 2 (---) and 3 (+++). All your writes must be constrained to the region 4 (...), otherwise you may corrupt important bits.

Cystic answered 15/10, 2016 at 5:37 Comment(2)

I do not want to actually store something there. If what I assume is correct, I would (and hopefully could) make sure that I can handle or prevent the reallocation case but for the time being that is not the intent. Currently I want to perform loop unrolling. – Flivver 15/10, 2016 at 13:16

@Flivver See updated answer - in fact the way you are going to unroll the loop contains UB threats. – Cystic 15/10, 2016 at 14:12

Recommended topics

Hot tags