Modern approach to making std::vector allocate aligned memory

The following question is related; however, its answers are old, and a comment from user Marc Glisse suggests that there have been new approaches to this problem since C++17 that might not be adequately discussed.

I'm trying to get aligned memory working properly for SIMD, while still having access to all of the data.

On Intel, if I create a vector whose element type is __m256 and reduce the element count by a factor of 8, I get aligned memory.

E.g. std::vector<__m256> mvec_a((N*M)/8);

In a slightly hacky way, I can cast pointers to vector elements to float, which allows me to access individual float values.

Instead, I would prefer to have a std::vector<float> that is correctly aligned, and thus can be loaded into __m256 and other SIMD types without segfaulting.

I've been looking into aligned_alloc.

This can give me a C-style array that is correctly aligned:

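// Requires <cstdlib>. The size should be a multiple of the alignment, and the memory must be released with std::free.
constexpr std::size_t align_sz = 32;
float* marr_a = static_cast<float*>(std::aligned_alloc(align_sz, N*M*sizeof(float)));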
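
For reference, here is a minimal sketch of how such a C-style buffer could at least be given an owner, assuming a platform that provides C++17 std::aligned_alloc (MSVC, for example, does not). The names FreeDeleter, AlignedFloats, make_aligned_floats and is_aligned are made up for this illustration and do not come from any library. It rounds the byte count up to a multiple of the alignment and frees the memory with std::free:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <memory>
#include <new>

// Hypothetical deleter: memory from std::aligned_alloc is released with std::free.
struct FreeDeleter {
    void operator()(void* p) const noexcept { std::free(p); }
};

using AlignedFloats = std::unique_ptr<float[], FreeDeleter>;

// Hypothetical helper: returns an owning pointer to `count` floats whose
// address is a multiple of `alignment` bytes (32 is assumed here for AVX).
// Overflow checking of count * sizeof(float) is omitted for brevity.
inline AlignedFloats make_aligned_floats(std::size_t count, std::size_t alignment = 32)
{
    // The alignment must be a nonzero power of two.
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);

    // Round the byte count up to a multiple of the alignment.
    std::size_t bytes = count * sizeof(float);
    bytes = (bytes + alignment - 1) / alignment * alignment;
    if (bytes == 0) {
        bytes = alignment;  // avoid a zero-sized request
    }

    void* raw = std::aligned_alloc(alignment, bytes);
    if (raw == nullptr) {
        throw std::bad_alloc();
    }
    return AlignedFloats(static_cast<float*>(raw));
}

// Returns true if the address is a multiple of `alignment`.
inline bool is_aligned(const void* p, std::size_t alignment)
{
    return reinterpret_cast<std::uintptr_t>(p) % alignment == 0;
}
```

This only gives ownership and automatic cleanup. It is still not a std::vector<float>, so there is no resizing and no container interface.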

However I'm unsure how to do this for std::vector<float>. Giving the std::vector<float> ownership of marr_a doesn't seem to be possible.

I've seen some suggestions that I should write a custom allocator, but this seems like a lot of work, and perhaps with modern C++ there is a better way?

Loveinidleness answered 11/2, 2020 at 13:19 Comment(3)
without segfaulting... or without potential slowdowns from cache-line splits when you use _mm256_loadu_ps(&vec[i]). (Although note that with default tuning options, GCC splits not-guaranteed-aligned 256-bit loads/stores into vmovups xmm / vinsertf128. So there is an advantage to using _mm256_load over loadu if you care about how your code compiles on GCC if someone forgets to use -mtune=... or -march= options.)Therapist
@PrunusPersica Did you end up getting this to work? I have the same problem. We can work together if you wish.Iand
@Iand I ended up using the code of boost::alignment::aligned_allocator. Then I could allocate the vector with std::vector<T, aligned_allocator<T>>. It does make normal std::vectors not directly compatible with this type of aligned vector, but you can always write ways around that.Loveinidleness

STL containers take an allocator template argument which can be used to align their internal buffers. The specified allocator type has to implement at least allocate, deallocate, and value_type.

In contrast to these answers, this implementation of such an allocator avoids platform-dependent aligned malloc calls. Instead, it uses the C++17 aligned new operator.

Here is the full example on godbolt.

#include <cstddef>
#include <limits>
#include <new>

/**
 * Returns aligned pointers when allocations are requested. Default alignment
 * is 64B = 512b, sufficient for AVX-512 and most cache line sizes.
 *
 * @tparam ALIGNMENT_IN_BYTES Must be a positive power of 2.
 */
template<typename    ElementType,
         std::size_t ALIGNMENT_IN_BYTES = 64>
class AlignedAllocator
{
private:
    static_assert(
        ALIGNMENT_IN_BYTES >= alignof( ElementType ),
        "Beware that types like int have minimum alignment requirements "
        "or access will result in crashes."
    );

public:
    using value_type = ElementType;
    static std::align_val_t constexpr ALIGNMENT{ ALIGNMENT_IN_BYTES };

    /**
     * This is only necessary because AlignedAllocator has a second template
     * argument for the alignment that will make the default
     * std::allocator_traits implementation fail during compilation.
     * @see https://mcmap.net/q/586587/-create-the-simplest-allocator-with-two-template-arguments
     */
    template<class OtherElementType>
    struct rebind
    {
        using other = AlignedAllocator<OtherElementType, ALIGNMENT_IN_BYTES>;
    };

public:
    constexpr AlignedAllocator() noexcept = default;

    constexpr AlignedAllocator( const AlignedAllocator& ) noexcept = default;

    template<typename U>
    constexpr AlignedAllocator( AlignedAllocator<U, ALIGNMENT_IN_BYTES> const& ) noexcept
    {}

    [[nodiscard]] ElementType*
    allocate( std::size_t nElementsToAllocate )
    {
        if ( nElementsToAllocate
             > std::numeric_limits<std::size_t>::max() / sizeof( ElementType ) ) {
            throw std::bad_array_new_length();
        }

        auto const nBytesToAllocate = nElementsToAllocate * sizeof( ElementType );
        return reinterpret_cast<ElementType*>(
            ::operator new[]( nBytesToAllocate, ALIGNMENT ) );
    }

    void
    deallocate(                  ElementType* allocatedPointer,
                [[maybe_unused]] std::size_t  nElementsAllocated )
    {
        /* According to the C++20 draft n4868 § 17.6.3.3, the delete operator
         * must be called with the same alignment argument as the new expression.
         * The size argument can be omitted but if present must also be equal to
         * the one used in new. */
        ::operator delete[]( allocatedPointer, ALIGNMENT );
    }
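};

/* Allocators must be equality comparable. This one is stateless, so memory
 * allocated by one instance can be freed by any other instance that uses the
 * same alignment. */
template<typename T, typename U, std::size_t ALIGNMENT_IN_BYTES>
constexpr bool
operator==( AlignedAllocator<T, ALIGNMENT_IN_BYTES> const&,
            AlignedAllocator<U, ALIGNMENT_IN_BYTES> const& ) noexcept
{
    return true;
}

template<typename T, typename U, std::size_t ALIGNMENT_IN_BYTES>
constexpr bool
operator!=( AlignedAllocator<T, ALIGNMENT_IN_BYTES> const&,
            AlignedAllocator<U, ALIGNMENT_IN_BYTES> const& ) noexcept
{
    return false;
}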

This allocator can then be used like this:

#include <cstddef>
#include <cstdint>
#include <iostream>
#include <stdexcept>
#include <vector>

template<typename T, std::size_t ALIGNMENT_IN_BYTES = 64>
using AlignedVector = std::vector<T, AlignedAllocator<T, ALIGNMENT_IN_BYTES> >;

int
main()
{
    AlignedVector<int, 1024> buffer( 3333 );
    if ( reinterpret_cast<std::uintptr_t>( buffer.data() ) % 1024 != 0 ) {
        std::cerr << "Vector buffer is not aligned!\n";
        throw std::logic_error( "Faulty implementation!" );
    }

    std::cout << "Successfully allocated an aligned std::vector.\n";
    return 0;
}
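
As a further sketch (not part of the godbolt example), the following shows two things that come up in the comments: the buffer should stay aligned after the vector reallocates, and data can still be exchanged with a plain std::vector<float> even though the two vector types differ. SketchAlignedAllocator is a compact stand-in written only for this illustration, and to_aligned, sum and is_aligned are hypothetical helper names. The alignment check relies on std::vector placing its first element at the start of the allocation, which known implementations do but which the standard arguably does not spell out.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <new>
#include <vector>

// Compact stand-in for an aligned allocator (illustrative only).
template<typename T, std::size_t Alignment = 64>
struct SketchAlignedAllocator {
    static_assert(Alignment >= alignof(T), "alignment too small for T");
    using value_type = T;

    // Needed because of the non-type second template parameter.
    template<typename U>
    struct rebind { using other = SketchAlignedAllocator<U, Alignment>; };

    SketchAlignedAllocator() noexcept = default;
    template<typename U>
    SketchAlignedAllocator(const SketchAlignedAllocator<U, Alignment>&) noexcept {}

    T* allocate(std::size_t n) {
        if (n > static_cast<std::size_t>(-1) / sizeof(T)) {
            throw std::bad_array_new_length();
        }
        return static_cast<T*>(::operator new(n * sizeof(T), std::align_val_t{Alignment}));
    }
    void deallocate(T* p, std::size_t) noexcept {
        ::operator delete(p, std::align_val_t{Alignment});
    }
};

template<typename T, typename U, std::size_t A>
bool operator==(const SketchAlignedAllocator<T, A>&, const SketchAlignedAllocator<U, A>&) noexcept { return true; }
template<typename T, typename U, std::size_t A>
bool operator!=(const SketchAlignedAllocator<T, A>&, const SketchAlignedAllocator<U, A>&) noexcept { return false; }

template<typename T, std::size_t A = 64>
using SketchAlignedVector = std::vector<T, SketchAlignedAllocator<T, A>>;

// The two vector types are distinct, but elements can be copied across
// with the iterator-range constructor.
inline SketchAlignedVector<float, 32> to_aligned(const std::vector<float>& src) {
    return SketchAlignedVector<float, 32>(src.begin(), src.end());
}

// Code that takes pointer + length works with either vector type.
inline float sum(const float* data, std::size_t n) {
    float total = 0.0f;
    for (std::size_t i = 0; i < n; ++i) {
        total += data[i];
    }
    return total;
}

// Returns true if the address is a multiple of `alignment`.
inline bool is_aligned(const void* p, std::size_t alignment) {
    return reinterpret_cast<std::uintptr_t>(p) % alignment == 0;
}
```

Passing data() and size() (or a std::span in C++20) to functions is one way to keep most code independent of the allocator type.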
Ramon answered 5/2, 2022 at 0:11 Comment(10)
C++17 supports over-aligned dynamic allocations, e.g. std::vector<__m256i> should Just Work. Is there no way to take advantage of that, instead of using ugly hacks that over-allocate and then leave part of the allocation unused?Therapist
@PeterCordes I think this is more of a code-style issue than a performance issue because the overhead, e.g. 511 B, will be smaller than 1% in most cases. Of course, you can simply use something like reinterpret_cast<ElementType*>( new __m256i[ nBytesToAllocate / sizeof( __m256i ) ] ) as long as the 256-bit alignment is what you want. Using a dummy struct might be more portable though: struct DummyAligned{ alignas( 512 ) char dummy[512]; };. But note that this will also result in overallocation if your vector size is not a multiple of the alignment...Ramon
It's also extra bookkeeping to keep track of the address to free, separately from the address you're using. That's the main reason I don't like it.Therapist
@PeterCordes Ok, that is totally understandable. After further experimentation and reading, I changed my answer to use the C++17 aligned new/delete operators instead.Ramon
@user17732522 I'm already checking for trivial types with the static asserts. I got the new/delete from here. I'm pretty sure it should fit? The new expression also just calls the new operator under the hood (and additionally calls the constructor), afaik. I didn't wanna use operator new because then I would have to convert the number of elements into number of bytes again with all the required overflow checking for that.Ramon
@user17732522 Thanks for the suggestion. I'm now using the new operator instead and can therefore remove the static_asserts and reduce the code a bit more. The godbolt link also contains an example with an object with a custom constructor and destructor.Ramon
I think the allocator is now correct, but unfortunately as mentioned in the other answer's comments, I don't think that it is guaranteed that std::vector will actually place its elements at the beginning of the allocation, which could mess up the alignment. (But I don't think any implementation implements vector that way.)Diverse
@Diverse Interesting edge case. I guess you would have to write your own container to be 100% sure. Or, add automated tests / asserts like I did in godbolt on the data() return value. For my usecase, if it works on all known systems, it works well enough.Ramon
MSVC doesn't like this one in debug builds; see https://mcmap.net/q/586588/-getting-weird-compiler-errors-while-compiling-my-own-quot-allocator-lt-gt-quot-with-msvc-2022-in-debug-mode/15416.Baden
@Baden Thank you for bringing this problem to my attention and also for pointing to the solution. I could reproduce the problem on godbolt by adding /MTd and fixed it by adding all three required constructors, especially the templated conversion constructor for an allocator with a different value type.Ramon

All containers in the standard C++ library, including vectors, have an optional template parameter that specifies the container's allocator, and it is not really a lot of work to implement your own one:

class my_awesome_allocator {
};

std::vector<float, my_awesome_allocator> awesomely_allocated_vector;

You will have to write a little bit of code that implements your allocator, but it wouldn't be much more code than you have already written. If you don't need pre-C++17 support, you only need to provide value_type and implement the allocate() and deallocate() methods; std::allocator_traits supplies defaults for most of the rest.
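
To make that concrete, here is one possible minimal sketch, assuming C++17 and a fixed 32-byte alignment (enough for AVX). Unlike the placeholder above, my_awesome_allocator is written as a class template here so that std::allocator_traits can rebind it without extra code. kSimdAlignment and assert_aligned are names invented for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <new>
#include <vector>

// Assumption for this sketch: 32-byte alignment, as needed for AVX loads.
constexpr std::size_t kSimdAlignment = 32;

template<typename T>
struct my_awesome_allocator {
    static_assert(alignof(T) <= kSimdAlignment, "T needs a larger alignment");

    using value_type = T;

    my_awesome_allocator() noexcept = default;

    // Lets containers convert between allocators for different value types.
    template<typename U>
    my_awesome_allocator(const my_awesome_allocator<U>&) noexcept {}

    T* allocate(std::size_t n) {
        if (n > static_cast<std::size_t>(-1) / sizeof(T)) {
            throw std::bad_array_new_length();
        }
        // C++17 aligned operator new.
        return static_cast<T*>(
            ::operator new(n * sizeof(T), std::align_val_t{kSimdAlignment}));
    }

    void deallocate(T* p, std::size_t) noexcept {
        // Must be released with the same alignment it was allocated with.
        ::operator delete(p, std::align_val_t{kSimdAlignment});
    }
};

// The allocator is stateless, so all instances compare equal.
template<typename T, typename U>
bool operator==(const my_awesome_allocator<T>&, const my_awesome_allocator<U>&) noexcept { return true; }
template<typename T, typename U>
bool operator!=(const my_awesome_allocator<T>&, const my_awesome_allocator<U>&) noexcept { return false; }

// Debug check that a pointer meets the assumed SIMD alignment.
inline void assert_aligned(const void* p) {
    assert(reinterpret_cast<std::uintptr_t>(p) % kSimdAlignment == 0);
}
```

A vector would then be declared as std::vector<float, my_awesome_allocator<float>>, and assert_aligned(v.data()) can serve as a sanity check in debug builds.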

Technicolor answered 11/2, 2020 at 13:28 Comment(11)
They also need to specialize allocator_traitsHype
This might be a good place for a canonical answer with an example that people can copy/paste to jump through C++'s annoying hoops. (Bonus points if there's a way to let std::vector try to realloc in-place instead of the usual braindead C++ always alloc+copy.) Also of course note that this vector<float, MAA> is not type-compatible with vector<float> (and can't be because anything that does .push_back on a plain std::vector<float> compiled without this allocator could do a new allocation and copy into minimally-aligned memory. And new/delete isn't compatible with aligned_alloc/free)Therapist
I don't think there is any guarantee that the pointer returned from the allocator is directly used as the base address of the std::vector's array. For example, I could imagine an implementation of std::vector using just one pointer to the allocated memory which stores the end/capacity/allocator in the memory prior to the range of values. That could easily foil the alignment done by the allocator.Glossary
Except that std::vector guarantees it. That's what it uses it for. Perhaps you should review what the C++ standard specifies here.Technicolor
> They also need to specialize allocator_traits -- No, they don't. All that's needed is to implement a compliant allocator.Anlace
> Bonus points if there's a way to let std::vector try to realloc in-place instead of the usual braindead C++ always alloc+copy. -- There is no way, except to reserve the required capacity first and then insert elements as needed. There are good reasons why realloc is not an option. realloc does not call constructors and copying raw bytes is not valid for most types. Also, realloc usefulness is over-estimated, as most of the time increasing allocation size by any considerable amount is still equivalent to malloc+memcpy+free.Anlace
I can add that there is a good implementation of aligned allocator in Boost.Align: boost.org/doc/libs/1_72_0/doc/html/align/…Anlace
And, on topic of realloc, it doesn't necessarily preserve alignment.Anlace
the existence, and use of Boost's aligned allocator is sufficient for many needs, though the dependency is unfortunate. vector<float, MAA> and vector<float> not being type compatible also unfortunate, but as long as the underlying data is still floats it's okay. writing allocators is new to me. I made an attempt here, but had a return type errorLoveinidleness
@AndreySemashev "realloc usefulness is over-estimated" - Not true. realloc on Linux calls mremap, which for large buffers is more efficient than copying.Pneumonic
@Pneumonic This is only possible if the allocated memory is large enough - one or more contiguous pages - and was allocated using raw mmap in the first place. Large allocations are pretty rare. And besides, you have to hope that your realloc implementation actually does this mremap trick, which is not guaranteed. If your performance depends on large reallocations being performed efficiently, you're better off directly using mmap/mremap instead of relying on realloc possibly doing the right thing. You might eliminate the need to allocate memory entirely and e.g. map the data from a file.Anlace
