How to solve the 32-byte-alignment issue for AVX load/store operations?

Asked 16/9, 2015 at 14:57 Answered 16/9, 2015 at 17:35

I am having alignment issue while using ymm registers, with some snippets of code that seems fine to me. Here is a minimal working example:

#include <iostream> 
#include <immintrin.h>

inline void ones(float *a)
{
     __m256 out_aligned = _mm256_set1_ps(1.0f);
     _mm256_store_ps(a,out_aligned);
}

int main()
{
     size_t ss = 8;
     float *a = new float[ss];
     ones(a);

     delete [] a;

     std::cout << "All Good!" << std::endl;
     return 0;
}

Certainly, sizeof(float) is 4 on my architecture (Intel(R) Xeon(R) CPU E5-2650 v2 @ 2.60GHz) and I'm compiling with gcc using -O3 -march=native flags. Of course the error goes away with unaligned memory access i.e. specifying _mm256_storeu_ps. I also do not have this problem on xmm registers, i.e.

inline void ones_sse(float *a)
{
     __m128 out_aligned = _mm_set1_ps(1.0f);
     _mm_store_ps(a,out_aligned);
}

Am I doing anything foolish? what is the work-around for this?

Past answered 16/9, 2015 at 14:57 Comment(6)

A bit off topic, but remember to use delete [] when deleting something allocated with new []. – Mai 16/9, 2015 at 15:4

did you try _mm_malloc instead of new? – Keesee 16/9, 2015 at 15:16

I guess a simple summary would be because new/malloc return 16-byte aligned pointer on x64; it's enough for SSE, but AVX needs 32-byte alignment. – Maxine 16/9, 2015 at 15:44

Relevant: stackoverflow.com/questions/12055822/… (addresses 16 byte SSE alignment but answers are easily adapted for 32 byte AVX alignment). – Isomerism 16/9, 2015 at 15:50

Perhaps this is interesting too: stackoverflow.com/questions/16376942/… – Maxine 16/9, 2015 at 16:19

can you try align it yourself, e.g. allocate 128 bytes and make second pointer points whatever you need inside larger buffer? Just to see if it will work. – Edelstein 1/5, 2017 at 9:9

Yes, you can use _mm256_loadu_ps / storeu for unaligned loads/stores (AVX: data alignment: store crash, storeu, load, loadu doesn't). If the compiler doesn't do a bad job (cough GCC default tuning), AVX _mm256_loadu/storeu on data that happens to be aligned is just as fast as alignment-required load/store, so aligning data when convenient still gives you the best of both worlds for functions that normally run on aligned data but let hardware handle the rare cases where they don't. (Instead of always running extra instructions to check stuff).

Alignment is especially important for 512-bit AVX-512 vectors, like 15 to 20% speed on SKX even over large arrays where you'd expect L3 / DRAM bandwidth to be the bottleneck, vs. a few percent with AVX2 CPUs for large arrays. (It can still matter significantly with AVX2 on modern CPUs if your data is hot in L2 or especially L1d cache, especially if you can come close to maxing out 2 loads and/or 1 store per clock. Cache-line splits cost about twice the throughput resources, plus needing a line-split buffer temporarily.)

The standard allocators normally only align to alignof(max_align_t), which is often 16B, e.g. long double in the x86-64 System V ABI. But in some 32-bit ABIs it's only 8B, so it's not even sufficient for dynamic allocation of aligned __m128 vectors and you'll need to go beyond simply calling new or malloc.

Static and automatic storage are easy: use alignas(32) float arr[N];

C++17 provides aligned new for aligned dynamic allocation. If alignof for a type is greater than the standard alignment, then aligned operator new/operator delete are used. So new __m256[N] just works in C++17 (if compiler supports this C++17 feature; check __cpp_aligned_new feature macro). In practice, GCC / clang / MSVC / ICX support it, ICC 2021 doesn't.

float *arr = new (std::align_val_t(32)) float[size];  // C++17

Without that C++17 feature, even stuff like std::vector<__m256> will break, not just std::vector<int>, unless you get lucky and it happens to be aligned by 32.

Plain-`delete` compatible allocation of a `float` / `int` array:

Unfortunately, auto* arr = new alignas(32) float[numSteps] does not work for all compilers, as alignas is applicable to a variable, a member, or a class declaration, but not as type modifier. (GCC accepts using vfloat = alignas(32) float;, so this does give you an aligned new that's compatible with ordinary delete on GCC).

Workarounds are either wrapping in a structure (struct alignas(32) s { float v; }; new s[numSteps];) or passing alignment as placement parameter (new (std::align_val_t(32)) float[numSteps];), in later case be sure to call matching aligned operator delete.

See documentation for new/new[] and std::align_val_t

Other options, incompatible with `new`/`delete`

Other options for dynamic allocation are mostly compatible with malloc/free, not new/delete:

std::aligned_alloc: ISO C++17. major downside: size must be a multiple of alignment. This braindead requirement makes it inappropriate for allocating a 64B cache-line aligned array of an unknown number of floats, for example. Or especially a 2M-aligned array to take advantage of transparent hugepages.

The C version of aligned_alloc was added in ISO C11. It's available in some but not all C++ compilers. As noted on the cppreference page, the C11 version wasn't required to fail when size isn't a multiple of alignment (it's undefined behaviour), so many implementations provided the obvious desired behaviour as an "extension". Discussion is underway to fix this, but for now I can't really recommend aligned_alloc as a portable way to allocate arbitrary-sized arrays. In practice some implementations work fine in the UB / required-to-fail cases so it can be a good non-portable option.

Also, commenters report it's unavailable in MSVC++. See best cross-platform method to get aligned memory for a viable #ifdef for Windows. But AFAIK there are no Windows aligned-allocation functions that produce pointers compatible with standard free.
posix_memalign: Part of POSIX 2001, not any ISO C or C++ standard. Clunky prototype/interface compared to aligned_alloc. I've seen gcc generate reloads of the pointer because it wasn't sure that stores into the buffer didn't modify the pointer. (posix_memalign is passed the address of the pointer, defeating escape analysis.) So if you use this, copy the pointer into another C++ variable that hasn't had its address passed outside the function.

#include <stdlib.h>
int posix_memalign(void **memptr, size_t alignment, size_t size);  // POSIX 2001
void *aligned_alloc(size_t alignment, size_t size);                // C11 (and ISO C++17)

_mm_malloc: Available on any platform where _mm_whatever_ps is available, but you can't pass pointers from it to free. On many C and C++ implementations _mm_free and free are compatible, but it's not guaranteed to be portable. (And unlike the other two, it will fail at run-time, not compile time.) On MSVC on Windows, _mm_malloc uses _aligned_malloc, which is not compatible with free; it crashes in practice.
Directly use system calls like mmap or VirtualAlloc. Appropriate for large allocations, and the memory you get is by definition page-aligned (4k, and perhaps even 2M largepage). Not compatible with free; you of course have to use munmap or VirtualFree which need the size as well as address. (For large allocations you usually want to hand memory back to the OS when you're done, rather than manage a free-list; glibc malloc uses mmap/munmap directly for malloc/free of blocks over a certain size threshold.)

Major advantage: you don't have to deal with C++'s and C's braindead refusal provide grow/shrink facilities for aligned allocators. If you want space for another 1MiB after your allocation, you can even use Linux's mremap(MREMAP_MAYMOVE) to let it pick a different place in virtual address space (if needed) for the same physical pages, without having to copy anything. Or if it doesn't have to move, the TLB entries for the currently in use part stay valid.

And since you're using OS system calls anyway (and know you're working with whole pages), you can use madvise(MADV_HUGEPAGE) to hint that transparent hugepages are preferred, or that they're not, for this range of anonymous pages. You can also use allocation hints with mmap e.g. for the OS to prefault the zero pages, or if mapping a file on hugetlbfs, to use 2M or 1G pages. (If that kernel mechanism still works).

And with madvise(MADV_FREE), you can keep it mapped, but let the kernel reclaim the pages as memory pressure occurs, making it like lazilly allocated zero-backed pages if that happens. So if you do reuse it soon, you may not suffer fresh page faults. But if you don't, you're not hogging it, and when you do read it, it's like a freshly mmapped region.

`alignas()` with arrays / structs

In C++11 and later: use alignas(32) float avx_array[1234] as the first member of a struct/class member (or on a plain array directly) so static and automatic storage objects of that type will have 32B alignment. std::aligned_storage documentation has an example of this technique to explain what std::aligned_storage does.

This doesn't actually work until C++17 for dynamically-allocated storage (like a std::vector<my_class_with_aligned_member_array>), see Making std::vector allocate aligned memory.

Starting in C++17, the compiler will pick aligned new for types with alignment enforced by alignas on the whole type or its member, also std::allocator will pick aligned new for such type, so nothing to worry about when creating std::vector of such types.

And finally, the last option is so bad it's not even part of the list: allocate a larger buffer and do p+=31; p&=~31ULL with appropriate casting. Too many drawbacks (hard to free, wastes memory) to be worth discussing, since aligned-allocation functions are available on every platform that support Intel _mm256_... intrinsics. But there are even library functions that will help you do this, IIRC, if you insist.

The requirement to use _mm_free instead of free probably exists in part for the possibility of implementing _mm_malloc on top of a plain old malloc using this technique. Or for an aligned allocator using an alternate free-list.

Gyp answered 16/9, 2015 at 15:27 Comment(20)

Could you please explain why you prefer POSIX-only function over platform independent _mm_malloc? – Maxine 16/9, 2015 at 15:38

Isn't _mm_malloc an informally-supported, un-standardized Intel extension? How could that be more platform independent than POSIX? – Emera 16/9, 2015 at 15:49

@stgatilov: The main advantage is that you can free them with free. If you have code that works with any alignment, but is faster with 32B alignment, then you can do an aligned alloc where convenient, so usually you're getting the fast case. Also, aligned_alloc is ISO C11, so it should be available everywhere (when compilers catch up, anyway). There's only one major non-POSIX x86 OS, so I guess you're thinking of MSVC. Does it not have either of those functions? I assumed MSVC would support as much of POSIX as they easily could, just not the system calls that don't map to Windows. – Gyp 16/9, 2015 at 15:56

@Useless: If you're using _mm_whatever intrinsics for SSE / AVX / other instructions, you will also have _mm_malloc available. If keeping your aligned allocs separate from your unaligned allocs isn't a problem, or you can just use _mm_malloc / _mm_free everywhere in your program, and don't interact with any libraries that allocate or free anything, then that's a valid option, too. – Gyp 16/9, 2015 at 15:58

@PeterCordes aligned_alloc looks best of the lot to me. Is there any general consensus on which one, one should use? – Past 16/9, 2015 at 16:8

@PeterCordes: I'm afraid MSVC is still not compatible. They have mostly abandoned compatibility with C standard. As for C++, aligned_alloc is not there. Personally I mostly use C++ because I like templates =) – Maxine 16/9, 2015 at 16:14

@Maxine I solely use C++ too and suffices to say, I'm having no issue compiling aligned_alloc with gcc 4.8.4 without #define-ing any macros, from the link you posted. If I am right, this seems to be a permanent feature in C++ as well. – Past 16/9, 2015 at 16:29

@romeric: That link is for C11, not C++11. Note the /c/ in the URL, not /cpp. If you ever need to port your code to MSVC, there's probably a similar function you can use with an #ifdef. I think aligned_alloc is probably the best choice, as long as it's not too new for any compilers you care about supporting. – Gyp 16/9, 2015 at 16:33

Since you mention C++17: alignas+dynamic allocation was finally fixed there. – Acicular 1/5, 2017 at 7:42

@PeterCordes: Thanks for this answer. Do you know if it is possible somehow to create aligned vector of basic types like std::vector<double> with a built-in allocator or similar? Something like std::vector<double, alignas(32)>? – Tunny 2/8, 2019 at 9:5

@matejk: I'm not sure if you have to write your own allocator or if there's already a template allocator you can customize. I'm totally unimpressed with C++ as far as alignment support for dynamic allocation, or exposing efficient realloc or calloc for std::vector to take advantage of. It's also just ridiculous how bad it is, and that it took until C++17 for new __m256[] to even work. I don't get WTF is so hard about making alignment a template parameter that becomes part of the type. And even C is missing a portable aligned realloc or calloc, AFAIK. – Gyp 2/8, 2019 at 9:13

In C++17, alignas just works. You just say new T for type with alignment enforced by alignas to be greater than __STDCPP_DEFAULT_NEW_ALIGNMENT__, and aligned form of operator new is called. std::allocator also vary of this, and calls aligned operator new when needed. – Kass 11/9, 2021 at 12:39

also new (std::align_val_t(32)) float[numSteps]; is not a good idea. Standard says that operator delete must be corresponding: eel.is/c++draft/new.delete.single#11 But using delete expression you can only call default delete (aligned according to type alignment, unaligned otherwise). Guess on which implementation unaligned delete on alinged new actually does crash? Sure, you can have std::destroy_n(p, numSteps); operator delete(p, std::align_val_t(32));, but for normal delete expressions to work everywhere, should only rely on align_val_t passed automatically. – Kass 11/9, 2021 at 13:10

@AlexGuteniev: Hrm, thanks. So new alignas(32) float [numSteps]? That works with GCC, but doesn't compile with clang, and with ICC doesn't pass an alignment request to the call to new. godbolt.org/z/WW7jd4Wra. I see that new __m256 Just Works on GCC and clang, but not ICC 2021. – Gyp 11/9, 2021 at 13:54

clang is right here. You can say struct alingas(32) s { float v; }; new s[numSteps];, but standard does not ask for new alignas(32) float to work: eel.is/c++draft/dcl.align#1 . GCC behavior looks like an extension to me, ICC behavior looks like a bug – Kass 11/9, 2021 at 14:17

@AlexGuteniev: I may look at this again later, or if you want to edit my answer again to say something about how to allocate an aligned array of float in a way that's compatible with delete (or a note about it being impossible to do portably/safely if that's the case), that would be great. – Gyp 11/9, 2021 at 14:37

Did it. Regarding ICC: it does not claim to support the feature: godbolt.org/z/cce4oGdj4 – Kass 11/9, 2021 at 15:21

@AlexGuteniev: Thanks for the edit, looks good. I did add back in the mention that ICC doesn't support it, and explicit mention of the other three major x86 compilers. Since this Q&A is about AVX, the set of relevant compilers is much smaller than for portable ISO C++ code / features. A run-down on the real world status of the big 4 compilers will save future readers time, or save them from having to make assumptions if they don't check. – Gyp 11/9, 2021 at 20:34

Maybe paged allocation also worth mentioning (Windows VirtualAlloc/POSIX mmap)? Overhead for wasted page remainder may not be significant for large array, or may even be zero for multiple-of-page sized arrays. (This is what I actually do in my code. I use large producer-consumer queues, in-place SSE/AVX load/store intrinsics on queue data parts, also reserved pages at edges add free bounds protection) – Kass 12/9, 2021 at 9:47

@AlexGuteniev: Good point; added a section about that, including some nice stuff you can do with Linux system calls like mremap that braindead C and C++ allocators refuse to expose APIs for. (Like zero-copy aligned realloc.) And also madvise for transparent hugepage hints. – Gyp 12/9, 2021 at 11:30

There are the two intrinsics for memory management. _mm_malloc operates like a standard malloc, but it takes an additional parameter that specifies the desired alignment. In this case, a 32 byte alignment. When this allocation method is used, memory must be freed by the corresponding _mm_free call.

float *a = static_cast<float*>(_mm_malloc(sizeof(float) * ss , 32));
...
_mm_free(a);

Keesee answered 16/9, 2015 at 15:20 Comment(0)

You'll need aligned allocators.

But there isn't a reason you can't bundle them up:

template<class T, size_t align>
struct aligned_free {
  void operator()(T* t)const{
    ASSERT(!(uint_ptr(t) % align));
    _mm_free(t);
  }
  aligned_free() = default;
  aligned_free(aligned_free const&) = default;
  aligned_free(aligned_free&&) = default;
  // allow assignment from things that are
  // more aligned than we are:
  template<size_t o,
    std::enable_if_t< !(o % align) >* = nullptr
  >
  aligned_free( aligned_free<T, o> ) {}
};
template<class T>
struct aligned_free<T[]>:aligned_free<T>{};

template<class T, size_t align=1>
using mm_ptr = std::unique_ptr< T, aligned_free<T, align> >;
template<class T, size_t align>
struct aligned_make;
template<class T, size_t align>
struct aligned_make<T[],align> {
  mm_ptr<T, align> operator()(size_t N)const {
    return mm_ptr<T, align>(static_cast<T*>(_mm_malloc(sizeof(T)*N, align)));
  }
};
template<class T, size_t align>
struct aligned_make {
  mm_ptr<T, align> operator()()const {
    return aligned_make<T[],align>{}(1);
  }
};
template<class T, size_t N, size_t align>
struct aligned_make<T[N], align> {
  mm_ptr<T, align> operator()()const {
    return aligned_make<T[],align>{}(N);
  }
}:
// T[N] and T versions:
template<class T, size_t align>
auto make_aligned()
-> std::result_of_t<aligned_make<T,align>()>
{
  return aligned_make<T,align>{}();
}
// T[] version:
template<class T, size_t align>
auto make_aligned(size_t N)
-> std::result_of_t<aligned_make<T,align>(size_t)>
{
  return aligned_make<T,align>{}(N);
}

now mm_ptr<float[], 4> is a unique pointer to an array of floats that is 4 byte aligned. You create it via make_aligned<float[], 4>(20), which creates 20 floats 4-byte aligned, or make_aligned<float[20], 4>() (compile-time constant only in that syntax). make_aligned<float[20],4> returns mm_ptr<float[],4> not mm_ptr<float[20],4>.

A mm_ptr<float[], 8> can move-construct a mm_ptr<float[],4> but not vice-versa, which I think is nice.

mm_ptr<float[]> can take any alignment, but guarantees none.

Overhead, like with a std::unique_ptr, is basically zero per pointer. Code overhead can be minimized by aggressive inlineing.

Joleen answered 16/9, 2015 at 17:35 Comment(1)

@Past from more to less – Joleen 16/9, 2015 at 22:3

Hot tags

Godot Unity Godot Help Programming Godot 4.X GUI GDScript 3D 2D Physics CSharp Godot 3.X VR XR Projects C++

Plain-delete compatible allocation of a float / int array:

Other options, incompatible with new/delete

alignas() with arrays / structs

Recommended topics

Hot tags

Plain-`delete` compatible allocation of a `float` / `int` array:

Other options, incompatible with `new`/`delete`

`alignas()` with arrays / structs