What value of alignment should I use with mkl_malloc?

Asked 3/8, 2018 at 1:6 Answered 13/8, 2018 at 23:15

Solved c linear-algebra memory-alignment blas intel-mkl

The function mkl_malloc is similar to malloc but has an extra alignment argument. Here's the prototype:

void* mkl_malloc (size_t alloc_size, int alignment);

I've noticed different performances with different values of alignment. Apart from trial and error, is there a canonical or documented methodical way to decide on the best value of alignment? i.e. processor being used, function being called, operation being performed etc.

This question is widely applicable to anyone who uses MKL, so I'm very surprised it is not in the reference manual.

Update: I have tried mkl_sparse_spmm and have not noticed a significant difference in performance when setting the alignment to powers of 2 up to 1024 bytes; beyond that, performance tends to drop. I'm using an Intel Xeon E5-2683.

Unpleasant answered 3/8, 2018 at 1:6 Comment(7)

Does this answer to a previous question help? https://mcmap.net/q/2031424/-memory-alignment-with-mkl_malloc – Truly 3/8, 2018 at 1:28

Thanks, but no, it doesn't. I was wondering what value of alignment would give me the best performance for a given operation and hardware. – Unpleasant 3/8, 2018 at 2:8

intuitively, it would seem that the alignment should match the word size of the processor. What results are you seeing? – Truly 3/8, 2018 at 2:11

Thanks! I found that powers of 2 up to 1024 give very similar performance... But I have been trying sparse operations which are memory access bound. I shall try again with dense operations. – Unpleasant 6/8, 2018 at 3:36

In general, you can use _Alignof(max_align_t). However, it really depends on the type of data. On current x86-64, _Alignof(max_align_t) == 16, but 64 is needed for AVX512 vectors, and 32 for AVX2 vectors. – Idel 8/8, 2018 at 23:35

What operating system are you working on? – Precambrian 9/8, 2018 at 6:52

"have not noticed a significant difference in performance for setting the alignment to powers of 2 up to 1024 bytes, after that the performance tends to drop." How are you testing performance? Larger alignments like that increase the chances you get a new virtual page of memory, one that's never been accessed by your program. The actual physical mapping is often delayed until the page is first written to. Make sure your benchmarks never operate on memory pages that your program hasn't already written to. – Adenoidal 13/8, 2018 at 14:44
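The benchmarking caveat in the last comment can be sketched roughly as follows. This is only an illustration: plain malloc stands in for mkl_malloc, and sum_buffer and time_sum_warm are hypothetical names, not part of MKL. The point is that every page is written before the timer starts, so first-touch page faults are not counted.

```c
#include <assert.h>
#include <stdlib.h>
#include <time.h>

/* Hypothetical stand-in for the kernel being benchmarked. */
static double sum_buffer(const double *x, size_t n)
{
    double s = 0.0;
    for (size_t i = 0; i < n; ++i)
        s += x[i];
    return s;
}

/* Returns elapsed CPU seconds for one call of sum_buffer, or -1.0 if the
 * allocation fails. malloc is an assumption here; with MKL you would use
 * mkl_malloc/mkl_free instead. */
static double time_sum_warm(size_t n, double *result)
{
    assert(result != NULL);
    double *x = malloc(n * sizeof *x);
    if (x == NULL)
        return -1.0;

    /* Write to every element first, so the OS maps all pages
     * before the timed region begins. */
    for (size_t i = 0; i < n; ++i)
        x[i] = 1.0;

    clock_t t0 = clock();
    *result = sum_buffer(x, n);
    clock_t t1 = clock();

    free(x);
    return (double)(t1 - t0) / CLOCKS_PER_SEC;
}
```

Calling time_sum_warm several times and discarding the first measurement is a common extra precaution, though how much it matters depends on the allocator.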

Alignment only affects performance when SSE/AVX instructions can be used; this is commonly true when operating on arrays, where you wish to apply the same operation to a range of elements.

In general, you want to choose the alignment based on the CPU. If it supports AVX2, which has 256-bit registers, you want 32-byte alignment; if it supports AVX512, 64 bytes would be optimal.

To that end, mkl_malloc guarantees alignment to the value you specify. Obviously, if the data are 32-byte aligned, they are also aligned to a (16, 8, 4...)-byte boundary. The purpose of the call is to ensure this is always the case and thus avoid any potential complications.
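As a rough sketch of that rule, the helper below picks 64, 32, or 16 bytes from the compiler's predefined ISA macros. The function names are hypothetical, and the choice reflects the build flags (for example -mavx2), not necessarily the CPU the program ends up running on, so treat it as an assumption rather than a definitive method.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not part of MKL): alignment in bytes matching the
 * widest vector registers the compiler was told to target. */
static size_t vector_alignment(void)
{
#if defined(__AVX512F__)
    return 64;  /* 512-bit registers */
#elif defined(__AVX__)
    return 32;  /* 256-bit registers (AVX and AVX2) */
#else
    return 16;  /* 128-bit SSE registers, used as the fallback */
#endif
}

/* mkl_malloc takes the alignment as an int; check it is a power of two
 * before converting. */
static int as_mkl_alignment(size_t a)
{
    assert(a != 0 && (a & (a - 1)) == 0);
    return (int)a;
}
```

With MKL available, the result would be passed as the second argument, for instance mkl_malloc(n * sizeof(double), as_mkl_alignment(vector_alignment())).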

On my machine (Linux kernel 4.17.11 running on an i7 6700K), the default alignment of mkl_malloc seems to be 128 bytes (for large enough arrays; if they are too small, the value seems to be 32KB). In other words, any value smaller than that has no effect on alignment. I can, however, pass 256, and the data will be aligned to a 256-byte boundary.

In contrast, using malloc gives me 16-byte alignment for 1GB of data and 32-byte alignment for 1KB: whatever the OS gives me, with absolutely no preference regarding alignment.

So using mkl_malloc makes sense, as it ensures you get the alignment you desire. However, that doesn't mean you should set the value too large; that will simply waste memory and potentially expose you to an increased number of cache misses.

In short, you want your data to be aligned to the size of the vector registers in your CPU so that you can make use of the relevant extensions. Using mkl_malloc with some alignment parameter guarantees alignment to at least that value; it can, however, be more. It should be used to make sure the data are aligned the way you want, but there is absolutely no good reason to align to 1MB.
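One way to make this kind of observation is to look at the lowest set bit of the returned address. The sketch below uses only standard C; actual_alignment and is_aligned are hypothetical helper names, and the pointer could come from mkl_malloc, malloc, or anywhere else.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Largest power of two that divides the address, i.e. the alignment the
 * pointer actually has (which may exceed what was requested). */
static size_t actual_alignment(const void *p)
{
    uintptr_t addr = (uintptr_t)p;
    assert(addr != 0);
    return (size_t)(addr & (~addr + 1));  /* isolate the lowest set bit */
}

/* Nonzero if p lies on an `alignment`-byte boundary. */
static int is_aligned(const void *p, size_t alignment)
{
    assert(alignment != 0);
    return ((uintptr_t)p % alignment) == 0;
}
```

A single allocation can land on a larger boundary by chance, so repeating the check over many allocations of different sizes gives a more reliable picture.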

Joeannjoed answered 13/8, 2018 at 14:34 Comment(0)

The only reason you see no penalties or gains from specifying the alignment, regardless of your input, is that you get machine-aligned memory no matter what you type in. So on your processor, which supports AVX, you always get 32-byte-aligned memory.

You will also see that, whatever alignment value you choose, the address that mkl_malloc returns is divisible by 32. Alternatively, you can check that low-level intrinsics such as _mm256_load_pd, which would segfault on an address that is not 32-byte aligned, never segfault.

Some minor details: OSX always gives you a 32-byte-aligned address, independent of heap or stack, when you allocate a chunk of memory, while Linux will always give you aligned memory when allocating on the heap. The stack is a matter of luck on Linux, but even small matrices already exceed the limit for stack allocations. I have no understanding of memory allocation on Windows.

I noticed the latter when I was writing tests for my numerics library, where I use std::vector<typename T, alignment A> for memory allocation, and smaller matrix tests sometimes segfaulted on Linux.

TLDR: your alignment input is effectively discarded and you are getting machine alignment regardless.

Precambrian answered 9/8, 2018 at 7:6 Comment(0)

I think there can be no "best" value for alignment. Depending on your architecture, alignment is generally a property enforced by the hardware, for optimization reasons mostly.

Coming to your specific question, it's important to state what exactly you are allocating memory for. What piece of hardware accesses the memory? For example, I have worked with DMA engines that required the source address to be aligned to the per-transaction transfer size (where xfer size = 4, 8, 16, 32, 128). I have also worked with vector registers, where it was wise to have a 128-bit aligned load.

To summarize: It depends.

Virescent answered 13/8, 2018 at 23:15 Comment(0)
