How to allocate aligned memory only using the standard library?
Asked Answered
T

17

467

I just finished a test as part of a job interview, and one question stumped me, even using Google for reference. I'd like to see what the StackOverflow crew can do with it:

The memset_16aligned function requires a 16-byte aligned pointer passed to it, or it will crash.

a) How would you allocate 1024 bytes of memory, and align it to a 16 byte boundary?
b) Free the memory after the memset_16aligned has executed.

{    
   void *mem;
   void *ptr;

   // answer a) here

   memset_16aligned(ptr, 0, 1024);

   // answer b) here    
}
Thadthaddaus answered 22/10, 2008 at 23:23 Comment(8)
hmmm...for long-term code viability, how about "Fire whoever wrote memset_16aligned and fix it or replace it so that it doesn't have a peculiar boundary condition"Exist
Certainly a valid question to ask - "why the peculiar memory alignment". But there can be good reasons for it - in this case, it could be that the memset_16aligned() can use 128-bit integers and this is easier if the memory is known to be aligned. Etc.Billington
Whoever wrote memset could use internal 16-byte alignment for clearing the inner loop and a small data prolog/epilog to clean up the non-aligned ends. That would be much easier than making coders handle extra memory pointers.Solidago
Just malloc(1024);. All malloc(3) implementations on modern systems are already at least this aligned anyways.Pasteboard
Why would someone want data aligned to a 16 byte boundary? Probably to load it into 128bit SSE registers. I believe the (newer) unaligned movs (eg, movupd, lddqu) are slower, or perhaps they are targeting processors without SSE2/3Articulation
Aligning address leads to optimized usage of cache as well as higher bandwidth between different levels of cache and RAM (for most common workloads). See here https://mcmap.net/q/17221/-purpose-of-memory-alignmentBrough
If there's justice in the world, the answer is ptr=malloc_16aligned(1024); and free_16aligned(ptr) because anyone who writes a usable library needs to provide utilities for accessing it. NB: if malloc_16aligned(.) calls malloc(.) it needs to add an overhead so free_16aligned(ptr) can unshift ptr and call free(.).Directory
@StevenA.Lowe There are valid use-cases for requiring 16-byte aligned memory when you want the using code to be able to run really fast without having to do extra checks. Apple's Metal graphics framework (similar to Vulkan) has this requirement in some places— which you can't complain too much about, since Metal is designed for devices with shared CPU/GPU RAM (and thus, zero-copy handoff, access-locking/unlocking, and mutation updates).Gooey
B
642

Original answer

{
    void *mem = malloc(1024+16);
    void *ptr = ((char *)mem+16) & ~ 0x0F;
    memset_16aligned(ptr, 0, 1024);
    free(mem);
}

Fixed answer

{
    void *mem = malloc(1024+15);
    void *ptr = ((uintptr_t)mem+15) & ~ (uintptr_t)0x0F;
    memset_16aligned(ptr, 0, 1024);
    free(mem);
}

Explanation as requested

The first step is to allocate enough spare space, just in case. Since the memory must be 16-byte aligned (meaning that the leading byte address needs to be a multiple of 16), adding 16 extra bytes guarantees that we have enough space. Somewhere in the first 16 bytes, there is a 16-byte aligned pointer. (Note that malloc() is supposed to return a pointer that is sufficiently well aligned for any purpose. However, the meaning of 'any' is primarily for things like basic types — long, double, long double, long long, and pointers to objects and pointers to functions. When you are doing more specialized things, like playing with graphics systems, they can need more stringent alignment than the rest of the system — hence questions and answers like this.)

The next step is to convert the void pointer to a char pointer; GCC notwithstanding, you are not supposed to do pointer arithmetic on void pointers (and GCC has warning options to tell you when you abuse it). Then add 16 to the start pointer. Suppose malloc() returned you an impossibly badly aligned pointer: 0x800001. Adding the 16 gives 0x800011. Now I want to round down to the 16-byte boundary — so I want to reset the last 4 bits to 0. 0x0F has the last 4 bits set to one; therefore, ~0x0F has all bits set to one except the last four. Anding that with 0x800011 gives 0x800010. You can iterate over the other offsets and see that the same arithmetic works.

The last step, free(), is easy: you always, and only, return to free() a value that one of malloc(), calloc() or realloc() returned to you — anything else is a disaster. You correctly provided mem to hold that value — thank you. The free releases it.

Finally, if you know about the internals of your system's malloc package, you could guess that it might well return 16-byte aligned data (or it might be 8-byte aligned). If it was 16-byte aligned, then you'd not need to dink with the values. However, this is dodgy and non-portable — other malloc packages have different minimum alignments, and therefore assuming one thing when it does something different would lead to core dumps. Within broad limits, this solution is portable.

Someone else mentioned posix_memalign() as another way to get the aligned memory; that isn't available everywhere, but could often be implemented using this as a basis. Note that it was convenient that the alignment was a power of 2; other alignments are messier.

One more comment — this code does not check that the allocation succeeded.

Amendment

Windows Programmer pointed out that you can't do bit mask operations on pointers, and, indeed, GCC (3.4.6 and 4.3.1 tested) does complain like that. So, an amended version of the basic code — converted into a main program, follows. I've also taken the liberty of adding just 15 instead of 16, as has been pointed out. I'm using uintptr_t since C99 has been around long enough to be accessible on most platforms. If it wasn't for the use of PRIXPTR in the printf() statements, it would be sufficient to #include <stdint.h> instead of using #include <inttypes.h>. [This code includes the fix pointed out by C.R., which was reiterating a point first made by Bill K a number of years ago, which I managed to overlook until now.]

#include <assert.h>
#include <inttypes.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

static void memset_16aligned(void *space, char byte, size_t nbytes)
{
    assert((nbytes & 0x0F) == 0);
    assert(((uintptr_t)space & 0x0F) == 0);
    memset(space, byte, nbytes);  // Not a custom implementation of memset()
}

int main(void)
{
    void *mem = malloc(1024+15);
    void *ptr = (void *)(((uintptr_t)mem+15) & ~ (uintptr_t)0x0F);
    printf("0x%08" PRIXPTR ", 0x%08" PRIXPTR "\n", (uintptr_t)mem, (uintptr_t)ptr);
    memset_16aligned(ptr, 0, 1024);
    free(mem);
    return(0);
}

And here is a marginally more generalized version, which will work for sizes which are a power of 2:

#include <assert.h>
#include <inttypes.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

static void memset_16aligned(void *space, char byte, size_t nbytes)
{
    assert((nbytes & 0x0F) == 0);
    assert(((uintptr_t)space & 0x0F) == 0);
    memset(space, byte, nbytes);  // Not a custom implementation of memset()
}

static void test_mask(size_t align)
{
    uintptr_t mask = ~(uintptr_t)(align - 1);
    void *mem = malloc(1024+align-1);
    void *ptr = (void *)(((uintptr_t)mem+align-1) & mask);
    assert((align & (align - 1)) == 0);
    printf("0x%08" PRIXPTR ", 0x%08" PRIXPTR "\n", (uintptr_t)mem, (uintptr_t)ptr);
    memset_16aligned(ptr, 0, 1024);
    free(mem);
}

int main(void)
{
    test_mask(16);
    test_mask(32);
    test_mask(64);
    test_mask(128);
    return(0);
}

To convert test_mask() into a general purpose allocation function, the single return value from the allocator would have to encode the release address, as several people have indicated in their answers.

Problems with interviewers

Uri commented: Maybe I am having [a] reading comprehension problem this morning, but if the interview question specifically says: "How would you allocate 1024 bytes of memory" and you clearly allocate more than that. Wouldn't that be an automatic failure from the interviewer?

My response won't fit into a 300-character comment...

It depends, I suppose. I think most people (including me) took the question to mean "How would you allocate a space in which 1024 bytes of data can be stored, and where the base address is a multiple of 16 bytes". If the interviewer really meant how can you allocate 1024 bytes (only) and have it 16-byte aligned, then the options are more limited.

  • Clearly, one possibility is to allocate 1024 bytes and then give that address the 'alignment treatment'; the problem with that approach is that the actual available space is not properly determinate (the usable space is between 1008 and 1024 bytes, but there wasn't a mechanism available to specify which size), which renders it less than useful.
  • Another possibility is that you are expected to write a full memory allocator and ensure that the 1024-byte block you return is appropriately aligned. If that is the case, you probably end up doing an operation fairly similar to what the proposed solution did, but you hide it inside the allocator.

However, if the interviewer expected either of those responses, I'd expect them to recognize that this solution answers a closely related question, and then to reframe their question to point the conversation in the correct direction. (Further, if the interviewer got really stroppy, then I wouldn't want the job; if the answer to an insufficiently precise requirement is shot down in flames without correction, then the interviewer is not someone for whom it is safe to work.)

The world moves on

The title of the question has changed recently. It was Solve the memory alignment in C interview question that stumped me. The revised title (How to allocate aligned memory only using the standard library?) demands a slightly revised answer — this addendum provides it.

C11 (ISO/IEC 9899:2011) added function aligned_alloc():

7.22.3.1 The aligned_alloc function

Synopsis

#include <stdlib.h>
void *aligned_alloc(size_t alignment, size_t size);

Description
The aligned_alloc function allocates space for an object whose alignment is specified by alignment, whose size is specified by size, and whose value is indeterminate. The value of alignment shall be a valid alignment supported by the implementation and the value of size shall be an integral multiple of alignment.

Returns
The aligned_alloc function returns either a null pointer or a pointer to the allocated space.

And POSIX defines posix_memalign():

#include <stdlib.h>

int posix_memalign(void **memptr, size_t alignment, size_t size);

DESCRIPTION

The posix_memalign() function shall allocate size bytes aligned on a boundary specified by alignment, and shall return a pointer to the allocated memory in memptr. The value of alignment shall be a power of two multiple of sizeof(void *).

Upon successful completion, the value pointed to by memptr shall be a multiple of alignment.

If the size of the space requested is 0, the behavior is implementation-defined; the value returned in memptr shall be either a null pointer or a unique pointer.

The free() function shall deallocate memory that has previously been allocated by posix_memalign().

RETURN VALUE

Upon successful completion, posix_memalign() shall return zero; otherwise, an error number shall be returned to indicate the error.

Either or both of these could be used to answer the question now, but only the POSIX function was an option when the question was originally answered.

Behind the scenes, the new aligned memory function do much the same job as outlined in the question, except they have the ability to force the alignment more easily, and keep track of the start of the aligned memory internally so that the code doesn't have to deal with specially — it just frees the memory returned by the allocation function that was used.

Billington answered 22/10, 2008 at 23:27 Comment(43)
What the mask is doing, is making sure that the lower four bits are 0, which is equivalent with saying that the address is 16-byte aligned (because the address, viewed as an integer, is divisible by 16).Devisal
But first he adds 16 because setting the lowest bits to zero is going to drop the memory address which won't be in the range he allocated. Note also that he adds to the size of the allocation to make room for the fact that it will go up some...Invidious
And i'm rusty with C++, but I don't really trust that ~ 0x0F will properly expand to the size of the pointer. If it doesn't, all hell will break loose because you will mask off the most significant bits of your pointer as well. I could be wrong about that though.Invidious
BTW '+15' works as well as '+16'...no practical impact in this situation though.Corves
The '+ 15' comments from Menkboy and Greg are correct, but malloc() would almost certainly round that up to 16 anyway. Using +16 is marginally easier to explain. The generalized solution is fiddly, but doable.Billington
The 'enough bits' question is interesting. I suppose that writing: ~(uintptr_t)0x0F would guaranteee the right number of bits, improving reliability. It requires C99 for that, though, whereas what's there will work with C89 too (and some pre-standard C; the void pointers get dodgy then.)Billington
My earlier comment (and Menkboy's) are a little incorrect, if you want to only add +15 to the malloc'ed size, you need to special case when malloc returns a 16 byte aligned pointer and not add the +16 to it or you have a buffer overrun by one byte.Bitten
@Jonathan: size_t instead of uintptr_t should work on all architectures, although as far as I know it isn't guaranteed that sizeof(size_t) == sizeof(void*), as it is with uintptr_t.Bitten
No need to special case: just add 15 in the alignment step the same as the alloc step.Sandpit
I don't believe that binary & will work when one operand is a pointer. You'll have to convert both the char* pointer and the 0x0f constant to some sufficiently long integer type, and then convert the result to void*. There is no 100% reliable answer but size_t comes closer than any other type.Whence
@Greg: if you allocate +15, you add +15 before masking :)Billington
@Windows Programmer: Interesting; yes, GCC (3.4.6 and 4.3.1) does complain about the mixture. I'll make an edit to the main answer.Billington
Maybe I am having reading comprehension problem this morning, but if the interview question specifically says: "How would you allocate 1024 bytes of memory" and you clearly allocate more than that. Wouldn't that be an automatic failure from the interviewer?Pareira
@JonathanLeffler.. i understand that the mask to zero down the last 4 bits is used to make the address divisible by 16, could you suggest what would be the mask if i want a 32-bit aligned address?Regurgitation
@DeepanjanMazumdar: It depends on what you mean by '32-bit aligned' address. If you mean aligned on a multiple of 4 bytes, then: (a) you'd almost certainly never have to worry about it since double normally requires 8 byte alignment and the data returned by malloc() is sufficiently aligned for double; and (b) the constants to add would be 3 or 4 and the mask 0x03. If you have some other meaning for 32-bit aligned address, please clarify. (I could conceive of you meaning a 4 GiB alignment, but it seems a bit improbable, and malloc() might not work reliably for that much memory!).Billington
I believe @DeepanjanMazumdar means 32-byte aligned, and the answer already covers that in the general version.Klausenburg
Never seen anything like this. About to start interviewing for jobs. Horrified that I'm sub-par. Please tell me this is as serious a software engineering question as it seems, not something a normal programmer would know/comprehend offhand.Crosseyed
@Aerovistae: It is slightly a trick question, and mostly hinges on your understanding of how to make an arbitrary number (actually the address that is returned by the memory allocator) match a certain requirement (multiple of 16). If you were told to round up 53 to the nearest multiple of 16, how would you do that? The process is not very different for addresses; it's just that the numbers you're typically dealing with are bigger. Don't forget, interview questions are asked to find out how you think, not to find out whether you know the answer.Billington
Just out of curiosity, the printf("0x%08" PRIXPTR ", 0x%08" PRIXPTR "\n", mem, ptr); isn't right, is it? Changing it to printf("%x PRIXPTR , %x PRIXPTR \n", mem, ptr); makes it compile on my machineClower
@akristmann: The original code is correct if you have <inttypes.h> from C99 available (at least for the format string — arguably, the values should be passed with a cast: (uintptr_t)mem, (uintptr_t)ptr). The format string relies on string concatenation and the PRIXPTR macro is the correct printf() length and type specifier for hex output for a uintptr_t value. The alternative is to use %p but the output from that varies by platform (some add a leading 0x, most don't) and is typically written with lower-case hex digits, which I dislike; what I wrote is uniform across platforms.Billington
You should cast 0x0F to uintptr_t too before the bitwise NOT and AND operation because some compilers will treat an integer literal 0x0F as 32-bit.Frightful
@C.R. Hmmm...the bitwise & is OK; since one side is already converted to uintptr_t, the compiler must convert the other to uintptr_t too, so that 0x0F is handled correctly. However, I can't wriggle out of the bitwise ~ problem so easily; you're correct that should be cast (it has to be a cast; there isn't a reliable suffix notation like ULL that will work for all possible variants of uintptr_t) before being bitwise inverted. Thanks — well done for spotting it.Billington
@JonathanLeffler: I am just reiterating Bill K's comment a year ago. His words are not very clear in meaning, which may be why it evaded your attention.Frightful
@C.R. — answer updated giving credit to both of you. Thanks both for pointing out the problem and for saying that Bill K spotted it first. I'm not sure how/why I missed it before; probably, too many edits and comments in a flurry at the time.Billington
This taught me a little bit more about pointers and memory allocation - I did some bit-masking stuff two years ago for an architecture course; this is some pretty cool stuff :)Besetting
@JonathanLeffler Is it possible for either of the two examples you provided (thank you, by the way) to work for memory sizes that are not powers of 2? Additionally, would it be sufficient to just change the integer literals 15 and 0xF to extend this code safely to do things like 32, 48, and 64 byte alignment? Thanks!Palatial
@Dogbert: you could modify the code to work with sizes other than a power of two, but not using bit masking — you'd need to do arithmetic on the uintptr_t value corresponding to the address. Changing the constants to handle powers of two (which are constrained to be at least 16) is handled by the last code sample. Handling a multiple of 48 would require general arithmetic. For allocating S bytes aligned on an N byte boundary, allocate (S + N - 1) bytes, which would be allocated at address M, and return ((M + N - 1) / N) * N, with suitable handling of casts, etc.Billington
Another explanation here: jongampark.wordpress.com/2008/06/12/…Unwritten
Be wary, that if you use g++, you must have __STDC_FORMAT_MACROS macro defined: #define __STDC_FORMAT_MACROSDefense
@VladislavsBurakovs: can you characterize which version of G++ and which platform? For example, on Mac OS X 10.9.5 using GCC 4.9.1, the macros such as PRIXPTR are available to g++ without specifying __STDC_FORMAT_MACROS. I did need to fix the calls to printf() to include the (uintptr_t) casts. The code shown compiles cleanly with gcc -O3 -g -std=c11 -Wall -Wextra -Wmissing-prototypes -Wstrict-prototypes -Wold-style-definition -Werror memalign2.c -o memalign2 and g++ -O3 -g -std=c++11 -Wall -Wextra -Werror memalign3.cpp -o memalign3.Billington
@Jonathan Leffler: I'm using Windows 7 with MinGW onboard (gcc version 4.8.1). MinGW's inttypes.h uses these guards around the actual definitions: #if !defined(__cplusplus) || defined(__STDC_FORMAT_MACROS)Defense
Well, 4.8.1 isn't all that old, that's for sure. C99 included a footnote about __STDC_FORMAT_MACROS. (Footnote 182: C++ implementations should define these macros only when __STDC_FORMAT_MACROS is defined before <inttypes.h> is included.) C11 does not contain that footnote. I guess that the environment it runs in takes that to heart; it appears that either GCC 4.9.1 (compared with 4.8.1) or Mac OS X 10.9.5 (compared with Windows 7) follows the C11 standard rather than the C99 standard. I guess it is worth pointing out, though the question is actually about C rather than C++.Billington
I noticed that standard doesn't define casting a pointer to uintptr_t changing its value and casting back. Perhaps casting to (char*) and adding the correct offset would be defined.Revulsive
@2501: If you want a standards compliant solution, use posix_memalign(). You're right, but I don't think it's material. It would be feasible to design a solution which converts the pointer to a char *, adds an appropriate number of bytes calculated in a manner similar to what is done now, and then converts that char * back to a void *, if you wish to be more sure. In practice, I don't think you'll find a compiler that has problems. If there was one, it would be something like the one I learned C on, where byte pointers had different values from 'word pointers' for the same address.Billington
@JonathanLeffler I'm kinda'ish handling my own memory for structures and I often use your solution, to reduce malloc calls. Do you mean near and far pointers? I don't think those are in the standard.Revulsive
Actually I think I figured out how to make it completely portable. Malloc an aligned pointer( you must know the minimum returned alignment, which malloc guarantees ) and then calculate offset in a separate integer, and only increment the pointer using (char*). You never need to check the bits of the pointer.Revulsive
@JonathanLeffler: IMHO, C should have included a bitwise "and not" operator, but since it didn't an alternative in this case is & -16 since promotion to an unsigned type must yield the value of that type which, when added to 16, would yield zero.Punishment
Is there a way to get the minimum alignment requirement on a system ?Warfore
What do you mean by 'the minimum alignment requirement'? Normally this is 'one byte alignment' for char type. If you mean 'maximum' or 'most stringent' alignment, there isn't a simple way, but you can usually get the information out of a union using long double, double, void *, void (*)(void), long long, long — it's unlikely you'll need other types. Create a union of those types union Align { … };; create a structure struct Alignment { char c; union Align u; }. The value of offsetof(struct Alignment, u) gives you the maximum alignment requirement on your platform.Billington
I think the best thing about this answer is that it highlights the nuances with pointers and casting - uintptr_t might not store a (void *) in the obvious way - think early RISC machines where everything was 4 or 8 byte aligned, and char / short access was synthetic...Savage
Move C11 / posix examples to top of answer :-)Fichte
Basic question, How does adding 15 byte make sure we have enough space and we don't mess up with the pointer that is returned while doing & ~15? for example if I want to allocate 3 byte and align the pointer by 16 byte.....how the math works?Pacify
@JonathanLeffler, Note that malloc() is supposed to return a pointer that is sufficiently well aligned for any purpose I'm afraid that's the reason they asked him for a 16 byte aligned pointer, because any malloc will acomplish at least 8 byte alignment (for a pointer or a double) the correction you publish is also a clever and elegant improvement. Congratulations!Atticism
S
64

Three slightly different answers depending how you look at the question:

1) Good enough for the exact question asked is Jonathan Leffler's solution, except that to round up to 16-aligned, you only need 15 extra bytes, not 16.

A:

/* allocate a buffer with room to add 0-15 bytes to ensure 16-alignment */
void *mem = malloc(1024+15);
ASSERT(mem); // some kind of error-handling code
/* round up to multiple of 16: add 15 and then round down by masking */
void *ptr = ((char*)mem+15) & ~ (size_t)0x0F;

B:

free(mem);

2) For a more generic memory allocation function, the caller doesn't want to have to keep track of two pointers (one to use and one to free). So you store a pointer to the 'real' buffer below the aligned buffer.

A:

void *mem = malloc(1024+15+sizeof(void*));
if (!mem) return mem;
void *ptr = ((char*)mem+sizeof(void*)+15) & ~ (size_t)0x0F;
((void**)ptr)[-1] = mem;
return ptr;

B:

if (ptr) free(((void**)ptr)[-1]);

Note that unlike (1), where only 15 bytes were added to mem, this code could actually reduce the alignment if your implementation happens to guarantee 32-byte alignment from malloc (unlikely, but in theory a C implementation could have a 32-byte aligned type). That doesn't matter if all you do is call memset_16aligned, but if you use the memory for a struct then it could matter.

I'm not sure off-hand what a good fix is for this (other than to warn the user that the buffer returned is not necessarily suitable for arbitrary structs) since there's no way to determine programatically what the implementation-specific alignment guarantee is. I guess at startup you could allocate two or more 1-byte buffers, and assume that the worst alignment you see is the guaranteed alignment. If you're wrong, you waste memory. Anyone with a better idea, please say so...

[Added: The 'standard' trick is to create a union of 'likely to be maximally aligned types' to determine the requisite alignment. The maximally aligned types are likely to be (in C99) 'long long', 'long double', 'void *', or 'void (*)(void)'; if you include <stdint.h>, you could presumably use 'intmax_t' in place of long long (and, on Power 6 (AIX) machines, intmax_t would give you a 128-bit integer type). The alignment requirements for that union can be determined by embedding it into a struct with a single char followed by the union:

struct alignment
{
    char     c;
    union
    {
        intmax_t      imax;
        long double   ldbl;
        void         *vptr;
        void        (*fptr)(void);
    }        u;
} align_data;
size_t align = (char *)&align_data.u.imax - &align_data.c;

You would then use the larger of the requested alignment (in the example, 16) and the align value calculated above.

On (64-bit) Solaris 10, it appears that the basic alignment for the result from malloc() is a multiple of 32 bytes.
]

In practice, aligned allocators often take a parameter for the alignment rather than it being hardwired. So the user will pass in the size of the struct they care about (or the least power of 2 greater than or equal to that) and all will be well.

3) Use what your platform provides: posix_memalign for POSIX, _aligned_malloc on Windows.

4) If you use C11, then the cleanest - portable and concise - option is to use the standard library function aligned_alloc that was introduced in this version of the language specification.

Sandpit answered 22/10, 2008 at 23:23 Comment(9)
I agree - I think the intent of the question is that the code that frees the memory block would have access only to the 'cooked' 16-byte aligned pointer.Rafaello
For a general solution - you are right. However, the code template in the question clearly shows both.Billington
Sure, and in a good interview what happens is that you give your answer, then if the interviewer does want to see my answer, they change the question.Sandpit
I object to using ASSERT(mem); to check allocation results; assert is for catching programming errors and not lack of run-time resources.Dialectology
@hloval: ASSERT is a placeholder, it is not a standard macro, and may or may not be related in any way to the standard macro assert. I'm not going to write an unchecked allocation, but in general it's not possible to guess how questioners' programs might handle memory allocation failure. In this case though, since the second case is an allocation routine, I suppose I can guess - returning a null pointer is a reasonable way to indicate failure.Sandpit
Using binary & with a char * and a size_t will result in an error. You'd have to use something like uintptr_t.Intersidereal
Basic question, How does adding 15 byte make sure we have enough space and we don't mess up with the pointer that is returned while doing & ~15? for example if I want to allocate 3 byte and align the pointer by 16 byte.....how the math works?Pacify
@RaGa__M: If you allocate a buffer ptr of size n+15, then one of the following addresses must be 16-aligned, because they're all different modulo 16, and there are 16 of them: (char*) ptr, ((char*) ptr) + 1, ((char*) ptr) + 2 ... ((char*) ptr) + 15. So, that's why 15 bytes is enough. The expression with &~15 chooses one of those 16 pointers (because ~15 is the bit pattern with all 1 except for 0000 at the end, so it chooses the one ending 0000, which is the 16-aligned one).Sandpit
Whichever one it chooses, there are at least n bytes you can legally access in the buffer beyond that point. So, you have a usable 16-aligned buffer of size at least n.Sandpit
D
40

You could also try posix_memalign() (on POSIX platforms, of course).

Devisal answered 22/10, 2008 at 23:36 Comment(2)
And _aligned_malloc on Windows.Sandpit
Adding to this a few years later, the "aligned_alloc" function is now a part of the C11 specification: open-std.org/jtc1/sc22/wg14/www/docs/n1516.pdf (page 346)Interosculate
S
21

Here's an alternate approach to the 'round up' part. Not the most brilliantly coded solution but it gets the job done, and this type of syntax is a bit easier to remember (plus would work for alignment values that aren't a power of 2). The uintptr_t cast was necessary to appease the compiler; pointer arithmetic isn't very fond of division or multiplication.

void *mem = malloc(1024 + 15);
void *ptr = (void*) ((uintptr_t) mem + 15) / 16 * 16;
memset_16aligned(ptr, 0, 1024);
free(mem);
Screak answered 23/10, 2008 at 0:46 Comment(2)
In general, where you have 'unsigned long long', you also have uintptr_t which is explicitly defined to be big enough to hold a data pointer (void *). But your solution does indeed have merits if, for some reason, you needed an alignment that was not a power of 2. Unlikely, but possible.Billington
@Andrew: Upvoted for this type of syntax is a bit easier to remember (plus would work for alignment values that aren't a power of 2).Bookstand
U
20

Unfortunately, in C99 it seems pretty tough to guarantee alignment of any sort in a way which would be portable across any C implementation conforming to C99. Why? Because a pointer is not guaranteed to be the "byte address" one might imagine with a flat memory model. Neither is the representation of uintptr_t so guaranteed, which itself is an optional type anyway.

We might know of some implementations which use a representation for void * (and by definition, also char *) which is a simple byte address, but by C99 it is opaque to us, the programmers. An implementation might represent a pointer by a set {segment, offset} where offset could have who-knows-what alignment "in reality." Why, a pointer could even be some form of hash table lookup value, or even a linked-list lookup value. It could encode bounds information.

In a recent C1X draft for a C Standard, we see the _Alignas keyword. That might help a bit.

The only guarantee C99 gives us is that the memory allocation functions will return a pointer suitable for assignment to a pointer pointing at any object type. Since we cannot specify the alignment of objects, we cannot implement our own allocation functions with responsibility for alignment in a well-defined, portable manner.

It would be good to be wrong about this claim.

Unhitch answered 7/8, 2010 at 10:36 Comment(1)
C11 has aligned_alloc(). (C++11 / 14 / 1z still don't have it). _Alignas() and C++ alignas() don't do anything for dynamic allocation, only for automatic and static storage (or struct layout).Phylissphyll
S
15

On the 16 vs 15 byte-count padding front, the actual number you need to add to get an alignment of N is max(0,N-M) where M is the natural alignment of the memory allocator (and both are powers of 2).

Since the minimal memory alignment of any allocator is 1 byte, 15=max(0,16-1) is a conservative answer. However, if you know your memory allocator is going to give you 32-bit int aligned addresses (which is fairly common), you could have used 12 as a pad.

This isn't important for this example but it might be important on an embedded system with 12K of RAM where every single int saved counts.

The best way to implement it if you're actually going to try to save every byte possible is as a macro so you can feed it your native memory alignment. Again, this is probably only useful for embedded systems where you need to save every byte.

In the example below, on most systems, the value 1 is just fine for MEMORY_ALLOCATOR_NATIVE_ALIGNMENT, however for our theoretical embedded system with 32-bit aligned allocations, the following could save a tiny bit of precious memory:

#define MEMORY_ALLOCATOR_NATIVE_ALIGNMENT    4
#define ALIGN_PAD2(N,M) (((N)>(M)) ? ((N)-(M)) : 0)
#define ALIGN_PAD(N) ALIGN_PAD2((N), MEMORY_ALLOCATOR_NATIVE_ALIGNMENT)
Solidago answered 21/10, 2009 at 16:40 Comment(0)
D
9

Perhaps they would have been satisfied with a knowledge of memalign? And as Jonathan Leffler points out, there are two newer preferable functions to know about.

Oops, florin beat me to it. However, if you read the man page I linked to, you'll most likely understand the example supplied by an earlier poster.

Disciplinarian answered 22/10, 2008 at 23:42 Comment(1)
Note that the current (February 2016) version of the referenced page says "The memalign function is obsolete and aligned_alloc or posix_memalign should be used instead". I don't know what it said in October 2008 — but it probably didn't mention aligned_alloc() as that was added to C11.Billington
S
5

I'm surprised noone's voted up Shao's answer that, as I understand it, it is impossible to do what's asked in standard C99, since converting a pointer to an integral type formally is undefined behavior. (Apart from the standard allowing conversion of uintptr_t <-> void*, but the standard does not seem to allow doing any manipulations of the uintptr_t value and then converting it back.)

Samekh answered 14/7, 2011 at 16:34 Comment(1)
There's no requirement that a uintptr_t type exist, or that its bits have any relation to bits in the underlying pointer. If one were to over-allocate storage, store the pointer as an unsigned char* myptr; and then compute `mptr += (16-(uintptr_t)my_ptr) & 0x0F, behavior would be defined on all implementations that define my_ptr, but whether the resulting pointer would be aligned would depend upon the mapping between uintptr_t bits and addresses.Punishment
C
5

We do this sort of thing all the time for Accelerate.framework, a heavily vectorized OS X / iOS library, where we have to pay attention to alignment all the time. There are quite a few options, one or two of which I didn't see mentioned above.

The fastest method for a small array like this is just stick it on the stack. With GCC / clang:

 void my_func( void )
 {
     uint8_t array[1024] __attribute__ ((aligned(16)));
     ...
 }

No free() required. This is typically two instructions: subtract 1024 from the stack pointer, then AND the stack pointer with -alignment. Presumably the requester needed the data on the heap because its lifespan of the array exceeded the stack or recursion is at work or stack space is at a serious premium.

On OS X / iOS all calls to malloc/calloc/etc. are always 16 byte aligned. If you needed 32 byte aligned for AVX, for example, then you can use posix_memalign:

void *buf = NULL;
int err = posix_memalign( &buf, 32 /*alignment*/, 1024 /*size*/);
if( err )
   RunInCirclesWaivingArmsWildly();
...
free(buf);

Some folks have mentioned the C++ interface that works similarly.

It should not be forgotten that pages are aligned to large powers of two, so page-aligned buffers are also 16 byte aligned. Thus, mmap() and valloc() and other similar interfaces are also options. mmap() has the advantage that the buffer can be allocated preinitialized with something non-zero in it, if you want. Since these have page aligned size, you will not get the minimum allocation from these, and it will likely be subject to a VM fault the first time you touch it.

Cheesy: Turn on guard malloc or similar. Buffers that are n*16 bytes in size such as this one will be n*16 bytes aligned, because VM is used to catch overruns and its boundaries are at page boundaries.

Some Accelerate.framework functions take in a user supplied temp buffer to use as scratch space. Here we have to assume that the buffer passed to us is wildly misaligned and the user is actively trying to make our life hard out of spite. (Our test cases stick a guard page right before and after the temp buffer to underline the spite.) Here, we return the minimum size we need to guarantee a 16-byte aligned segment somewhere in it, and then manually align the buffer afterward. This size is desired_size + alignment - 1. So, In this case that is 1024 + 16 - 1 = 1039 bytes. Then align as so:

#include <stdint.h>
void My_func( uint8_t *tempBuf, ... )
{
    uint8_t *alignedBuf = (uint8_t*) 
                          (((uintptr_t) tempBuf + ((uintptr_t)alignment-1)) 
                                        & -((uintptr_t) alignment));
    ...
}

Adding alignment-1 will move the pointer past the first aligned address and then ANDing with -alignment (e.g. 0xfff...ff0 for alignment=16) brings it back to the aligned address.

As described by other posts, on other operating systems without 16-byte alignment guarantees, you can call malloc with the larger size, set aside the pointer for free() later, then align as described immediately above and use the aligned pointer, much as described for our temp buffer case.

As for aligned_memset, this is rather silly. You only have to loop in up to 15 bytes to reach an aligned address, and then proceed with aligned stores after that with some possible cleanup code at the end. You can even do the cleanup bits in vector code, either as unaligned stores that overlap the aligned region (providing the length is at least the length of a vector) or using something like movmaskdqu. Someone is just being lazy. However, it is probably a reasonable interview question if the interviewer wants to know whether you are comfortable with stdint.h, bitwise operators and memory fundamentals, so the contrived example can be forgiven.

Coffeecolored answered 5/6, 2014 at 5:19 Comment(0)
P
4

usage of memalign, Aligned-Memory-Blocks might be a good solution for the problem.

Phosphorism answered 12/10, 2010 at 18:9 Comment(1)
Note that the current (February 2016) version of the referenced page says "The memalign function is obsolete and aligned_alloc or posix_memalign should be used instead". I don't know what it said in October 2010.Billington
K
3

The first thing that popped into my head when reading this question was to define an aligned struct, instantiate it, and then point to it.

Is there a fundamental reason I'm missing since no one else suggested this?

As a sidenote, since I used an array of char (assuming the system's char is 8 bits (i.e. 1 byte)), I don't see the need for the __attribute__((packed)) necessarily (correct me if I'm wrong), but I put it in anyway.

This works on two systems I tried it on, but it's possible that there is a compiler optimization that I'm unaware of giving me false positives vis-a-vis the efficacy of the code. I used gcc 4.9.2 on OSX and gcc 5.2.1 on Ubuntu.

#include <stdio.h>
#include <stdlib.h>

int main ()
{

   void *mem;

   void *ptr;

   // answer a) here
   struct __attribute__((packed)) s_CozyMem {
       char acSpace[16];
   };

   mem = malloc(sizeof(struct s_CozyMem));
   ptr = mem;

   // memset_16aligned(ptr, 0, 1024);

   // Check if it's aligned
   if(((unsigned long)ptr & 15) == 0) printf("Aligned to 16 bytes.\n");
   else printf("Rubbish.\n");

   // answer b) here
   free(mem);

   return 1;
}
Kerry answered 10/5, 2016 at 21:28 Comment(0)
C
1

MacOS X specific:

  1. All pointers allocated with malloc are 16 bytes aligned.
  2. C11 is supported, so you can just call aligned_malloc (16, size).

  3. MacOS X picks code that is optimised for individual processors at boot time for memset, memcpy and memmove and that code uses tricks that you've never heard of to make it fast. 99% chance that memset runs faster than any hand-written memset16 which makes the whole question pointless.

If you want a 100% portable solution, before C11 there is none. Because there is no portable way to test alignment of a pointer. If it doesn't have to be 100% portable, you can use

char* p = malloc (size + 15);
p += (- (unsigned int) p) % 16;

This assumes that the alignment of a pointer is stored in the lowest bits when converting a pointer to unsigned int. Converting to unsigned int loses information and is implementation defined, but that doesn't matter because we don't convert the result back to a pointer.

The horrible part is of course that the original pointer must be saved somewhere to call free () with it. So all in all I would really doubt the wisdom of this design.

Crompton answered 25/11, 2013 at 13:23 Comment(2)
Where are you finding aligned_malloc in OS X? I'm using Xcode 6.1 and it's not defined anywhere in the iOS SDK, nor is it declared anywhere in /usr/include/*.Solicitous
Ditto for XCode 7.2 on El Capitan (Mac OS X 10.11.3). The C11 function is, in any case, aligned_alloc(), but that isn't declared either. From GCC 5.3.0, I get the interesting messages alig.c:7:15: error: incompatible implicit declaration of built-in function ‘aligned_alloc’ [-Werror] and alig.c:7:15: note: include ‘<stdlib.h>’ or provide a declaration of ‘aligned_alloc’. The code did indeed include <stdlib.h>, but neither -std=c11 nor -std=gnu11 changed the error messages.Billington
L
0

You can also add some 16 bytes and then push the original ptr to 16bit aligned by adding the (16-mod) as below the pointer :

main(){
void *mem1 = malloc(1024+16);
void *mem = ((char*)mem1)+1; // force misalign ( my computer always aligns)
printf ( " ptr = %p \n ", mem );
void *ptr = ((long)mem+16) & ~ 0x0F;
printf ( " aligned ptr = %p \n ", ptr );

printf (" ptr after adding diff mod %p (same as above ) ", (long)mem1 + (16 -((long)mem1%16)) );


free(mem1);
}
Liminal answered 25/3, 2013 at 18:27 Comment(0)
B
0

If there are constraints that, you cannot waste a single byte, then this solution works: Note: There is a case where this may be executed infinitely :D

   void *mem;  
   void *ptr;
try:
   mem =  malloc(1024);  
   if (mem % 16 != 0) {  
       free(mem);  
       goto try;
   }  
   ptr = mem;  
   memset_16aligned(ptr, 0, 1024);
Brough answered 25/11, 2013 at 14:0 Comment(2)
There's a very good chance that if you allocate and then free a block of N bytes and then request another block of N bytes, that the original block will be returned again. So an infinite loop is very likely if the first allocation doesn't meet the alignment requirement. Of course, that avoids wasting a single byte at the cost of wasting a lot of CPU cycles.Billington
Are you sure the % operator is defined for void* in a meaningful way?Vitalism
G
0

For the solution i used a concept of padding which aligns the memory and do not waste the memory of a single byte .

If there are constraints that, you cannot waste a single byte. All pointers allocated with malloc are 16 bytes aligned.

C11 is supported, so you can just call aligned_alloc (16, size).

void *mem = malloc(1024+16);
void *ptr = ((char *)mem+16) & ~ 0x0F;
memset_16aligned(ptr, 0, 1024);
free(mem);
Grossularite answered 14/3, 2014 at 14:5 Comment(1)
On many 64-bit systems, the pointer returned by malloc() is indeed aligned on a 16-byte boundary, but nothing in any standard guarantees that — it will simply be sufficiently well aligned for any use, and on many 32-bit systems aligning on an 8-byte boundary is sufficient, and for some, a 4-byte boundary is sufficient.Billington
N
-1
size =1024;
alignment = 16;
aligned_size = size +(alignment -(size %  alignment));
mem = malloc(aligned_size);
memset_16aligned(mem, 0, 1024);
free(mem);

Hope this one is the simplest implementation, let me know your comments.

Nothingness answered 6/11, 2019 at 18:46 Comment(0)
K
-3
long add;   
mem = (void*)malloc(1024 +15);
add = (long)mem;
add = add - (add % 16);//align to 16 byte boundary
ptr = (whatever*)(add);
Kevin answered 4/9, 2012 at 8:58 Comment(2)
I think there is a problem with this because your add will point to a location that is not malloc'd - Not sure how this worked on yours.Liminal
@Sam It should be add += 16 - (add % 16). (2 - (2 % 16)) == 0.Ciceronian

© 2022 - 2024 — McMap. All rights reserved.