Asked 12/6, 2011 at 5:10 Answered 29/7, 2018 at 23:50

How do I align a pointer to a 16 byte boundary?

I found this code, not sure if its correct

char* p= malloc(1024);

if ((((unsigned long) p) % 16) != 0) 
{
     unsigned char *chpoint = (unsigned char *)p;        
     chpoint += 16 - (((unsigned long) p) % 16);
     p = (char *)chpoint;
}

Would this work?

thanks

Eleonoraeleonore answered 12/6, 2011 at 5:10 Comment(6)

What toolchain and platform? Alignment is a property of the implementation. Also, why do you want to do this? Do you really mean C++? Because your code is valid C but not valid C++. – Augur 12/6, 2011 at 5:12

https://mcmap.net/q/162887/-how-to-align-a-pointer-in-c may be of interest (since you appear to be writing C anyway). – Augur 12/6, 2011 at 5:14

Do not use that code. If you try to free p it's going to cause you all kinds of problems. Furthermore, you no longer have a guarantee about how much memory you actually have (somewhere between 1009 and 1024) – Rinderpest 12/6, 2011 at 5:15

I had a piece of code that allowed you to do something like this (provided that you used a custom free function), but which did not require outside memory -- it simply allocated a bit more than you requested, and freed that at the end. I'll post it here soon, if I find it. – Perinephrium 12/6, 2011 at 6:2

I re-wrote it and posted it; try using it and see if it's helpful. – Perinephrium 12/6, 2011 at 6:17

if you have access to MSVC (express or not), it comes with the source for the crt, in particular, _aligned_malloc, which has a whole long explanation on how to align a pointer to an arbitrary value and retain the original pointer (needless to say it has the code too :P). – Exiguous 12/6, 2011 at 7:3

C++0x proposes std::align, which does just that.

// get some memory
T* const p = ...;
std::size_t const size = ...;

void* start = p;
std::size_t space = size;
void* aligned = std::align(16, 1024, p, space);
if(aligned == nullptr) {
    // failed to align
} else {
    // here, p is aligned to 16 and points to at least 1024 bytes of memory
    // also p == aligned
    // size - space is the amount of bytes used for alignment
}

which seems very low-level. I think

// also available in Boost flavour
using storage = std::aligned_storage_t<1024, 16>;
auto p = new storage;

also works. You can easily run afoul of aliasing rules though if you're not careful. If you had a precise scenario in mind (fit N objects of type T at a 16 byte boundary?) I think I could recommend something nicer.

Agitato answered 12/6, 2011 at 5:24 Comment(2)

This is an old answer, nevertheless there are two points: 1. typedef std::aligned_storage<1024, 16> storage; is wrong. storage should be typedefed to either aligned_storage_t<1024,16> or aligned_storage<1024,16>::type if this approach is to be used. 2. There is no guarantee for that to work with overaligned types. It may be the case that alignof(aligned_storage_t<1024,N>) < N if N > alignof(std::max_align_t). – Zeldazelde 22/5, 2016 at 14:56

@Zeldazelde Thanks, fixed. As for over-aligned types, that’s less to do with std::aligned_storage_t itself and more with their implementation-defined nature. What to do exactly is best left to the programmer. – Agitato 22/5, 2016 at 19:40

Try this:

It returns aligned memory and frees the memory, with virtually no extra memory management overhead.

#include <malloc.h>
#include <assert.h>

size_t roundUp(size_t a, size_t b) { return (1 + (a - 1) / b) * b; }

// we assume here that size_t and void* can be converted to each other
void *malloc_aligned(size_t size, size_t align = sizeof(void*))
{
    assert(align % sizeof(size_t) == 0);
    assert(sizeof(void*) == sizeof(size_t)); // not sure if needed, but whatever

    void *p = malloc(size + 2 * align);  // allocate with enough room to store the size
    if (p != NULL)
    {
        size_t base = (size_t)p;
        p = (char*)roundUp(base, align) + align;  // align & make room for storing the size
        ((size_t*)p)[-1] = (size_t)p - base;      // store the size before the block
    }
    return p;
}

void free_aligned(void *p) { free(p != NULL ? (char*)p - ((size_t*)p)[-1] : p); }

Warning:

I'm pretty sure I'm stepping on parts of the C standard here, but who cares. :P

Perinephrium answered 12/6, 2011 at 6:17 Comment(8)

This is a couple of years late, but there's a cool intptr_t stashed away in cstdint that is guaranteed to be convertible to and from pointers and is guaranteed to support arithmetic. – Couchman 31/1, 2014 at 14:23

@mehrdad can you put some comments in you code. Also why you used (2*align) "void *p = malloc(size + 2 * align);" in malloc and let's if I want alignment of 8/16 that the above code will work fine ? – Chosen 16/1, 2016 at 16:48

@eswaat: roundUp(base, align) + align can easily increment base up by more than just align. – Perinephrium 16/1, 2016 at 16:52

@Mehrdad Sorry didn't get you, can you please elaborate or put some comments in your code like why 2*align etc etc ? Asking more but this is first time so. Thanks. – Chosen 16/1, 2016 at 16:55

@eswaat: Does this help? – Perinephrium 16/1, 2016 at 17:32

@Mehrdad Yes better Thanks ! So, I was playing with your code and used one align instead of double and still it is giving me correct answer like this "void *p = malloc(size + align);" PS: I am using alignment of 8 not void *. So I am still confuse with we need double of alignment. Rest of stuff are clear. If you can give one example why we need double of alignment that will be good. – Chosen 16/1, 2016 at 17:38

@eswaat: If base == 0x3 and size == 1 and align == 4 and if the allocated memory is instead of size size + 1 * align == 5 then where do you propose the size could be stored? Address 0x0 doesn't belong to you, and if you store it at address 0x4 then you won't have room after it for the originally requested 1 byte. – Perinephrium 16/1, 2016 at 18:31

Let us continue this discussion in chat. – Chosen 16/1, 2016 at 18:48

In glibc library malloc, realloc always returns 8 bytes aligned. If you want to allocate memory with some alignment which is a higher power 2 then you can use memalign and posix_memalign. Read http://www.gnu.org/s/hello/manual/libc/Aligned-Memory-Blocks.html

Mancunian answered 12/6, 2011 at 6:8 Comment(0)

posix_memalign is one way: http://pubs.opengroup.org/onlinepubs/009695399/functions/posix_memalign.html as long as your size is a power of two.

The problem with the solution you provide is that you run the risk of writing off the end of your allocated memory. An alternative solution is to alloc the size you want + 16 and to use a similar trick to the one you're doing to get a pointer that is aligned, but still falls within your allocated region. That said, I'd use posix_memalign as a first solution.

Collincolline answered 12/6, 2011 at 5:14 Comment(0)

Updated: New Faster Algorithm

Don't use modulo because it takes hundreds of clock cycles on x86 due to the nasty division and a lot more on other systems. I came up with a faster version of std::align than GCC and Visual-C++. Visual-C++ has the slowest implementation, which actually uses an amateurish conditional statement. GCC is very similar to my algorithm but I did the opposite of what they did but my algorithm is 13.3 % faster because it has 13 as opposed to 15 single-cycle instructions. See here is the research paper with dissassembly. The algorithm is actually one instruction faster if you use the mask instead of the pow_2.

/* Quickly aligns the given pointer to a power of two boundaries.
@return An aligned pointer of typename T.
@desc Algorithm is a 2's compliment trick that works by masking off
the desired number in 2's compliment and adding them to the
pointer. Please note how I took the horizontal comment whitespace back.
@param pointer The pointer to align.
@param mask Mask for the lower LSb, which is one less than the power of 
2 you wish to align too. */
template <typename T = char>
inline T* AlignUp(void* pointer, uintptr_t mask) {
  intptr_t value = reinterpret_cast<intptr_t>(pointer);
  value += (-value) & mask;
  return reinterpret_cast<T*>(value);
}

Here is how you call it:

enum { kSize = 256 };
char buffer[kSize + 16];
char* aligned_to_16_byte_boundary = AlignUp<> (buffer, 15); //< 16 - 1 = 15
char16_t* aligned_to_64_byte_boundary = AlignUp<char16_t> (buffer, 63);

Here is the quick bit-wise proof for 3 bits, it works the same for all bit counts:

~000 = 111 => 000 + 111 + 1 = 0x1000
~001 = 110 => 001 + 110 + 1 = 0x1000
~010 = 101 => 010 + 101 + 1 = 0x1000
~011 = 100 => 011 + 100 + 1 = 0x1000
~100 = 011 => 100 + 011 + 1 = 0x1000
~101 = 010 => 101 + 010 + 1 = 0x1000
~110 = 001 => 110 + 001 + 1 = 0x1000
~111 = 000 => 111 + 000 + 1 = 0x1000

Just in case you're here to learn how to align to a cache line an object in C++11, use the in-place constructor:

struct Foo { Foo () {} };
Foo* foo = new (AlignUp<Foo> (buffer, 63)) Foo ();

Here is the std::align implmentation, it uses 24 instructions where the GCC implementation uses 31 instructions, though it can be tweaked to eliminate a decrement instruction by turning (--align) to the mask for the Least Significant bits but that would not operate functionally identical to std::align.

inline void* align(size_t align, size_t size, void*& ptr,
                   size_t& space) noexcept {
   intptr_t int_ptr = reinterpret_cast<intptr_t>(ptr),
           offset = (-int_ptr) & (--align);
  if ((space -= offset) < size) {
    space += offset;
    return nullptr;
  }
  return reinterpret_cast<void*>(int_ptr + offset);
}

Faster to Use mask rather than pow_2

Here is the code for aligning using a mask rather than the the pow_2 (which is the even power of 2). This is 20% fatert than the GCC algorithm but requires you to store the mask rather than the pow_2 so it's not interchangable.

inline void* AlignMask(size_t mask, size_t size, void*& ptr,
                   size_t& space) noexcept {
   intptr_t int_ptr = reinterpret_cast<intptr_t>(ptr),
           offset = (-int_ptr) & mask;
  if ((space -= offset) < size) {
    space += offset;
    return nullptr;
  }
  return reinterpret_cast<void*>(int_ptr + offset);
}

Bots answered 29/7, 2018 at 23:50 Comment(7)

I am very happy with the new algorithm. – Bots 19/8, 2018 at 6:42

Two things: (1) use optimization flags (-O2), if you do a benchmark. (2) speed comparison is not just comparing instruction count. – Gonophore 19/8, 2018 at 7:5

Oops, thanks for pointing that out. I'm still learning how to benchmark. You can benchmark based on instruction count if all of the instructions are single-cycle. I turned on O2 Optimizations. It looks like the instruction count is now 13 to 15 single-cycle excluding jump, so it's only 13.3% faster, but those two extra instructions do put me ahead. But that is with using the pow_2 rather than mask so it's fairer to say 12 to 15 instructions, or 20% faster. – Bots 19/8, 2018 at 16:50

In this case, since all of the instructions are single-cycle with a single jump statement, it would be more accurate to benchmark based on instruction count because of the inaccuracies in benchmarking on an operating system. I have a lovely article on benchmarking in C++ if you're interested. github.com/kabuki-starship/kabuki-toolkit/wiki/… – Bots 19/8, 2018 at 17:3

Benchmark based on instruction count is only a crude approximation. Nowadays, processors are very complex, they are pipelined, have out of order execution, etc. Measuring correctly a routine is hard, needs experience. Those days are long gone, when we could simply add cycle counts to get the overall speed of a routine. Your benchmark article is a step into the right direction, because it measures time. – Gonophore 19/8, 2018 at 20:37

I'm sorry, but that is a newb thing to say. If all of the instructions are single-cycle instructions and one has 13 instructions and one has 15, you can clearly say one is 13.3% faster. And if you can eliminate another instruction then you can say it's 20% faster, or ((1-3/15)*100)% faster. – Bots 20/8, 2018 at 1:55

OK I updated the paper with the O2 optimized disassembly. None can dispute they are virtually identical except mine uses 2 fewer instructions, and one more if you use the mask rather than the pow_2. – Bots 20/8, 2018 at 2:53

few things:

don't change the pointer returned by the malloc/new: you'll need it later to free the memory;
make sure your buffer is big enough after adjusting the alignment
use size_t instead of unsigned long, since size_t guaranteed to have the same size as the pointer, as opposed to anything else:

here's the code:

size_t size = 1024; // this is how many bytes you need in the aligned buffer
size_t align = 16;  // this is the alignment boundary
char *p = (char*)malloc(size + align); // see second point above
char *aligned_p = (char*)((size_t)p + (align - (size_t)p % align));
// use the aligned_p here
// ...
// when you're done, call:
free(p); // see first point above

Nonconformist answered 12/6, 2011 at 5:46 Comment(3)

size_t isn't guaranteed to have the same size as a pointer. (I've used 16 bits size_t with 32 bits pointers in DOS time), unsigned long is in fact often a better choice that size_t for that (but isn't formally guaranteed either and won't work on windows 64 bit). uintptr_t has been introduced in C99 and C++0X for that purpose and is often available as extension before. C++0X even offer a standard way to align (see Luc's answer) – Mathias 12/6, 2011 at 6:47

well, x86 real mode can't be discussed in the same sentence with C++ 0x, can it? Regarding the real mode long pointers, the pointer arithmetic was rather non-trivial back then, so it had to be re-written anyway. On top of that, the size of the pointers returned by the allocation functions used to depend on the version of the CRT library used. Some programming environments adjusted size_t definition accordingly. – Nonconformist 12/6, 2011 at 7:14

Why not? The underling abstract machine of C++0X still gather for segmented access, as it still gather for strangest things like word addressable machine or one's complement arithmetic. – Mathias 12/6, 2011 at 16:18

Hot tags

Godot Unity Godot Help Programming Godot 4.X GUI GDScript 3D 2D Physics CSharp Godot 3.X VR XR Projects C++

Try this:

Warning:

Updated: New Faster Algorithm

Faster to Use mask rather than pow_2

Recommended topics

Hot tags