Updated: New Faster Algorithm
Don't use modulo because it takes hundreds of clock cycles on x86 due to the nasty division and a lot more on other systems. I came up with a faster version of std::align than GCC and Visual-C++. Visual-C++ has the slowest implementation, which actually uses an amateurish conditional statement. GCC is very similar to my algorithm but I did the opposite of what they did but my algorithm is 13.3 % faster because it has 13 as opposed to 15 single-cycle instructions. See here is the research paper with dissassembly. The algorithm is actually one instruction faster if you use the mask instead of the pow_2.
/* Quickly aligns the given pointer to a power of two boundaries.
@return An aligned pointer of typename T.
@desc Algorithm is a 2's compliment trick that works by masking off
the desired number in 2's compliment and adding them to the
pointer. Please note how I took the horizontal comment whitespace back.
@param pointer The pointer to align.
@param mask Mask for the lower LSb, which is one less than the power of
2 you wish to align too. */
template <typename T = char>
inline T* AlignUp(void* pointer, uintptr_t mask) {
intptr_t value = reinterpret_cast<intptr_t>(pointer);
value += (-value) & mask;
return reinterpret_cast<T*>(value);
}
Here is how you call it:
enum { kSize = 256 };
char buffer[kSize + 16];
char* aligned_to_16_byte_boundary = AlignUp<> (buffer, 15); //< 16 - 1 = 15
char16_t* aligned_to_64_byte_boundary = AlignUp<char16_t> (buffer, 63);
Here is the quick bit-wise proof for 3 bits, it works the same for all bit counts:
~000 = 111 => 000 + 111 + 1 = 0x1000
~001 = 110 => 001 + 110 + 1 = 0x1000
~010 = 101 => 010 + 101 + 1 = 0x1000
~011 = 100 => 011 + 100 + 1 = 0x1000
~100 = 011 => 100 + 011 + 1 = 0x1000
~101 = 010 => 101 + 010 + 1 = 0x1000
~110 = 001 => 110 + 001 + 1 = 0x1000
~111 = 000 => 111 + 000 + 1 = 0x1000
Just in case you're here to learn how to align to a cache line an object in C++11, use the in-place constructor:
struct Foo { Foo () {} };
Foo* foo = new (AlignUp<Foo> (buffer, 63)) Foo ();
Here is the std::align implmentation, it uses 24 instructions where the GCC implementation uses 31 instructions, though it can be tweaked to eliminate a decrement instruction by turning (--align)
to the mask
for the Least Significant bits but that would not operate functionally identical to std::align.
inline void* align(size_t align, size_t size, void*& ptr,
size_t& space) noexcept {
intptr_t int_ptr = reinterpret_cast<intptr_t>(ptr),
offset = (-int_ptr) & (--align);
if ((space -= offset) < size) {
space += offset;
return nullptr;
}
return reinterpret_cast<void*>(int_ptr + offset);
}
Faster to Use mask rather than pow_2
Here is the code for aligning using a mask rather than the the pow_2 (which is the even power of 2). This is 20% fatert than the GCC algorithm but requires you to store the mask rather than the pow_2 so it's not interchangable.
inline void* AlignMask(size_t mask, size_t size, void*& ptr,
size_t& space) noexcept {
intptr_t int_ptr = reinterpret_cast<intptr_t>(ptr),
offset = (-int_ptr) & mask;
if ((space -= offset) < size) {
space += offset;
return nullptr;
}
return reinterpret_cast<void*>(int_ptr + offset);
}
p
it's going to cause you all kinds of problems. Furthermore, you no longer have a guarantee about how much memory you actually have (somewhere between 1009 and 1024) – Rinderpestfree
function), but which did not require outside memory -- it simply allocated a bit more than you requested, and freed that at the end. I'll post it here soon, if I find it. – Perinephrium_aligned_malloc
, which has a whole long explanation on how to align a pointer to an arbitrary value and retain the original pointer (needless to say it has the code too :P). – Exiguous