I am new to optimizing code with SSE/SSE2 instructions and so far I have not gotten very far. To my knowledge, a common SSE-optimized function would look like this:
void sse_func(const float* const ptr, int len){
    if( ptr is aligned )
    {
        for( ... ){
            // unroll loop by 4 or 2 elements
        }
        for( ... ){
            // handle the rest
            // (non-optimized code)
        }
    } else {
        for( ... ){
            // regular C code to handle non-aligned memory
        }
    }
}
However, how do I correctly determine whether the memory that ptr points to is aligned to, e.g., 16 bytes? I think I have to include the regular C code path for non-aligned memory, as I cannot guarantee that every buffer passed to this function will be aligned. And using the intrinsics to load data from unaligned memory into the SSE registers seems to be horribly slow (even slower than regular C code).
Thank you in advance...
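For reference, the "ptr is aligned" test in the skeleton above is usually written by converting the pointer to an integer and masking the low bits. The following is only a minimal sketch: the name is_aligned is hypothetical, and it assumes the pointer-to-uintptr_t conversion preserves the low address bits, which the C standard leaves implementation-defined but which holds on mainstream x86 compilers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: returns 1 if ptr is aligned to `alignment` bytes,
   0 otherwise. `alignment` must be a power of two, e.g. 16 for aligned
   SSE loads such as _mm_load_ps. */
static int is_aligned(const void *ptr, size_t alignment)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    /* Assumption: the integer value of the pointer reflects its address. */
    return ((uintptr_t)ptr & (alignment - 1)) == 0;
}
```

With such a helper, the branch in sse_func would read if( is_aligned(ptr, 16) ).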
a[i] = foo(b[i])), do a potentially-unaligned first vector, then the main loop starting at the first alignment boundary after the first vector, then a final vector that ends at the last element. If the array was in fact misaligned and/or the count wasn't a multiple of the vector width, then some of those vectors will overlap, but that still beats scalar. – Orsa
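The overlapping-vector scheme described in this comment could be sketched roughly as follows. This is an illustration under stated assumptions, not the commenter's code: map_overlapping and foo4 are hypothetical names, foo is taken to be "double the element", the arrays hold at least 4 floats, dst and src do not overlap each other, and both pointers are naturally (4-byte) aligned for float. Only the destination is brought to a 16-byte boundary; the source is always read with unaligned loads.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <xmmintrin.h>

/* Hypothetical foo(): doubles each of the four lanes. */
static __m128 foo4(__m128 v) { return _mm_add_ps(v, v); }

/* Computes dst[i] = foo(src[i]) for i in [0, len).
   Assumptions: len >= 4, dst and src do not overlap each other. */
static void map_overlapping(float *dst, const float *src, size_t len)
{
    assert(len >= 4);

    /* 1. First vector, possibly unaligned. */
    _mm_storeu_ps(dst, foo4(_mm_loadu_ps(src)));

    /* 2. Main loop, starting at the first 16-byte boundary of dst that
          lies after its start (1 to 4 elements in). Stores are aligned. */
    size_t i = (16 - ((uintptr_t)dst & 15)) / sizeof(float);
    for (; i + 4 <= len; i += 4)
        _mm_store_ps(dst + i, foo4(_mm_loadu_ps(src + i)));

    /* 3. Final vector ending exactly at the last element. It may overlap
          elements already written, which recomputes the same values. */
    _mm_storeu_ps(dst + len - 4, foo4(_mm_loadu_ps(src + len - 4)));
}
```

The overlap is harmless here only because the operation is out of place: recomputing an element from src gives the same result. For an in-place update (dst == src) the overlapped elements would be transformed twice, so this sketch would not apply as written.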