I'm trying to sum the elements of array in parallel with SIMD. To avoid locking I'm using combinable thread local which is not always aligned on 16 bytes because of that _mm_add_epi32 is throwing exception
concurrency::combinable<__m128i> sum_combine;
int length = 40; // multiple of 8
concurrency::parallel_for(0, length , 8, [&](int it)
{
__m128i v1 = _mm_load_si128(reinterpret_cast<__m128i*>(input_arr + it));
__m128i v2 = _mm_load_si128(reinterpret_cast<__m128i*>(input_arr + it + sizeof(uint32_t)));
auto temp = _mm_add_epi32(v1, v2);
auto &sum = sum_combine.local(); // here is the problem
TRACE(L"%d\n", it);
TRACE(L"add %x\n", &sum);
ASSERT(((unsigned long)&sum & 15) == 0);
sum = _mm_add_epi32(temp, sum);
}
);
here is defination of combinable from ppl.h
template<typename _Ty>
class combinable
{
private:
// Disable warning C4324: structure was padded due to __declspec(align())
// This padding is expected and necessary.
#pragma warning(push)
#pragma warning(disable: 4324)
__declspec(align(64))
struct _Node
{
unsigned long _M_key;
_Ty _M_value; // this might not be aligned on 16 bytes
_Node* _M_chain;
_Node(unsigned long _Key, _Ty _InitialValue)
: _M_key(_Key), _M_value(_InitialValue), _M_chain(NULL)
{
}
};
sometimes alignment is ok and the code works fine, but most of time its not working
I have tried to used the following, but this doesn't compile
union combine
{
unsigned short x[sizeof(__m128i) / sizeof(unsigned int)];
__m128i y;
};
concurrency::combinable<combine> sum_combine;
then auto &sum = sum_combine.local().y;
Any suggestions for correcting the alignment issue, still using combinable.
On x64 it works fine bcause of default 16 bytes alignment. On x86 sometimes alignment problems exists.
#pragma pack(push, 1)
and#pragma pack(pop)
directives in your struct definition? The compiler can pad your structure with 0's unless you explicitly tell it not to. – Obregonsizeof(uint32_t)
is completely inappropriate in this context - you're adding 4uint32_t
elements, not 4 bytes - it's just a coincidence that they have the same value. If anything it should besizeof(__m128i) / sizeof(uint32_t)
. Anyway, this is not the immediate problem, but bear it in mind, as it will be confusing to anyone else reading the code (including yourself in 6 months' time!). – Repine