SIMD alignment issue with PPL Combinable

Asked 12/7, 2015 at 14:45 Answered 12/7, 2015 at 15:48

I'm trying to sum the elements of array in parallel with SIMD. To avoid locking I'm using combinable thread local which is not always aligned on 16 bytes because of that _mm_add_epi32 is throwing exception

concurrency::combinable<__m128i> sum_combine;

int length = 40; // multiple of 8
concurrency::parallel_for(0, length , 8, [&](int it)
{

    __m128i v1 = _mm_load_si128(reinterpret_cast<__m128i*>(input_arr + it));
    __m128i v2 = _mm_load_si128(reinterpret_cast<__m128i*>(input_arr + it + sizeof(uint32_t)));

    auto temp = _mm_add_epi32(v1, v2);

    auto &sum = sum_combine.local();   // here is the problem 


    TRACE(L"%d\n", it);
    TRACE(L"add %x\n", &sum);

    ASSERT(((unsigned long)&sum & 15) == 0);

    sum = _mm_add_epi32(temp, sum);
}
);

here is defination of combinable from ppl.h

template<typename _Ty>
class combinable
{
private:

// Disable warning C4324: structure was padded due to __declspec(align())
// This padding is expected and necessary.
#pragma warning(push)
#pragma warning(disable: 4324)
    __declspec(align(64))
    struct _Node
    {
        unsigned long _M_key;
        _Ty _M_value;                   // this might not be aligned on 16 bytes
        _Node* _M_chain;

        _Node(unsigned long _Key, _Ty _InitialValue)
            : _M_key(_Key), _M_value(_InitialValue), _M_chain(NULL)
        {
        }
    };

sometimes alignment is ok and the code works fine, but most of time its not working

I have tried to used the following, but this doesn't compile

union combine 
{
        unsigned short x[sizeof(__m128i) / sizeof(unsigned int)];
        __m128i y;
};

concurrency::combinable<combine> sum_combine;
then auto &sum = sum_combine.local().y;

Any suggestions for correcting the alignment issue, still using combinable.

On x64 it works fine bcause of default 16 bytes alignment. On x86 sometimes alignment problems exists.

Tella answered 12/7, 2015 at 14:45 Comment(6)

Have you tried using the #pragma pack(push, 1) and #pragma pack(pop) directives in your struct definition? The compiler can pad your structure with 0's unless you explicitly tell it not to. – Obregon 12/7, 2015 at 14:51

@mike on whcih struct should i use #pragma pack(push, 1) – Tella 12/7, 2015 at 14:57

Is an unaligned load OK? You may lose some performance particularly on older processors (eg Core2) – Capitally 12/7, 2015 at 15:7

@harold but input_arr is already aligned, how will it effect sum alignment, is there _mm_addu_epi32 – Tella 12/7, 2015 at 15:14

@PaulR i had just used 4 instead of sizeof(uint32_t) – Tella 12/7, 2015 at 15:27

OK - but sizeof(uint32_t) is completely inappropriate in this context - you're adding 4 uint32_t elements, not 4 bytes - it's just a coincidence that they have the same value. If anything it should be sizeof(__m128i) / sizeof(uint32_t). Anyway, this is not the immediate problem, but bear it in mind, as it will be confusing to anyone else reading the code (including yourself in 6 months' time!). – Repine 12/7, 2015 at 15:34

Just loaded sum using unaligned load

auto &sum = sum_combine.local();


#if !defined(_M_X64) 

if (((unsigned long)&sum & 15) != 0)
{
    // just for breakpoint means, sum  is unaligned.
    int a = 5;
}
auto sum_temp = _mm_loadu_si128(&sum);
sum = _mm_add_epi32(temp, sum_temp);

#else

sum = _mm_add_epi32(temp, sum);

#endif

Tella answered 12/7, 2015 at 15:48 Comment(0)

Since the sum variable being used with _mm_add_epi32 is not aligned you need to explicitly load/store sum using unaligned loads/stores (_mm_loadu_si128/_mm_storeu_si128). Change:

sum = _mm_add_epi32(temp, sum);

to:

__m128i v2 = _mm_loadu_si128((__m128i *)&sum);
v2 = _mm_add_epi32(v2, temp);
_mm_storeu_si128((__m128i *)&sum, v2);

Repine answered 12/7, 2015 at 15:16 Comment(3)

still same behavior 50 % times exception. i tried__m128i v2 using _mm_loadu_si128. but what if sum.sum_combine() is unaligned – Tella 12/7, 2015 at 15:25

Oh, right - it's the second _mm_add_epi32 that's crashing - in that case you need to use explicit unaligned loads. Let me update my answer... – Repine 12/7, 2015 at 15:27

@Pual R thanks for answer and suggestion. i did it a bit differently. Now its working auto &sum = sum_combine.local(); auto sum_temp = _mm_loadu_si128(&sum); if(((unsigned long)&sum & 15) != 0) { // just for breakpoint this is unaligned. int a = 5; } sum = _mm_add_epi32(temp, sum_temp ); – Tella 12/7, 2015 at 15:41

Recommended topics

Hot tags