How to vectorize a distance calculation using SSE2

C

2

4

A and B are vectors or length N, where N could be in the range 20 to 200 say. I want to calculate the square of the distance between these vectors, i.e. d^2 = ||A-B||^2.

So far I have:

float* a = ...;
float* b = ...;
float d2 = 0;

for(int k = 0; k < N; ++k)
{
    float d = a[k] - b[k];
    d2 += d * d;
}

That seems to work fine, except that I have profiled my code and this is the bottleneck (more than 50% of time is spent just doing this). I am using Visual Studio 2012, on Win 7, with these optimization options: /O2 /Oi /Ot /Oy-. My understanding is that VS2012 should auto-vectorize that loop (using SSE2). However if I insert #pragma loop(no_vector) in the code I don't get a noticable slow down, so I guess the loop is not being vectorized. The compiler confirms that with this message:

  info C5002: loop not vectorized due to reason '1105'

My questions are:

Is it possible to fix this code so that VS2012 can vectorize it?
If not, would it make sense to try to vectorize the code myself?
Can you recommend a web site for me to learn about SSE2 coding?
Is there some value of N below which vectorization would be counter productive?
What is reason '1105'?

Contraposition answered 8/6, 2013 at 14:25 Comment(0)

C

4

From the MSDN documentation, the 1105 error code means the compiler is not able to figure out how to reduce the code to vectorized instructions. For floating point operations it is indicated that you need to specify the /fp:fast option to enable any floating point reductions at all.

Cryptozoic answered 9/6, 2013 at 1:16 Comment(3)

+1 for the the /fp:fast option - it does cause the code to be vectorzed. Thanks! However your code (even with /fp:fast) actually takes twice as long as my original code (without /fp:fast). The new[]/delete[] is a big hit. Replacing the heap allocated dist by a stack allocated buffer cuts out half of the time to bring me excution time back to where I started. Looping over the data twice is also a big hit. It turns out the best thing to do was keep my original code and use /fp:fast for a 2x speed up. – Contraposition 9/6, 2013 at 5:25

Also thanks for the page on 1105 error code in MSDN, not sure why I couldn't find it (I had actually been to that page!). If you remove your code I will accept this answer. My original code with /fp:fast is my preferred solution. – Contraposition 9/6, 2013 at 5:45

@user2151446 I've removed the code, though in my test case the code from your example did not vectorize, hence the added complexity. I'll be putting together a question about that tomorrow. – Cryptozoic 9/6, 2013 at 7:19

A

6

It's pretty straightforward to implement this using SSE intrinsics:

#include "pmmintrin.h"

__m128 vd2 = _mm_set1_ps(0.0f);
float d2 = 0.0f;
int k;

// process 4 elements per iteration
for (k = 0; k < N - 3; k += 4)
{
    __m128 va = _mm_loadu_ps(&a[k]);
    __m128 vb = _mm_loadu_ps(&b[k]);
    __m128 vd = _mm_sub_ps(va, vb);
    vd = _mm_mul_ps(vd, vd);
    vd2 = _mm_add_ps(vd2, vd);
}

// horizontal sum of 4 partial dot products
vd2 = _mm_hadd_ps(vd2, vd2);
vd2 = _mm_hadd_ps(vd2, vd2);
_mm_store_ss(&d2, vd2);

// clean up any remaining elements
for ( ; k < N; ++k)
{
    float d = a[k] - b[k];
    d2 += d * d;
}

Note that if you can guarantee that a and b are 16 byte aligned then you can use _mm_load_ps rather than _mm_loadu_ps which may help performance, particularly on older (pre Nehalem) CPUs.

Note also that for loops such as this where the are very few arithmetic instructions relative to the number of loads then performance may well be limited by memory bandwidth and the expected speed-up from vectorization may not be realised in practice.

Antibiosis answered 8/6, 2013 at 16:7 Comment(7)

Good answer! Like I would have done it myself. However I would suggest to @user2151446 that you make sure you allocate a and b with alignment of 16 bytes. In that way you could use _mm_load_ps instead of _mm_loadu_ps. That will optimize the code even further. – Lepidote 8/6, 2013 at 22:22

Thanks. This worked well, although only as fast as my original code with the /fp:fast option. If I specify /fp:fast, it is important to put #pragma loop(no_vector) on the second loop to stop VS trying to vectorize it and losing half the gain from the vectorization of the first loop. – Contraposition 9/6, 2013 at 5:28

@oysteijo: thanks - I've now added a note about aligned loads. – Antibiosis 9/6, 2013 at 7:13

@user2151446: you may not see much benefit from SSE if your vectors are large and you become limited by memory bandwidth - I've added a note about this in the answer now. – Antibiosis 9/6, 2013 at 7:15

I like your use of (N - 3). I usually use N & -4 but N - 3 works just as well. – Halm 10/6, 2013 at 11:13

@raxman: actually I'm torn between using for (i = 0; i < N - 3; i += 4) and for (i = 0; i <= N - 4; i += 4) - the first one fits better with the general for loop idiom, while the second makes it easier to adapt to different vector sizes (e.g. change both occurrences of 4 to 8). – Antibiosis 10/6, 2013 at 11:28

I usually use #define ROUND_UP(x, s) (((x)+((s)-1)) & -(s)) #define ROUND_DOWN(x, s) ((x) & -(s)). I use ROUND_UP to allocate arrays so I don't have to worry about going out of bounds and I use ROUND_DOWN when I need to worry about going out of bounds. But your method I think works perfectly fine. I mean you would just do (i = 0; i <= N - 7; i += 8) for an 8-vector. – Halm 10/6, 2013 at 11:36

C

4

From the MSDN documentation, the 1105 error code means the compiler is not able to figure out how to reduce the code to vectorized instructions. For floating point operations it is indicated that you need to specify the /fp:fast option to enable any floating point reductions at all.

Cryptozoic answered 9/6, 2013 at 1:16 Comment(3)

+1 for the the /fp:fast option - it does cause the code to be vectorzed. Thanks! However your code (even with /fp:fast) actually takes twice as long as my original code (without /fp:fast). The new[]/delete[] is a big hit. Replacing the heap allocated dist by a stack allocated buffer cuts out half of the time to bring me excution time back to where I started. Looping over the data twice is also a big hit. It turns out the best thing to do was keep my original code and use /fp:fast for a 2x speed up. – Contraposition 9/6, 2013 at 5:25

Also thanks for the page on 1105 error code in MSDN, not sure why I couldn't find it (I had actually been to that page!). If you remove your code I will accept this answer. My original code with /fp:fast is my preferred solution. – Contraposition 9/6, 2013 at 5:45

@user2151446 I've removed the code, though in my test case the code from your example did not vectorize, hence the added complexity. I'll be putting together a question about that tomorrow. – Cryptozoic 9/6, 2013 at 7:19

Recommended topics

Hot tags