Optimal SIMD algorithm to rotate or transpose an array

Asked 19/11, 2014 at 9:32 Answered 20/11, 2014 at 10:21

Solved assembly intel simd transpose avx2

I am working on a data structure where I have an array of 16 uint64. They are laid out like this in memory (each below representing a single int64):

A0 A1 A2 A3 B0 B1 B2 B3 C0 C1 C2 C3 D0 D1 D2 D3

The desired result is to transpose the array into this:

A0 B0 C0 D0 A1 B1 C1 D1 A2 B2 C2 D2 A3 B3 C3 D3

The rotation of the array 90 degrees is also an acceptable solution for the future loop:

D0 C0 B0 A0 D1 C1 B1 A1 D2 C2 B2 A2 D3 C3 B3 A3

I need this in order to operate on the arrow fast at a later point (Traverse it sequentially with another SIMD trip, 4 at a time).

So far, I have tried to "blend" the data by loading up a 4 x 64 bit vector of A's, bitmaskising and shuffling the elements and OR'ing it with B's etc and then repeating that for C's... Unfortunately, this is 5 x 4 SIMD instructions per segment of 4 elements in the array (one load, one mask, one shuffle, one or with next element and finally a store). It seems I should be able to do better.

I have AVX2 available and I a compiling with clang.

Sissy answered 19/11, 2014 at 9:32 Comment(6)

C1 C1 is this a typo? Please show the correct output. – Boisterous 19/11, 2014 at 9:42

Sorry, typo...Yes, I want to transpose the matrix (rotate it 90 degrees) – Sissy 19/11, 2014 at 9:49

Let me see if I understand your question. You want to transpose a 4x4 matrix of uint64? – Kinnikinnick 19/11, 2014 at 9:50

@Zboson: Yes, that is another way of viewing it. – Sissy 19/11, 2014 at 9:50

Transpose is not the same as rotation by 90 degrees - based on your example it looks like you want the former rather than the latter. – Protestant 19/11, 2014 at 13:55

@PaulR: You are right. Actually, I can use either a 90 rotation or a transpose. I will update the question. – Sissy 19/11, 2014 at 14:45

uint64_t A[16] = {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15};
__m256i row0 = _mm256_loadu_si256((__m256i*)&A[ 0]); //0 1 2 3
__m256i row1 = _mm256_loadu_si256((__m256i*)&A[ 4]); //4 5 6 7
__m256i row2 = _mm256_loadu_si256((__m256i*)&A[ 8]); //8 9 a b
__m256i row3 = _mm256_loadu_si256((__m256i*)&A[12]); //c d e f

I don't have hardware to test this on right now but something like the following should do what you want

__m256i tmp3, tmp2, tmp1, tmp0;
tmp0 = _mm256_unpacklo_epi64(row0, row1);            //0 4 2 6
tmp1 = _mm256_unpackhi_epi64(row0, row1);            //1 5 3 7
tmp2 = _mm256_unpacklo_epi64(row2, row3);            //8 c a e
tmp3 = _mm256_unpackhi_epi64(row2, row3);            //9 d b f
//now select the appropriate 128-bit lanes
row0 = _mm256_permute2x128_si256(tmp0, tmp2, 0x20);  //0 4 8 c
row1 = _mm256_permute2x128_si256(tmp1, tmp3, 0x20);  //1 5 9 d
row2 = _mm256_permute2x128_si256(tmp0, tmp2, 0x31);  //2 6 a e
row3 = _mm256_permute2x128_si256(tmp1, tmp3, 0x31);  //3 7 b f

The

__m256i _mm256_permute2x128_si256 (__m256i a, __m256i b, const int imm)

intrinsic selects 128-bit lanes from two sources. You can read about it in the Intel Intrinsic Guide. There is a version _mm256_permute2f128_si256 which only needs AVX and acts in the floating point domain. I used this to check that I used the correct control words.

Kinnikinnick answered 19/11, 2014 at 10:28 Comment(2)

+1: nice transpose - I fixed a few minor bugs in the code and comments and it's now tested and verified on a Haswell CPU. – Protestant 19/11, 2014 at 14:30

@Zboson: This is an awesome solution. 8 instructions! I wonder if it can be done in less with a 90 degree rotate (which is also an acceptable layout of the destination array) – Sissy 19/11, 2014 at 15:25

An alternative is to use the gather instructions, you can load directly the transposed matrix. The five lines of code below are ok with gcc on a i7-Haswell.

  int32_t stride = 4 * sizeof(A[0]);
  __m128i r128_gather_idx = _mm_set_epi32(3 * stride, 2 * stride, 1 * stride, 0 * stride);
  __m256i row0 = _mm256_i32gather_epi64(reinterpret_cast<long long const *>(&A[ 0]), r128_gather_idx, 1);
  __m256i row1 = _mm256_i32gather_epi64(reinterpret_cast<long long const *>(&A[ 1]), r128_gather_idx, 1);
  __m256i row2 = _mm256_i32gather_epi64(reinterpret_cast<long long const *>(&A[ 2]), r128_gather_idx, 1);

Celestial answered 20/11, 2014 at 10:21 Comment(3)

On Haswell, gather provides functionality, but not much performance (this may change on future µarches, of course). Basically, anytime you can do the same operation with fixed permutes, you should do so. – Nutlet 20/11, 2014 at 16:38

I saw about a 2x slowdown on the gather vs. the fixed permute. So @Zbosons answer is the fastest. Nice to have this for completeness though. – Sissy 20/11, 2014 at 17:21

I did not think of using gather. That's an interesting idea. – Kinnikinnick 21/11, 2014 at 8:40

Recommended topics

Hot tags