Transpose an 8x8 float matrix using AVX/AVX2
Transposing an 8x8 matrix can be done by treating it as four 4x4 blocks, transposing each block, and swapping the two off-diagonal blocks. That is not what I'm going for.

In another question, one answer gave a solution that would only require 24 instructions for an 8x8 matrix. However, this does not apply to floats.

Since AVX2 has 256-bit registers, each register can hold eight 32-bit floats. But the question is:

How can an 8x8 float matrix be transposed using AVX/AVX2 with the fewest instructions possible?

Laager answered 2/9, 2014 at 11:51 Comment(2)
Hi David. I think I gave you the exact answer to your question. What more do you want?Poulos
@Zboson: I've been really busy the past days, but just marked your answer as accepted. I greatly appreciate it!Laager

I already answered this question in Fast memory transpose with SSE, AVX, and OpenMP.

Let me repeat the solution for transposing an 8x8 float matrix with AVX. Let me know if this is any faster than using 4x4 blocks and _MM_TRANSPOSE4_PS. I used it as a kernel in a larger matrix transpose that was memory bound, so that was probably not a fair test.

inline void transpose8_ps(__m256 &row0, __m256 &row1, __m256 &row2, __m256 &row3, __m256 &row4, __m256 &row5, __m256 &row6, __m256 &row7) {
__m256 __t0, __t1, __t2, __t3, __t4, __t5, __t6, __t7;
__m256 __tt0, __tt1, __tt2, __tt3, __tt4, __tt5, __tt6, __tt7;
__t0 = _mm256_unpacklo_ps(row0, row1);
__t1 = _mm256_unpackhi_ps(row0, row1);
__t2 = _mm256_unpacklo_ps(row2, row3);
__t3 = _mm256_unpackhi_ps(row2, row3);
__t4 = _mm256_unpacklo_ps(row4, row5);
__t5 = _mm256_unpackhi_ps(row4, row5);
__t6 = _mm256_unpacklo_ps(row6, row7);
__t7 = _mm256_unpackhi_ps(row6, row7);
__tt0 = _mm256_shuffle_ps(__t0,__t2,_MM_SHUFFLE(1,0,1,0));
__tt1 = _mm256_shuffle_ps(__t0,__t2,_MM_SHUFFLE(3,2,3,2));
__tt2 = _mm256_shuffle_ps(__t1,__t3,_MM_SHUFFLE(1,0,1,0));
__tt3 = _mm256_shuffle_ps(__t1,__t3,_MM_SHUFFLE(3,2,3,2));
__tt4 = _mm256_shuffle_ps(__t4,__t6,_MM_SHUFFLE(1,0,1,0));
__tt5 = _mm256_shuffle_ps(__t4,__t6,_MM_SHUFFLE(3,2,3,2));
__tt6 = _mm256_shuffle_ps(__t5,__t7,_MM_SHUFFLE(1,0,1,0));
__tt7 = _mm256_shuffle_ps(__t5,__t7,_MM_SHUFFLE(3,2,3,2));
row0 = _mm256_permute2f128_ps(__tt0, __tt4, 0x20);
row1 = _mm256_permute2f128_ps(__tt1, __tt5, 0x20);
row2 = _mm256_permute2f128_ps(__tt2, __tt6, 0x20);
row3 = _mm256_permute2f128_ps(__tt3, __tt7, 0x20);
row4 = _mm256_permute2f128_ps(__tt0, __tt4, 0x31);
row5 = _mm256_permute2f128_ps(__tt1, __tt5, 0x31);
row6 = _mm256_permute2f128_ps(__tt2, __tt6, 0x31);
row7 = _mm256_permute2f128_ps(__tt3, __tt7, 0x31);
}
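
For completeness, here is a minimal usage sketch (the wrapper name and the assumption of a 32-byte-aligned, row-major float[64] are just for illustration, not part of the original answer): load the eight rows, transpose them in registers, and store them back.

#include <immintrin.h>

void transpose8x8_inplace(float *A) {  // A: 64 floats, 32-byte aligned, row-major
  __m256 r0 = _mm256_load_ps(&A[0*8]);
  __m256 r1 = _mm256_load_ps(&A[1*8]);
  __m256 r2 = _mm256_load_ps(&A[2*8]);
  __m256 r3 = _mm256_load_ps(&A[3*8]);
  __m256 r4 = _mm256_load_ps(&A[4*8]);
  __m256 r5 = _mm256_load_ps(&A[5*8]);
  __m256 r6 = _mm256_load_ps(&A[6*8]);
  __m256 r7 = _mm256_load_ps(&A[7*8]);
  transpose8_ps(r0, r1, r2, r3, r4, r5, r6, r7);  // transpose entirely in registers
  _mm256_store_ps(&A[0*8], r0);
  _mm256_store_ps(&A[1*8], r1);
  _mm256_store_ps(&A[2*8], r2);
  _mm256_store_ps(&A[3*8], r3);
  _mm256_store_ps(&A[4*8], r4);
  _mm256_store_ps(&A[5*8], r5);
  _mm256_store_ps(&A[6*8], r6);
  _mm256_store_ps(&A[7*8], r7);
}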

Based on this comment I learned that there are more efficient methods for doing the 8x8 transpose. See Examples 11-19 and 11-20 in the Intel optimization manual under section "11.11 Handling Port 5 Pressure". Example 11-19 uses the same number of instructions but reduces the pressure on port 5 by using blends, which can also run on port 0. I may implement this with intrinsics at some point, but I don't have a need for it right now.


I looked more carefully into Examples 11-19 and 11-20 in the Intel manual I mentioned above. It turns out that Example 11-19 uses 4 more shuffle operations than necessary. It has 8 unpacks, 12 shuffles, and 8 128-bit permutes; my method uses 4 fewer shuffles. They replace 8 of the shuffles with blends, ending up with 4 shuffles and 8 blends. I doubt that's better than my method with only eight shuffles.

Example 11-20 is, however, an improvement if you need to load the matrix from memory. This uses 8 unpacks, 8 inserts, 8 shuffles, 8 128-bit loads, and 8 stores. The 128-bit loads reduce the port pressure. I went ahead and implemented this using intrinsics.

//Example 11-20. 8x8 Matrix Transpose Using VINSERTF128 loads
void tran(float* mat, float* matT) {
  __m256  r0, r1, r2, r3, r4, r5, r6, r7;
  __m256  t0, t1, t2, t3, t4, t5, t6, t7;

  r0 = _mm256_insertf128_ps(_mm256_castps128_ps256(_mm_load_ps(&mat[0*8+0])), _mm_load_ps(&mat[4*8+0]), 1);
  r1 = _mm256_insertf128_ps(_mm256_castps128_ps256(_mm_load_ps(&mat[1*8+0])), _mm_load_ps(&mat[5*8+0]), 1);
  r2 = _mm256_insertf128_ps(_mm256_castps128_ps256(_mm_load_ps(&mat[2*8+0])), _mm_load_ps(&mat[6*8+0]), 1);
  r3 = _mm256_insertf128_ps(_mm256_castps128_ps256(_mm_load_ps(&mat[3*8+0])), _mm_load_ps(&mat[7*8+0]), 1);
  r4 = _mm256_insertf128_ps(_mm256_castps128_ps256(_mm_load_ps(&mat[0*8+4])), _mm_load_ps(&mat[4*8+4]), 1);
  r5 = _mm256_insertf128_ps(_mm256_castps128_ps256(_mm_load_ps(&mat[1*8+4])), _mm_load_ps(&mat[5*8+4]), 1);
  r6 = _mm256_insertf128_ps(_mm256_castps128_ps256(_mm_load_ps(&mat[2*8+4])), _mm_load_ps(&mat[6*8+4]), 1);
  r7 = _mm256_insertf128_ps(_mm256_castps128_ps256(_mm_load_ps(&mat[3*8+4])), _mm_load_ps(&mat[7*8+4]), 1);

  t0 = _mm256_unpacklo_ps(r0,r1);
  t1 = _mm256_unpackhi_ps(r0,r1);
  t2 = _mm256_unpacklo_ps(r2,r3);
  t3 = _mm256_unpackhi_ps(r2,r3);
  t4 = _mm256_unpacklo_ps(r4,r5);
  t5 = _mm256_unpackhi_ps(r4,r5);
  t6 = _mm256_unpacklo_ps(r6,r7);
  t7 = _mm256_unpackhi_ps(r6,r7);

  r0 = _mm256_shuffle_ps(t0,t2, 0x44);
  r1 = _mm256_shuffle_ps(t0,t2, 0xEE);
  r2 = _mm256_shuffle_ps(t1,t3, 0x44);
  r3 = _mm256_shuffle_ps(t1,t3, 0xEE);
  r4 = _mm256_shuffle_ps(t4,t6, 0x44);
  r5 = _mm256_shuffle_ps(t4,t6, 0xEE);
  r6 = _mm256_shuffle_ps(t5,t7, 0x44);
  r7 = _mm256_shuffle_ps(t5,t7, 0xEE);

  _mm256_store_ps(&matT[0*8], r0);
  _mm256_store_ps(&matT[1*8], r1);
  _mm256_store_ps(&matT[2*8], r2);
  _mm256_store_ps(&matT[3*8], r3);
  _mm256_store_ps(&matT[4*8], r4);
  _mm256_store_ps(&matT[5*8], r5);
  _mm256_store_ps(&matT[6*8], r6);
  _mm256_store_ps(&matT[7*8], r7);
}

So I looked into Example 11-19 again. The basic idea, as far as I can tell, is that two shuffle instructions (shufps) can be replaced by one shuffle and two blends. For example,

r0 = _mm256_shuffle_ps(t0,t2, 0x44);
r1 = _mm256_shuffle_ps(t0,t2, 0xEE);

can be replaced with

v = _mm256_shuffle_ps(t0,t2, 0x4E);
r0 = _mm256_blend_ps(t0, v, 0xCC);
r1 = _mm256_blend_ps(t2, v, 0x33);

This explains why my original code uses 8 shuffles while Example 11-19 uses 4 shuffles and 8 blends.

The blends are good for throughput because shuffles only go to one port (creating a bottleneck on the shuffle port), but blends can run on multiple ports and thus don't compete. But what is better: 8 shuffles or 4 shuffles and 8 blends?

This has to be tested, and can depend on surrounding code. If you mostly bottleneck on total uop throughput with a lot of other uops in the loop that don't need port 5, you might go for the pure shuffle version. Ideally you should do some computation on the transposed data before storing it, while it's already in registers. See https://agner.org/optimize/ and other performance links in the x86 tag wiki.

I don't, however, see a way to replace the unpack instructions with blends.

Here is the full code, combining Example 11-19 (converting 2 shuffles into 1 shuffle and two blends) with Example 11-20 (using vinsertf128 loads). On Intel Haswell/Skylake CPUs, vinsertf128 with a memory source is 2 uops: one ALU uop that can run on any port, plus one load uop; unfortunately they don't micro-fuse. The all-register form of vinsertf128 is 1 uop for the shuffle port on Intel, so it's good that the compiler folds the 128-bit load into a memory operand for vinsertf128, keeping that work off port 5. This also has the advantage of only needing the source data 16-byte aligned for maximum performance, avoiding any cache-line splits.

#include <stdio.h>
#include <x86intrin.h>
#include <omp.h>   

/*
void tran(float* mat, float* matT) {
  __m256  r0, r1, r2, r3, r4, r5, r6, r7;
  __m256  t0, t1, t2, t3, t4, t5, t6, t7;

  r0 = _mm256_load_ps(&mat[0*8]);
  r1 = _mm256_load_ps(&mat[1*8]);
  r2 = _mm256_load_ps(&mat[2*8]);
  r3 = _mm256_load_ps(&mat[3*8]);
  r4 = _mm256_load_ps(&mat[4*8]);
  r5 = _mm256_load_ps(&mat[5*8]);
  r6 = _mm256_load_ps(&mat[6*8]);
  r7 = _mm256_load_ps(&mat[7*8]);

  t0 = _mm256_unpacklo_ps(r0, r1);
  t1 = _mm256_unpackhi_ps(r0, r1);
  t2 = _mm256_unpacklo_ps(r2, r3);
  t3 = _mm256_unpackhi_ps(r2, r3);
  t4 = _mm256_unpacklo_ps(r4, r5);
  t5 = _mm256_unpackhi_ps(r4, r5);
  t6 = _mm256_unpacklo_ps(r6, r7);
  t7 = _mm256_unpackhi_ps(r6, r7);

  r0 = _mm256_shuffle_ps(t0,t2,_MM_SHUFFLE(1,0,1,0));  
  r1 = _mm256_shuffle_ps(t0,t2,_MM_SHUFFLE(3,2,3,2));
  r2 = _mm256_shuffle_ps(t1,t3,_MM_SHUFFLE(1,0,1,0));
  r3 = _mm256_shuffle_ps(t1,t3,_MM_SHUFFLE(3,2,3,2));
  r4 = _mm256_shuffle_ps(t4,t6,_MM_SHUFFLE(1,0,1,0));
  r5 = _mm256_shuffle_ps(t4,t6,_MM_SHUFFLE(3,2,3,2));
  r6 = _mm256_shuffle_ps(t5,t7,_MM_SHUFFLE(1,0,1,0));
  r7 = _mm256_shuffle_ps(t5,t7,_MM_SHUFFLE(3,2,3,2));

  t0 = _mm256_permute2f128_ps(r0, r4, 0x20);
  t1 = _mm256_permute2f128_ps(r1, r5, 0x20);
  t2 = _mm256_permute2f128_ps(r2, r6, 0x20);
  t3 = _mm256_permute2f128_ps(r3, r7, 0x20);
  t4 = _mm256_permute2f128_ps(r0, r4, 0x31);
  t5 = _mm256_permute2f128_ps(r1, r5, 0x31);
  t6 = _mm256_permute2f128_ps(r2, r6, 0x31);
  t7 = _mm256_permute2f128_ps(r3, r7, 0x31);

  _mm256_store_ps(&matT[0*8], t0);
  _mm256_store_ps(&matT[1*8], t1);
  _mm256_store_ps(&matT[2*8], t2);
  _mm256_store_ps(&matT[3*8], t3);
  _mm256_store_ps(&matT[4*8], t4);
  _mm256_store_ps(&matT[5*8], t5);
  _mm256_store_ps(&matT[6*8], t6);
  _mm256_store_ps(&matT[7*8], t7);
}
*/

void tran(float* mat, float* matT) {
  __m256  r0, r1, r2, r3, r4, r5, r6, r7;
  __m256  t0, t1, t2, t3, t4, t5, t6, t7;

  r0 = _mm256_insertf128_ps(_mm256_castps128_ps256(_mm_load_ps(&mat[0*8+0])), _mm_load_ps(&mat[4*8+0]), 1);
  r1 = _mm256_insertf128_ps(_mm256_castps128_ps256(_mm_load_ps(&mat[1*8+0])), _mm_load_ps(&mat[5*8+0]), 1);
  r2 = _mm256_insertf128_ps(_mm256_castps128_ps256(_mm_load_ps(&mat[2*8+0])), _mm_load_ps(&mat[6*8+0]), 1);
  r3 = _mm256_insertf128_ps(_mm256_castps128_ps256(_mm_load_ps(&mat[3*8+0])), _mm_load_ps(&mat[7*8+0]), 1);
  r4 = _mm256_insertf128_ps(_mm256_castps128_ps256(_mm_load_ps(&mat[0*8+4])), _mm_load_ps(&mat[4*8+4]), 1);
  r5 = _mm256_insertf128_ps(_mm256_castps128_ps256(_mm_load_ps(&mat[1*8+4])), _mm_load_ps(&mat[5*8+4]), 1);
  r6 = _mm256_insertf128_ps(_mm256_castps128_ps256(_mm_load_ps(&mat[2*8+4])), _mm_load_ps(&mat[6*8+4]), 1);
  r7 = _mm256_insertf128_ps(_mm256_castps128_ps256(_mm_load_ps(&mat[3*8+4])), _mm_load_ps(&mat[7*8+4]), 1);

  t0 = _mm256_unpacklo_ps(r0,r1);
  t1 = _mm256_unpackhi_ps(r0,r1);
  t2 = _mm256_unpacklo_ps(r2,r3);
  t3 = _mm256_unpackhi_ps(r2,r3);
  t4 = _mm256_unpacklo_ps(r4,r5);
  t5 = _mm256_unpackhi_ps(r4,r5);
  t6 = _mm256_unpacklo_ps(r6,r7);
  t7 = _mm256_unpackhi_ps(r6,r7);

  __m256 v;

  //r0 = _mm256_shuffle_ps(t0,t2, 0x44);
  //r1 = _mm256_shuffle_ps(t0,t2, 0xEE);  
  v = _mm256_shuffle_ps(t0,t2, 0x4E);
  r0 = _mm256_blend_ps(t0, v, 0xCC);
  r1 = _mm256_blend_ps(t2, v, 0x33);

  //r2 = _mm256_shuffle_ps(t1,t3, 0x44);
  //r3 = _mm256_shuffle_ps(t1,t3, 0xEE);
  v = _mm256_shuffle_ps(t1,t3, 0x4E);
  r2 = _mm256_blend_ps(t1, v, 0xCC);
  r3 = _mm256_blend_ps(t3, v, 0x33);

  //r4 = _mm256_shuffle_ps(t4,t6, 0x44);
  //r5 = _mm256_shuffle_ps(t4,t6, 0xEE);
  v = _mm256_shuffle_ps(t4,t6, 0x4E);
  r4 = _mm256_blend_ps(t4, v, 0xCC);
  r5 = _mm256_blend_ps(t6, v, 0x33);

  //r6 = _mm256_shuffle_ps(t5,t7, 0x44);
  //r7 = _mm256_shuffle_ps(t5,t7, 0xEE);
  v = _mm256_shuffle_ps(t5,t7, 0x4E);
  r6 = _mm256_blend_ps(t5, v, 0xCC);
  r7 = _mm256_blend_ps(t7, v, 0x33);

  _mm256_store_ps(&matT[0*8], r0);
  _mm256_store_ps(&matT[1*8], r1);
  _mm256_store_ps(&matT[2*8], r2);
  _mm256_store_ps(&matT[3*8], r3);
  _mm256_store_ps(&matT[4*8], r4);
  _mm256_store_ps(&matT[5*8], r5);
  _mm256_store_ps(&matT[6*8], r6);
  _mm256_store_ps(&matT[7*8], r7);
}


int verify(float *mat) {
    int i,j;
    int error = 0;
    for(i=0; i<8; i++) {
      for(j=0; j<8; j++) {
        if(mat[j*8+i] != 1.0f*i*8+j) error++;
      }
    }
    return error;
}

void print_mat(float *mat) {
    int i,j;
    for(i=0; i<8; i++) {
      for(j=0; j<8; j++) printf("%2.0f ", mat[i*8+j]);
      puts("");
    }
    puts("");
}

int main(void) {
    int i,j, rep;
    float mat[64] __attribute__((aligned(64)));
    float matT[64] __attribute__((aligned(64)));
    double dtime;

    rep = 10000000;
    for(i=0; i<64; i++) mat[i] = i;
    print_mat(mat);

    tran(mat,matT);
    //dtime = -omp_get_wtime();
    //tran(mat, matT, rep);
    //dtime += omp_get_wtime();
    printf("errors %d\n", verify(matT));
    //printf("dtime %f\n", dtime);
    print_mat(matT);
}
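
For reference, a compile-and-run line that should work for the full test program above (the flags are my guess, not from the original; -mavx is required for the intrinsics and -fopenmp only matters if you re-enable the omp_get_wtime() timing), assuming the file is saved as transpose8x8.c:

gcc -Wall -O3 -mavx -fopenmp transpose8x8.c && ./a.out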
Poulos answered 2/9, 2014 at 15:54 Comment(21)
I have one of these for doubles on AVX512. I think it might be possible to prove that an 8x8 transpose can't be done in fewer than 24 operations if they are limited to 2 input operands.Cigar
@Mysticial, woah, how are you using AVX512? Xeon Phi? Emulator? I'm envious. For _MM_TRANSPOSE4_PS MSVC uses _mm_shuffle_ps and GCC/ICC use _mm_unpacklo_ps/_mm_unpackhi_ps/_mm_movelh_ps/_mm_movehl_ps. Do you know why this is?Poulos
Compile-time emulation. For testing, I have a set of headers that imitate a subset of the AVX512f intrinsics using only AVX and FMA3 so I can run it on both my Intel and AMD machines. (performance-wise, it's reasonably fast to run on a Haswell - only about 30% slower than a native AVX2 implementation) I can also compile to Knights Landing AVX512f. While I can't run it, I can at least look at the assembly to see if everything looks sane.Cigar
Intel’s SDE also provides emulation to support development of AVX-512 software.Ventris
@StephenCanon Yeah, I'm aware of that. I just never bothered to try it since what I have now fits well into my Visual Studio workflow. I did the same thing with AVX before Sandy Bridge hit the market.Cigar
As this is the answer using the smallest amount of operations, I'm going to mark it as accepted. Thanks a lot guys!Laager
Unpacks are a type of shuffle (at least in the sense that they run on the shuffle port). When you compare your code to Intel's examples, it would probably be better to count unpacks as shuffles, since they probably both bottleneck on shuffle-port throughput.Vasty
@PeterCordes, yeah I assumed unpacks were like shuffles. I counted instructions and Intel uses the same number of unpacks as I do. My implementation of their Example 11-20 using intrinsics uses exactly the same instructions as they do. But their Example 11-19 uses 4 more shuffles than mine but the same number of unpacks. They convert 8 of their shuffles to blend but still have four shuffles. So mine is eight shuffles and theirs is 4 shuffles and 8 blends. What do you think is better? From a throughput point of view it is equal 1*8 = 1*4+ 0.5*8.Poulos
@PeterCordes, I have tried to come up with ways to convert unpacks or shuffles to blends in 4x4, 8x8, and 16x16 transposes but so far it always requires extra instructions to use blends. I guess on KNL it might be okay, as shuffles/unpacks have a reciprocal throughput of 2 and a blend 0.5, so replacing e.g. 4 shuffles with 8 blends could make sense. Let me look into that.Poulos
You mean yours is 16 (8+8) shuffles, and theirs is 12 shuffles + 8 blends, because unpacks are a kind of shuffle. Writing this way makes it clearer how much shuffle-port pressure there is. on Intel HSW / SKL, immediate-blends don't compete much with shuffles, so for throughput of just this operation, Intel's version could win. If you're doing FP math on the transposed matrices, that's mostly port 0/port 1 for FMA/mul/add, so doing the transpose with only port 5 (and vinsertf load ports) could be a win for overall throughput including the other code.Vasty
@PeterCordes, two shuffles can be replaced by one shuffle and two blends like this. Let r1=AB and r2=CD. We want t1=AC and t2=BD. Let v=shuffle(r1,r2)=BC. Then t1=blend(v,r1)=AC and t2=blend(v,r2)=BD. That works but it has a dependency on v so I think it would bottleneck on latency.Poulos
@PeterCordes, I looked at Intel's assembly code in Example 11-19 and sure enough their blends which follow the shuffle depend on the previous shuffle (vshufps ymm3, ymm6, ymm1, 0x4E; vblendps ymm10, ymm6, ymm3, 0xCC). Won't this be bound by latency? I think the latency on shuffles on KNL is large. Agner puts it at 4-7.Poulos
@PeterCordes, I guess there are several other independent operations to fill up the pipeline so perhaps it does not bottleneck on latency.Poulos
On a CPU with one shuffle port, the second shuffle has a resource conflict so it can't run in parallel with the first shuffle anyway. On Haswell, the shuffle + 2 blends can produce both results in the second cycle, instead of one in the first cycle and one in the second. I haven't looked at it for KNL. But aren't there lots of independent shuffles to do in this transpose? KNL has enough out-of-order execution capability to exploit this ILP, and have execution catch up with the front-end before it runs out of ROB / RS resources, I'd guess. I'm not really looking at this in detail ATM :/Vasty
@PeterCordes, yeah I was being silly about the dependency. If you have a reduction and you unroll the loop by two each one still has a dependency with itself but they are independent of each other. So there are several operations which are independent to fill the pipeline. I think replacing 2 shuffles with 1 shuffle and 2 blends will have the largest effect on KNL because a shuffle on KNL has a rthroughput of 2 but a blend a rthroughput of 0.5 so I think one shuffle and two blends will clearly win. Let me try this.Poulos
@PeterCordes, I just updated this answer converting 2 shuffles to 1 shuffle and 2 blends. One reason I count unpack and shuffle separately is that in some cases 2 shufps can be replaced with a shufps and blends but unpcklps/unpckhps cannot.Poulos
It is good to have different versions of the 8x8 transpose, because the performance may depend on the actual surrounding code. For example: 1. Is the input from memory or from registers after inlining tran? 2. Is tran used as a building block inside a large nxn transpose procedure, or is it used in between heavy floating point code? Note that vinsertf128 only avoids port 5 if the inserted data is a memory operand.Daugava
I'd be curious to see if timing results are anywhere near matching what we'd predict from Agner's port / throughput numbers.Vasty
gcc optimizes _mm256_permute2f128_ps from memory operands into vinsertf128 loads. See godbolt.org/g/bEwS4r for results from @Andrew Hallendorff's version (which I wouldn't recommend using, see my comments on his answer.)Vasty
clang actually drops all the shuffles: 8x vinsertf128, 16x vmovaps, 4x vunpckhpd, 4x vunpckhps, 4x vunpcklpd, 4x vunpcklpsAlack
In my experience with 8x8 transposes, you are sometimes better off doing them as some parallel 4x4 transposes, and dealing with the top-right/bottom-left quadrant swap by just splitting the 32-byte stores up into two 16-byte stores. This takes some pressure off the shuffle vALUs. Obviously this is especially true for 16-byte vector archs, but it sometimes applies to AVX2 as well.Foregut

Here's an AVX2 solution which works for 8x8 32-bit ints. You can of course cast float vectors to int and back if you want to transpose 8x8 floats. It might also be possible to do an AVX-only version (i.e. not requiring AVX2) just for floats, but I haven't tried that yet.

//
// transpose_8_8_avx2.c
//

#include <stdio.h>

#include <immintrin.h>

#define V_ELEMS 8

static inline void _mm256_merge_epi32(const __m256i v0, const __m256i v1, __m256i *vl, __m256i *vh)
{
    __m256i va = _mm256_permute4x64_epi64(v0, _MM_SHUFFLE(3, 1, 2, 0));
    __m256i vb = _mm256_permute4x64_epi64(v1, _MM_SHUFFLE(3, 1, 2, 0));
    *vl = _mm256_unpacklo_epi32(va, vb);
    *vh = _mm256_unpackhi_epi32(va, vb);
}

static inline void _mm256_merge_epi64(const __m256i v0, const __m256i v1, __m256i *vl, __m256i *vh)
{
    __m256i va = _mm256_permute4x64_epi64(v0, _MM_SHUFFLE(3, 1, 2, 0));
    __m256i vb = _mm256_permute4x64_epi64(v1, _MM_SHUFFLE(3, 1, 2, 0));
    *vl = _mm256_unpacklo_epi64(va, vb);
    *vh = _mm256_unpackhi_epi64(va, vb);
}

static inline void _mm256_merge_si128(const __m256i v0, const __m256i v1, __m256i *vl, __m256i *vh)
{
    *vl = _mm256_permute2x128_si256(v0, v1, _MM_SHUFFLE(0, 2, 0, 0));
    *vh = _mm256_permute2x128_si256(v0, v1, _MM_SHUFFLE(0, 3, 0, 1));
}

//
// Transpose_8_8
//
// in place transpose of 8 x 8 int array
//

static void Transpose_8_8(
    __m256i *v0,
    __m256i *v1,
    __m256i *v2,
    __m256i *v3,
    __m256i *v4,
    __m256i *v5,
    __m256i *v6,
    __m256i *v7)
{
    __m256i w0, w1, w2, w3, w4, w5, w6, w7;
    __m256i x0, x1, x2, x3, x4, x5, x6, x7;

    _mm256_merge_epi32(*v0, *v1, &w0, &w1);
    _mm256_merge_epi32(*v2, *v3, &w2, &w3);
    _mm256_merge_epi32(*v4, *v5, &w4, &w5);
    _mm256_merge_epi32(*v6, *v7, &w6, &w7);

    _mm256_merge_epi64(w0, w2, &x0, &x1);
    _mm256_merge_epi64(w1, w3, &x2, &x3);
    _mm256_merge_epi64(w4, w6, &x4, &x5);
    _mm256_merge_epi64(w5, w7, &x6, &x7);

    _mm256_merge_si128(x0, x4, v0, v1);
    _mm256_merge_si128(x1, x5, v2, v3);
    _mm256_merge_si128(x2, x6, v4, v5);
    _mm256_merge_si128(x3, x7, v6, v7);
}

int main(void)
{
    int32_t buff[V_ELEMS][V_ELEMS] __attribute__ ((aligned(32)));
    int i, j;
    int k = 0;

    // init buff
    for (i = 0; i < V_ELEMS; ++i)
    {
        for (j = 0; j < V_ELEMS; ++j)
        {
            buff[i][j] = k++;
        }
    }

    // print buff
    printf("\nBEFORE:\n");
    for (i = 0; i < V_ELEMS; ++i)
    {
        for (j = 0; j < V_ELEMS; ++j)
        {
            printf("%4d", buff[i][j]);
        }
        printf("\n");
    }

    // transpose
    Transpose_8_8((__m256i *)buff[0], (__m256i *)buff[1], (__m256i *)buff[2], (__m256i *)buff[3], (__m256i *)buff[4], (__m256i *)buff[5], (__m256i *)buff[6], (__m256i *)buff[7]);

    // print buff
    printf("\nAFTER:\n");
    for (i = 0; i < V_ELEMS; ++i)
    {
        for (j = 0; j < V_ELEMS; ++j)
        {
            printf("%4d", buff[i][j]);
        }
        printf("\n");
    }

    // transpose
    Transpose_8_8((__m256i *)buff[0], (__m256i *)buff[1], (__m256i *)buff[2], (__m256i *)buff[3], (__m256i *)buff[4], (__m256i *)buff[5], (__m256i *)buff[6], (__m256i *)buff[7]);

    // print buff
    printf("\nAFTER x2:\n");
    for (i = 0; i < V_ELEMS; ++i)
    {
        for (j = 0; j < V_ELEMS; ++j)
        {
            printf("%4d", buff[i][j]);
        }
        printf("\n");
    }

    return 0;
}

Transpose_8_8 compiles to around 56 instructions with clang, including loads and stores - I think it should be possible to improve on this with some more effort.
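
If you want to use it for floats as mentioned above, here is a minimal sketch of a cast wrapper (my own illustration; Transpose_8_8 is the function above, and the _mm256_castps_si256/_mm256_castsi256_ps bit-casts generate no instructions):

static void Transpose_8_8_ps(__m256 row[8])
{
    __m256i v[8];
    int i;

    for (i = 0; i < 8; i++)
        v[i] = _mm256_castps_si256(row[i]);     // reinterpret float lanes as ints (free)

    Transpose_8_8(&v[0], &v[1], &v[2], &v[3], &v[4], &v[5], &v[6], &v[7]);

    for (i = 0; i < 8; i++)
        row[i] = _mm256_castsi256_ps(v[i]);     // reinterpret back to floats (free)
}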

Compile and test:

$ gcc -Wall -mavx2 -O3 transpose_8_8_avx2.c && ./a.out

BEFORE:
   0   1   2   3   4   5   6   7
   8   9  10  11  12  13  14  15
  16  17  18  19  20  21  22  23
  24  25  26  27  28  29  30  31
  32  33  34  35  36  37  38  39
  40  41  42  43  44  45  46  47
  48  49  50  51  52  53  54  55
  56  57  58  59  60  61  62  63

AFTER:
   0   8  16  24  32  40  48  56
   1   9  17  25  33  41  49  57
   2  10  18  26  34  42  50  58
   3  11  19  27  35  43  51  59
   4  12  20  28  36  44  52  60
   5  13  21  29  37  45  53  61
   6  14  22  30  38  46  54  62
   7  15  23  31  39  47  55  63

AFTER x2:
   0   1   2   3   4   5   6   7
   8   9  10  11  12  13  14  15
  16  17  18  19  20  21  22  23
  24  25  26  27  28  29  30  31
  32  33  34  35  36  37  38  39
  40  41  42  43  44  45  46  47
  48  49  50  51  52  53  54  55
  56  57  58  59  60  61  62  63

$ 
Mercorr answered 2/9, 2014 at 14:31 Comment(0)

I decided to do a full test of 3 different routines in an apples-to-apples comparison.

  // transpose.cpp : 
  /*
  Transpose8x8Shuff 100,000,000 times
  (0.750000 seconds).
  Transpose8x8Permute 100,000,000 times
  (0.749000 seconds).
  Transpose8x8Insert 100,000,000 times
  (0.858000 seconds).
  */

  #include <stdio.h>
  #include <time.h>
  #include <thread>
  #include <chrono>
  #include <xmmintrin.h>
  #include <emmintrin.h>
  #include <tmmintrin.h>
  #include <immintrin.h>

  inline void Transpose8x8Shuff(unsigned long *in)
  {
     __m256 *inI = reinterpret_cast<__m256 *>(in);
     __m256 rI[8];
     rI[0] = _mm256_unpacklo_ps(inI[0], inI[1]); 
     rI[1] = _mm256_unpackhi_ps(inI[0], inI[1]); 
     rI[2] = _mm256_unpacklo_ps(inI[2], inI[3]); 
     rI[3] = _mm256_unpackhi_ps(inI[2], inI[3]); 
     rI[4] = _mm256_unpacklo_ps(inI[4], inI[5]); 
     rI[5] = _mm256_unpackhi_ps(inI[4], inI[5]); 
     rI[6] = _mm256_unpacklo_ps(inI[6], inI[7]); 
     rI[7] = _mm256_unpackhi_ps(inI[6], inI[7]); 

     __m256 rrF[8];
     __m256 *rF = reinterpret_cast<__m256 *>(rI);
     rrF[0] = _mm256_shuffle_ps(rF[0], rF[2], _MM_SHUFFLE(1,0,1,0));
     rrF[1] = _mm256_shuffle_ps(rF[0], rF[2], _MM_SHUFFLE(3,2,3,2));
     rrF[2] = _mm256_shuffle_ps(rF[1], rF[3], _MM_SHUFFLE(1,0,1,0)); 
     rrF[3] = _mm256_shuffle_ps(rF[1], rF[3], _MM_SHUFFLE(3,2,3,2));
     rrF[4] = _mm256_shuffle_ps(rF[4], rF[6], _MM_SHUFFLE(1,0,1,0));
     rrF[5] = _mm256_shuffle_ps(rF[4], rF[6], _MM_SHUFFLE(3,2,3,2));
     rrF[6] = _mm256_shuffle_ps(rF[5], rF[7], _MM_SHUFFLE(1,0,1,0));
     rrF[7] = _mm256_shuffle_ps(rF[5], rF[7], _MM_SHUFFLE(3,2,3,2));

     rF = reinterpret_cast<__m256 *>(in);
     rF[0] = _mm256_permute2f128_ps(rrF[0], rrF[4], 0x20);
     rF[1] = _mm256_permute2f128_ps(rrF[1], rrF[5], 0x20);
     rF[2] = _mm256_permute2f128_ps(rrF[2], rrF[6], 0x20);
     rF[3] = _mm256_permute2f128_ps(rrF[3], rrF[7], 0x20);
     rF[4] = _mm256_permute2f128_ps(rrF[0], rrF[4], 0x31);
     rF[5] = _mm256_permute2f128_ps(rrF[1], rrF[5], 0x31);
     rF[6] = _mm256_permute2f128_ps(rrF[2], rrF[6], 0x31);
     rF[7] = _mm256_permute2f128_ps(rrF[3], rrF[7], 0x31);
  }

  inline void Transpose8x8Permute(unsigned long *in)
  {
     __m256i *inI = reinterpret_cast<__m256i *>(in);
     __m256i rI[8];
     rI[0] = _mm256_permute2f128_si256(inI[0], inI[4], 0x20); 
     rI[1] = _mm256_permute2f128_si256(inI[0], inI[4], 0x31); 
     rI[2] = _mm256_permute2f128_si256(inI[1], inI[5], 0x20); 
     rI[3] = _mm256_permute2f128_si256(inI[1], inI[5], 0x31); 
     rI[4] = _mm256_permute2f128_si256(inI[2], inI[6], 0x20); 
     rI[5] = _mm256_permute2f128_si256(inI[2], inI[6], 0x31); 
     rI[6] = _mm256_permute2f128_si256(inI[3], inI[7], 0x20); 
     rI[7] = _mm256_permute2f128_si256(inI[3], inI[7], 0x31); 

     __m256 rrF[8];
     __m256 *rF = reinterpret_cast<__m256 *>(rI);
     rrF[0] = _mm256_unpacklo_ps(rF[0], rF[4]); 
     rrF[1] = _mm256_unpackhi_ps(rF[0], rF[4]); 
     rrF[2] = _mm256_unpacklo_ps(rF[1], rF[5]); 
     rrF[3] = _mm256_unpackhi_ps(rF[1], rF[5]); 
     rrF[4] = _mm256_unpacklo_ps(rF[2], rF[6]); 
     rrF[5] = _mm256_unpackhi_ps(rF[2], rF[6]); 
     rrF[6] = _mm256_unpacklo_ps(rF[3], rF[7]); 
     rrF[7] = _mm256_unpackhi_ps(rF[3], rF[7]); 

     rF = reinterpret_cast<__m256 *>(in);
     rF[0] = _mm256_unpacklo_ps(rrF[0], rrF[4]); 
     rF[1] = _mm256_unpackhi_ps(rrF[0], rrF[4]); 
     rF[2] = _mm256_unpacklo_ps(rrF[1], rrF[5]); 
     rF[3] = _mm256_unpackhi_ps(rrF[1], rrF[5]); 
     rF[4] = _mm256_unpacklo_ps(rrF[2], rrF[6]); 
     rF[5] = _mm256_unpackhi_ps(rrF[2], rrF[6]); 
     rF[6] = _mm256_unpacklo_ps(rrF[3], rrF[7]); 
     rF[7] = _mm256_unpackhi_ps(rrF[3], rrF[7]); 
  }

  inline void Transpose8x8Insert(unsigned long *in)
  {
     __m256i *inIa = reinterpret_cast<__m256i *>(in);
     __m256i *inIb = reinterpret_cast<__m256i *>(&reinterpret_cast<__m128i *>(in)[1]);
     __m128i *inI128 = reinterpret_cast<__m128i *>(in);
     __m256i rI[8];
     rI[0] = _mm256_insertf128_si256(inIa[0], inI128[8], 1); 
     rI[1] = _mm256_insertf128_si256(inIb[0], inI128[9], 1); 
     rI[2] = _mm256_insertf128_si256(inIa[1], inI128[10], 1); 
     rI[3] = _mm256_insertf128_si256(inIb[1], inI128[11], 1); 
     rI[4] = _mm256_insertf128_si256(inIa[2], inI128[12], 1); 
     rI[5] = _mm256_insertf128_si256(inIb[2], inI128[13], 1); 
     rI[6] = _mm256_insertf128_si256(inIa[3], inI128[14], 1); 
     rI[7] = _mm256_insertf128_si256(inIb[3], inI128[15], 1); 

     __m256 rrF[8];
     __m256 *rF = reinterpret_cast<__m256 *>(rI);
     rrF[0] = _mm256_unpacklo_ps(rF[0], rF[4]); 
     rrF[1] = _mm256_unpackhi_ps(rF[0], rF[4]); 
     rrF[2] = _mm256_unpacklo_ps(rF[1], rF[5]); 
     rrF[3] = _mm256_unpackhi_ps(rF[1], rF[5]); 
     rrF[4] = _mm256_unpacklo_ps(rF[2], rF[6]); 
     rrF[5] = _mm256_unpackhi_ps(rF[2], rF[6]); 
     rrF[6] = _mm256_unpacklo_ps(rF[3], rF[7]); 
     rrF[7] = _mm256_unpackhi_ps(rF[3], rF[7]); 

     rF = reinterpret_cast<__m256 *>(in);
     rF[0] = _mm256_unpacklo_ps(rrF[0], rrF[4]); 
     rF[1] = _mm256_unpackhi_ps(rrF[0], rrF[4]); 
     rF[2] = _mm256_unpacklo_ps(rrF[1], rrF[5]); 
     rF[3] = _mm256_unpackhi_ps(rrF[1], rrF[5]); 
     rF[4] = _mm256_unpacklo_ps(rrF[2], rrF[6]); 
     rF[5] = _mm256_unpackhi_ps(rrF[2], rrF[6]); 
     rF[6] = _mm256_unpacklo_ps(rrF[3], rrF[7]); 
     rF[7] = _mm256_unpackhi_ps(rrF[3], rrF[7]); 
  }

  int main()
  {
  #define dwordCount 64
     alignas(32) unsigned long mat[dwordCount];
     for (int i = 0; i < dwordCount; i++) {
        mat[i] = i;
     }
     clock_t t;
     printf ("Transpose8x8Shuff 100,000,000 times\n");
     Transpose8x8Shuff(mat);
     t = clock();
     int i = 100000000;
     do {
        Transpose8x8Shuff(mat);
     } while (--i >= 0);
     t = clock() - t;
     volatile int dummy = mat[2];
     printf ("(%f seconds).\n",((float)t)/CLOCKS_PER_SEC);
     printf ("Transpose8x8Permute 100,000,000 times\n");
     Transpose8x8Permute(mat);
     t = clock();
     i = 100000000;
     do {
        Transpose8x8Permute(mat);
     } while (--i >= 0);
     t = clock() - t;
     dummy = mat[2];
     printf ("(%f seconds).\n",((float)t)/CLOCKS_PER_SEC);
     printf ("Transpose8x8Insert 100,000,000 times\n");
     Transpose8x8Insert(mat);
     t = clock();
     i = 100000000;
     do {
        Transpose8x8Insert(mat);
     } while (--i >= 0);
     t = clock() - t;
     dummy = mat[2];
     printf ("(%f seconds).\n",((float)t)/CLOCKS_PER_SEC);
     char c = getchar(); 
     return 0;
  }

This is benchmarking latency, not throughput (because the output for one transpose is the input for the next), but it probably bottlenecks on shuffle throughput anyway.
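
If you wanted a rough throughput number instead, one sketch (not part of the benchmark above; the buffer count and function name are illustrative, and it reuses Transpose8x8Shuff and the headers from the program) is to rotate through several independent matrices per iteration so their dependency chains can overlap:

  void Bench8x8ShuffThroughput()
  {
     alignas(32) unsigned long mats[4][64];
     for (int m = 0; m < 4; m++)
        for (int i = 0; i < 64; i++) mats[m][i] = i;
     clock_t t = clock();
     for (int rep = 0; rep < 25000000; rep++) {
        Transpose8x8Shuff(mats[0]);   // the four calls are independent of each other,
        Transpose8x8Shuff(mats[1]);   // so out-of-order execution can overlap them and
        Transpose8x8Shuff(mats[2]);   // the loop is limited by shuffle throughput rather
        Transpose8x8Shuff(mats[3]);   // than by the latency of a single chain
     }
     t = clock() - t;
     volatile unsigned long sink = mats[0][2] ^ mats[1][2] ^ mats[2][2] ^ mats[3][2];
     (void)sink;                      // keep the transposes from being optimized away
     printf ("(%f seconds).\n",((float)t)/CLOCKS_PER_SEC);
  }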


Results on Skylake i7-6700k @ 3.9GHz for a modified version of the above code (see it on the Godbolt compiler explorer), fixing the following bugs:

  • printf outside the timed regions, before starting the clock()
  • volatile dummy = in[2] at the end so all the transposes don't optimize away (which gcc actually does otherwise).
  • portable C++11, doesn't require MSVC (alignas(32) instead of __declspec, and don't include stdafx.h.)
  • sleep removed so the CPU won't downclock to idle speed between tests.

I didn't fix the unnecessary mixing of __m256i* / __m256*, and I didn't check if that led to worse code-gen with gcc or clang. I also didn't use a std::chrono high-rez clock because clock() was accurate enough for this many repeats.

g++7.3 -O3 -march=native on Arch Linux: Z Boson's version is fastest

Transpose8x8Shuff 100,000,000 times
(0.637479 seconds).
Transpose8x8Permute 100,000,000 times
(0.792658 seconds).
Transpose8x8Insert 100,000,000 times
(0.846590 seconds).

clang++ 5.0.1 -O3 -march=native: 8x8Permute gets optimized to something even faster than anything gcc did, but 8x8Insert is pessimized horribly.

Transpose8x8Shuff 100,000,000 times
(0.642185 seconds).
Transpose8x8Permute 100,000,000 times
(0.622157 seconds).
Transpose8x8Insert 100,000,000 times
(2.149958 seconds).

The asm instructions generated from the source won't match the intrinsics exactly: clang in particular has a shuffle optimizer that recompiles shuffle intrinsics the same way it optimizes scalar code like + on integers, emitting whatever sequence it thinks is best. Transpose8x8Insert should not be that much slower, so clang must have chosen poorly.

Tronna answered 17/8, 2018 at 1:12 Comment(10)
C intrinsics and variables aren't exactly instructions or registers. Loads and stores take extra instructions. (AVX mostly avoids extra register-copy instructions, though.) Why would you want to force the compiler to emit store instructions to update in[] when you could have used local variables? Same amount of store instructions, but when compiling for x86-64, they'll all fit in registers and the stores can be optimized away. (Also, _mm256_permute2f128_ps instead of si256 avoids casting the pointers to two different types.)Vasty
Without __restrict, the compiler doesn't know if the input and output overlap, so it actually does all the stores, including to outI[]. Fixing that makes your version about the same instruction count as Z Boson's (adapted to read and write in/out array instead of modify references in-place) with gcc -m32 -O3 -march=haswell: godbolt.org/g/bEwS4r. gcc aligns the stack by 32 and spills temporaries to the stack in both versions. So your idea of using in[] to avoid extra spills didn't work, and just leads to more work done in the asm.Vasty
gcc does optimize some of the _mm256_permute2f128_si256 into 128-bit load + vinserti128, like Z Boson did by hand in the 2nd version. (Yes, it used the integer version because you used si256 instead of ps).Vasty
I fixed the issues you brought up. It now uses local scratch areas to prevent the overlap issue. Indeed when I ran it without the local scratch areas it came out around 0.9s. An advantage of using _mm256_permute2f128_si256 to pre-lane the data is that this can be extended to work on bytes and words. So to transpose a byte matrix you could use _mm256_unpacklo_epi8 4 times.Tronna
Votes aren't permanent, and they reflect the quality of the answer at the time (just a code dump with some bogus reasoning that seemed to think avoiding _mm256_load_ps counted as saving asm instructions), not how much effort you put in. Once you improved your answer to make the code not suck, and included benchmarks (presumably from MSVC because you used non-portable __declspec instead of C++11 alignas(32)), I upvoted. You could improve this further by mentioning which compiler + options you used, and including some of your comments in the answer (e.g. byte matrix).Vasty
I fixed your code to take printf out of the timed region, and not sleep so the CPU won't drop to idle speed and have to ramp up again before every test. I edited my timing results into your answer, along with a link to the full code on godbolt.Vasty
I did some of the edits you hinted at. Wasn't sure where the volatile dummy = in[2] was exactly supposed to reside. It would be nice to clean up some of these comments and focus this into something useful. Maybe leaning towards pre-laning data, when to do it and what implications it has for certain compilers. I have more questions/advice but don't want to clog the comments. (side note: I have an old bulldozer I may dig out and build a linux box to see how it reacts to this code)Tronna
I linked my entire source code on Godbolt in my edit, search there for where to put the volatile sink. It goes after the last loop, because there's a dependency chain running through all the transposes, so you only need it once at the very end to stop the compiler from throwing away any of the work (unless it can see through the transposes and just store a constant 2, but current compilers can't). Or just print out in[2] as one of the args to printf, so the compiler needs the result. And BTW, including <immintrin.h> makes all the other <xmmintrin.h> and so on redundant.Vasty
Feel free to edit my code as you see fit. I would prefer to keep my variable as array declarations. Thanks.Tronna
I think you should test this against an implementation with gathers; in terms of instruction count it would be much smaller.Upshaw

Further to the previous answers, the use of shufps is pretty much overkill in this scenario, since we can just unpacklo/unpackhi our way to the result. Shuffle and unpack instructions have the same latency/throughput; however, shuffle generates an additional byte in the machine-code op for its immediate (i.e. 5 bytes for shuffle, 4 for unpack).

At some point we will require 8 permutes across lanes. This is a slower operation (3 cycles latency), so we want to kick off those ops as early as possible. Assuming the transpose8f method gets inlined (it should be!), any loads required for the a..h args should be folded into the unpack instructions as memory operands.

The only minor issue you may face is that because you are using more than 8 registers here, the register allocator may use YMM8 and up. That can cause ops that would have used a 2-byte VEX prefix to be encoded with the 3-byte VEX form, which adds a byte per op.

As a result, with a bit of jiggling around, you'll end up with this:

#include <immintrin.h>
#include <stdint.h>

typedef __m256 f256;
#define unpacklo8f _mm256_unpacklo_ps
#define unpackhi8f _mm256_unpackhi_ps

template<uint8_t X, uint8_t Y>
inline f256 permute128f(const f256 a, const f256 b)
{
  return _mm256_permute2f128_ps(a, b, X | (Y << 4)); 
}

inline void transpose8f(
  const f256 a, const f256 b, const f256 c, const f256 d, 
  const f256 e, const f256 f, const f256 g, const f256 h, 
  f256& s, f256& t, f256& u, f256& v,
  f256& x, f256& y, f256& z, f256& w)
{
  const f256 t00 = unpacklo8f(a, c);
  const f256 t02 = unpacklo8f(b, d);
  const f256 t20 = unpacklo8f(e, g);
  const f256 t22 = unpacklo8f(f, h);

  const f256 t10 = unpacklo8f(t00, t02);
  const f256 t30 = unpacklo8f(t20, t22);
  const f256 t11 = unpackhi8f(t00, t02);
  s = permute128f<0, 2>(t10, t30);
  const f256 t31 = unpackhi8f(t20, t22);
  x = permute128f<1, 3>(t10, t30);
  const f256 t01 = unpackhi8f(a, c);
  t = permute128f<0, 2>(t11, t31);
  const f256 t21 = unpackhi8f(e, g);
  y = permute128f<1, 3>(t11, t31);
  const f256 t03 = unpackhi8f(b, d);
  const f256 t23 = unpackhi8f(f, h);

  const f256 t12 = unpacklo8f(t01, t03);
  const f256 t13 = unpackhi8f(t01, t03);
  const f256 t32 = unpacklo8f(t21, t23);
  const f256 t33 = unpackhi8f(t21, t23);

  u = permute128f<0, 2>(t12, t32);
  z = permute128f<1, 3>(t12, t32);
  v = permute128f<0, 2>(t13, t33);
  w = permute128f<1, 3>(t13, t33);
}
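
A minimal usage sketch (the memory wrapper is my own addition; by my reading of the shuffles above, s..v come out as rows 0-3 and x..w as rows 4-7 of the transposed matrix):

inline void transpose8x8_mem(const float* in, float* out)  // both 32-byte aligned, row-major 8x8
{
  f256 s, t, u, v, x, y, z, w;
  transpose8f(_mm256_load_ps(in + 0*8), _mm256_load_ps(in + 1*8),
              _mm256_load_ps(in + 2*8), _mm256_load_ps(in + 3*8),
              _mm256_load_ps(in + 4*8), _mm256_load_ps(in + 5*8),
              _mm256_load_ps(in + 6*8), _mm256_load_ps(in + 7*8),
              s, t, u, v, x, y, z, w);
  _mm256_store_ps(out + 0*8, s);
  _mm256_store_ps(out + 1*8, t);
  _mm256_store_ps(out + 2*8, u);
  _mm256_store_ps(out + 3*8, v);
  _mm256_store_ps(out + 4*8, x);
  _mm256_store_ps(out + 5*8, y);
  _mm256_store_ps(out + 6*8, z);
  _mm256_store_ps(out + 7*8, w);
}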

You won't improve on this (you can do the 128-bit permutes first and the unpacks second, but they'll end up being essentially identical).

Goa answered 30/10, 2018 at 3:16 Comment(0)

This is my solution with fewer instructions, and the performance is very good: about 8 times faster than scalar code. I've tested it using ICC, GCC, and Clang on Fedora.

#include <stdio.h>
#include <x86intrin.h>
#define MAX1    128
#define MAX2 MAX1

float __attribute__(( aligned(32))) a_tra[MAX2][MAX1], __attribute__(( aligned(32))) a[MAX1][MAX2] ;
int main()
{

    int i,j;//, ii=0,jj=0;
    // variables for vector section
    int vindexm [8]={0, MAX1, MAX1*2, MAX1*3, MAX1*4, MAX1*5, MAX1*6, MAX1*7 };
    __m256i vindex = _mm256_load_si256((__m256i *) &vindexm[0]);
    __m256 vec1, vec2, vec3, vec4, vec5, vec6, vec7, vec8;

        for(i=0; i<MAX1;  i+=8){            
            for(j=0; j<MAX2;  j+=8){
                //loading from columns
                vec1 = _mm256_i32gather_ps (&a[i][j+0],vindex,4);
                vec2 = _mm256_i32gather_ps (&a[i][j+1],vindex,4);
                vec3 = _mm256_i32gather_ps (&a[i][j+2],vindex,4);
                vec4 = _mm256_i32gather_ps (&a[i][j+3],vindex,4);
                vec5 = _mm256_i32gather_ps (&a[i][j+4],vindex,4);
                vec6 = _mm256_i32gather_ps (&a[i][j+5],vindex,4);
                vec7 = _mm256_i32gather_ps (&a[i][j+6],vindex,4);
                vec8 = _mm256_i32gather_ps (&a[i][j+7],vindex,4);

                //storing to the rows
                _mm256_store_ps(&a_tra[j+0][i], vec1);
                _mm256_store_ps(&a_tra[j+1][i], vec2);
                _mm256_store_ps(&a_tra[j+2][i], vec3);
                _mm256_store_ps(&a_tra[j+3][i], vec4);
                _mm256_store_ps(&a_tra[j+4][i], vec5);
                _mm256_store_ps(&a_tra[j+5][i], vec6);
                _mm256_store_ps(&a_tra[j+6][i], vec7);
                _mm256_store_ps(&a_tra[j+7][i], vec8);  
            }
        }
    return 0;
}
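
As the comments below point out, the index vector doesn't need a separate array at all (which also sidesteps the question of that array's alignment); a sketch of that variant:

    __m256i vindex = _mm256_setr_epi32(0, MAX1, MAX1*2, MAX1*3,
                                       MAX1*4, MAX1*5, MAX1*6, MAX1*7);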
Unlearned answered 29/10, 2018 at 0:4 Comment(20)
On what CPU? I assume on Skylake, because earlier architectures have worse gather throughput. Are you sure you tested the same amount of work? 8x faster doesn't seem reasonable unless you did something wrong like compile without optimization. Also note that your code is unsafe: You forgot to use alignas(32) int vindexm[8], so your code avoids faulting on an aligned load only because your compiler notices that the array is being used for a 256-bit load and chooses to align it. Just use _mm256_setr_epi32(0, MAX1, MAX1*2, ...) like a normal person instead of a non-const array at all.Vasty
@PeterCordes, your assumption is right; I have tested this on Skylake. You've taught me optimization before and yes, I have noticed that. The scalar and intrinsic versions are compiled with gcc/clang -march=native -O3 -fno-tree-vectorize -fno-slp-vectorize and icc -O3 -novec. Thank you for your suggestions; I'll remember them in future research. I haven't tested the Haswell microarchitecture because I only have a Skylake CPU, a Core i7 6700HQUnlearned
Wait, you were only comparing this against scalar, not against the other answers here?Vasty
@PeterCordes, even on Skylake, if you mix gather instructions with other computations you'll lose performance. Gather instructions and stores are almost pure memory operations, and that's the reason I gain performance.Unlearned
@PeterCordes, no, only scalar. The other answers' performance must be almost the same as this answer, but this answer uses fewer instructions and is more readable.Unlearned
@PeterCordes, I've tested the answers before (a year ago), and in some cases other instructions gained more performance, but only marginally, for example 9x vs 8x or 8x vs 7.5x.Unlearned
even in Skylake if you mix gather instruction with other computations you'll lose performance. citation needed. If you have useful work you can do in parallel with gathers, you'll keep more execution units busy. Compared to other answers, vpgatherdd is 4 uops for each instruction, vs. 1 uop per shuffle in the other answers. So uop-cache footprint is probably worse, even if L1i cache footprint is maybe better.Vasty
This solution causes a bottleneck in the gather units (IACA showed that)Unlearned
More importantly, this will perform very badly on Haswell, and even worse on AMD Ryzen, while the shuffles are just as fast. (Like worse than half the throughput on Haswell (12c throughput for vpgatherdd ymm) vs. on Skylake (5c). Ryzen has 20c throughput for vpgatherdd ymm. i.e. Ryzen it can only run one gather instruction per 20 core clock cycles, vs. per 5 clocks on Skylake. So you've written a cripple-AMD function.Vasty
I have tested that and I don't have the citation. but here is the piece of my code for matrix multiplication. for( i=0;i<MAX1;i++){ for(j=0;j<MAX3;j++){ sum0_i = _mm256_setzero_si256(); for(k=0;k<MAX2;k=k+8){ sum0_i = _mm256_add_epi32( sum0_i,_mm256_mullo_epi32( _mm256_load_si256((__m256i *)&a[i][k]) , _mm256_i32gather_epi32 ((int const *)&b[k][j],vindex,4))); } _mm256_store_si256((__m256i *)&temp8[0] , sum0_i); c_result[i][j]= temp8[0]+temp8[1]+temp8[2]+temp8[3]+temp8[4]+temp8[5]+temp8[6]+temp8[7]; } }Unlearned
My research is limited to my platform, and I knew that gather instructions are bad on other platforms. I even noted that in my research paper. Hopefully the citation will be available after my latest paper is published.Unlearned
"citation needed" is a phrase from wikipedia re: a suspicious-looking claim; I wasn't meaning to ask for actual published research. :P It's nearly unreadable in a comment, but I think you're doing a 32-bit integer matmul with on-the-fly transpose, and a store/reload horizontal sum which looks like a bad idea for an inner loop. IDK why that would be slower than transposing separately, but it doesn't look good. mullo_epi32 has high-ish latency (2 dependent uops / 10 cycles), so maybe out-of-order execution has trouble hiding it and overlapping enough work?Vasty
Yeah, I have noticed that if we use gather instructions in this way we lose performance, and the speedup (actually slowdown) is 0.5x. When we transpose before the multiplications we separate out the computation part and reorganize the matrix in such a way that no gather or other shuffling instruction is needed inside the inner loop.Unlearned
Oh, were you transposing O(n^3) times inside the inner-most loop instead of saving / reusing the transpose results from one column vector for all row vectors? Yeah of course that's slower, but not because of mixing gather + compute, just because you're doing n times as many transposes! And you have worse spatial locality for repeated looping over the matrix, if your size is larger than 8x8. This is like cache-blocking, except instead of saving a transpose result to memory, you're saving it to a vector reg.Vasty
No, I don't transpose inside the inner loop. That is another version that gains performance. AB is one algorithm and ABT is another oneUnlearned
Then why is _mm256_i32gather_epi32 inside the triple-nested loop in your comment? If it's not doing a transpose, then what is it doing?Vasty
The gather reads eight elements from matrix B. Another eight elements are read by a regular load, and they are multiplied.Unlearned
Right, so you're redoing the work of transposing on the fly into a register inside the inner loop. It's still a transpose whether you save the result to memory anywhere. Of if you don't like that terminology, you're redoing the work of non-sequential loading from matrix B inside the inner loop N times for each column vector in an NxN matrix. Of course that's slow.Vasty
Yes, but I wanted to vectorize this algorithm: for(i=0; i<N; i++){ for(j=0; j<P; j++){ temp=0; for(k=0; k<M; k++){ temp += A[i][k] * B[k][j];} C[i][j]= temp;}} and it was easier to implement with gather instructions instead of separating the steps. The other version is for(i=0; i<N; i++){ for(j=0; j<M; j++){ B_T[j][i] = B[i][j];}} for(i=0; i<N; i++){ for(j=0; j<P; j++){ temp=0; for(k=0; k<M; k++){ temp += A[i][k] * B_T[j][k];} C[i][j] = temp;}} which uses the transposed version of the second matrix.Unlearned
Because I live in Iran, godbolt.org has banned my IP, and my VPN has expired, so I can't copy the code there. So please accept my apologies for pasting code in the comments.Unlearned
