Converting from Source-based Indices to Destination-based Indices
I'm using AVX2 instructions in some C code.

The VPERMD instruction takes two 8-integer vectors a and idx and generates a third one, dst, by permuting a based on idx. This seems equivalent to dst[i] = a[idx[i]] for i in 0..7. I'm calling this source based, because the move is indexed based on the source.

However, I have my calculated indices in destination based form. This is natural for setting an array, and is equivalent to dst[idx[i]] = a[i] for i in 0..7.

How can I convert from source-based form to destination-based form? An example test case is:

{2 1 0 5 3 4 6 7}    source-based form. 
{2 1 0 4 5 3 6 7}    destination-based equivalent

For this conversion I'm staying in ymm registers, which means destination-based solutions don't work: I can't simply set elements one at a time, because the insert instructions only operate on constant indices.

Pandiculation asked 31/8, 2016 at 23:35 Comment(4)
That's just the classic "permutation inversion": dst[src[i]] = i. – Sleekit
Right. But that requires the ability to set elements in a destination-based manner, and since I'm operating in AVX2 registers, I can't do that. I have working C code that does it almost exactly as you say, but I need to transform the indices without being able to do a destination-based permutation like you suggest. – Pandiculation
Your a[i] = a[idx[i]] for i in 0..7 didn't correctly describe VPERMD's operation, because it implied that changes to a would feed back into the a[idx[i]] for later elements. E.g. the original a[0] would always be destroyed right away, unless idx[0] = 0. I think your example is still sane after my edit to correct that bug (or was assuming that behaviour the whole time). – Gadoid
Thanks for the edit. I did understand the behavior, but I didn't describe it correctly. – Pandiculation

I guess you're implicitly saying that you can't modify your code to calculate source-based indices in the first place? I can't think of anything you can do with x86 SIMD other than AVX-512 scatter instructions, which do take dst-based indices. (But those are not very fast on current CPUs, even compared to gather loads; see https://uops.info/.)

Storing to memory, inverting, and reloading a vector might actually be best. (Or transferring to integer registers directly, not through memory, maybe after a vextracti128 / packusdw so you only need two 64-bit transfers from vector to integer regs: movq and pextrq).

But anyway, then use them as indices to store a counter into an array in memory, and reload that as a vector. This is still slow and ugly, and includes a store-forwarding failure delay. So it's probably worth your while to change your index-generating code to generate source-based shuffle vectors if at all possible.

Gadoid answered 1/9, 2016 at 2:49 Comment(1)
Thanks for the idea with the scatter operation. I added an example as an answer. – Id

To benchmark the solution, I modified the code from my other answer to compare the performance of the scatter instruction (USE_SCATTER defined) with a store and sequential permute (USE_SCATTER undefined). I had to copy the result back to the permutation pattern perm in order to prevent the compiler from optimizing the loop body away:

#ifdef TEST_SCATTER
  #define REPEATS 1000000001
  #define USE_SCATTER
  
  __m512i ident = _mm512_set_epi32(15,14,13,12,11,10,9,8,7,6,5,4,3,2,1,0);
  __m512i perm  = _mm512_set_epi32(7,9,3,0,5,8,13,11,4,2,15,1,12,6,10,14);  
  uint32_t outA[16] __attribute__ ((aligned(64)));
  uint32_t id[16], in[16];
  _mm512_storeu_si512(id, ident);
  for (int i = 0; i < 16; i++) printf("%2d ", id[i]); puts("");
  _mm512_storeu_si512(in, perm);
  for (int i = 0; i < 16; i++) printf("%2d ", in[i]); puts("");
#ifdef USE_SCATTER
  puts("scatter");
  for (long t = 0; t < REPEATS; t++) {
    _mm512_i32scatter_epi32(outA, perm, ident, 4);
    perm = _mm512_load_si512(outA);
  }
#else
  puts("store & permute");
  uint32_t permA[16] __attribute__ ((aligned(64)));
  for (long t = 0; t < REPEATS; t++) {
    _mm512_store_si512(permA, perm);
    for (int i = 0; i < 16; i++) outA[permA[i]] = i;
    perm = _mm512_load_si512(outA);    
  }
#endif
  for (int i = 0; i < 16; i++) printf("%2d ", outA[i]); puts("");

#endif

Here's the output for the two cases (using the builtin time command of tcsh, the u output is user-space time in seconds):

 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 
14 10  6 12  1 15  2  4 11 13  8  5  0  3  9  7 
store & permute
12  4  6 13  7 11  2 15 10 14  1  8  3  9  0  5 
10.765u 0.001s 0:11.22 95.9%    0+0k 0+0io 0pf+0w

 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 
14 10  6 12  1 15  2  4 11 13  8  5  0  3  9  7 
scatter
12  4  6 13  7 11  2 15 10 14  1  8  3  9  0  5 
10.740u 0.000s 0:11.19 95.9%    0+0k 40+0io 0pf+0w

The runtime is about the same (Intel(R) Xeon(R) W-2125 CPU @ 4.00GHz, clang++-6.0, -O3 -funroll-loops -march=native). I checked the assembly code generated. With USE_SCATTER defined, the compiler generates vpscatterdd instructions, without it generates complex code using vpextrd, vpextrq, and vpextracti32x4.


Edit: I was worried that the compiler might have found a specific solution for the fixed permutation pattern I used, so I replaced it with a randomly generated pattern from std::random_shuffle(), but the time measurements are about the same.


Edit: Following the comment by Peter Cordes, I wrote a modified benchmark that hopefully measures something like throughput:

  #define REPEATS 1000000
  #define ARRAYSIZE 1000
  #define USE_SCATTER
  
  std::srand(unsigned(std::time(0)));
  // build array with random permutations
  uint32_t permA[ARRAYSIZE][16] __attribute__ ((aligned(64)));
  for (int i = 0; i < ARRAYSIZE; i++)
    _mm512_store_si512(permA[i], randPermZMM());
  // vector register
  __m512i perm;
#ifdef USE_SCATTER
  puts("scatter");
  __m512i ident = _mm512_set_epi32(15,14,13,12,11,10,9,8,7,6,5,4,3,2,1,0);
  for (long t = 0; t < REPEATS; t++)
    for (long i = 0; i < ARRAYSIZE; i++) {
      perm = _mm512_load_si512(permA[i]);
      _mm512_i32scatter_epi32(permA[i], perm, ident, 4);
    }
#else
  uint32_t permAsingle[16] __attribute__ ((aligned(64)));
  puts("store & permute");
  for (long t = 0; t < REPEATS; t++)
    for (long i = 0; i < ARRAYSIZE; i++) {
      perm = _mm512_load_si512(permA[i]);
      _mm512_store_si512(permAsingle, perm);
      uint32_t *permAVec = permA[i];
      for (int e = 0; e < 16; e++)
        permAVec[permAsingle[e]] = e;
    }
#endif
  FILE *f = fopen("testperm.dat", "w");
  fwrite(permA, ARRAYSIZE, 64, f);
  fclose(f);

I use an array of permutation patterns that are modified sequentially, with no dependency between iterations.

These are the results:

scatter
4.241u 0.002s 0:04.26 99.5% 0+0k 80+128io 0pf+0w

store & permute
5.956u 0.002s 0:05.97 99.6% 0+0k 80+128io 0pf+0w

So throughput is better when using the scatter instruction.

Id answered 17/12, 2020 at 8:50 Comment(6)
With the reload of the output as the permutation for the next iteration, you're measuring latency, not throughput, including the store-forwarding stall from doing the reload right away. That might reflect some use-cases, but measuring throughput might be interesting, too. – Gadoid
That's 10.765u = user-space seconds, right? Not 10.765u = microseconds. 10 sec is plenty long to hide any warm-up effects and other overhead, so that's fine. – Gadoid
@PeterCordes: What would be a code modification to measure throughput? Apply this to an array of (randomly generated) permutation patterns? / The time measurements are from the tcsh builtin time command, and the u time is user-space time in seconds. – Id
Easy in asm: run the instructions without feeding the result back as a dependency for the next iteration, so out-of-order exec can work its magic and overlap multiple iterations. Getting a C compiler to emit asm like that may require some careful use of volatile vars to force a load, or asm("" : "+x"(vector)) to forget anything it knows about the value of a vector variable, e.g. like Benchmark::DoNotOptimize or various other tricks. Then you check the generated asm to make sure it looks realistic for what you want to measure. – Gadoid
@PeterCordes: I included a modified benchmark in my answer. Do you think that is related to throughput? – Id
Yeah, independent work on independent array elements is throughput-dependent, letting OoO exec hide latency. Your array is small enough that it will fit in L2 cache, but 1000 * 64 bytes is larger than 32kiB L1d (or even the 48kiB L1d on Ice Lake). But it's small enough that page faults won't be a complicating factor. You may want that to simulate the case where things aren't always hot in L1d, and updating pointers in a loop, so this is probably good. – Gadoid

I had the same problem, but in the opposite direction: destination indices were easy to compute, but source indices were required for the application of SIMD permute instructions. Here's a solution for AVX-512 using a scatter instruction as suggested by Peter Cordes; it should also apply to the opposite direction:

__m512i ident = _mm512_set_epi32(15,14,13,12,11,10,9,8,7,6,5,4,3,2,1,0);
__m512i perm  = _mm512_set_epi32(7,9,3,0,5,8,13,11,4,2,15,1,12,6,10,14);  
uint32_t id[16], in[16], out[16];
_mm512_storeu_si512(id, ident);
for (int i = 0; i < 16; i++) printf("%2d ", id[i]); puts("");
_mm512_storeu_si512(in, perm);
for (int i = 0; i < 16; i++) printf("%2d ", in[i]); puts("");
_mm512_i32scatter_epi32(out, perm, ident, 4);
for (int i = 0; i < 16; i++) printf("%2d ", out[i]); puts("");

An identity mapping ident is distributed to the out array according to the index pattern perm. The idea is basically the same as the one described for inverting a permutation. Here's the output:

 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 
14 10  6 12  1 15  2  4 11 13  8  5  0  3  9  7 
12  4  6 13  7 11  2 15 10 14  1  8  3  9  0  5 

Note that I have permutations in the mathematical sense (no duplicates). With duplicates, the out store needs to be initialized since some elements could remain unwritten.

I also see no easy way to accomplish this within registers. I thought about cycling through the given permutation by repeatedly applying a permute instruction: as soon as the identity pattern is reached, the permutation one step before is the inverse (this goes back to the idea by EOF on unzip operations). However, the cycles can be long. The maximum number of applications that may be required is given by Landau's function, which for 16 elements is 140, see this table. I could show that it is possible to shorten this to a maximum of 16 if the individual permutation subcycles are frozen as soon as they coincide with the identity elements. This shortens the average from 28 to 9 permute instructions in a test on random permutation patterns. However, it is still not an efficient solution (much slower than the scatter instruction in the throughput benchmark described in my other answer).
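A scalar model of that cycling approach (my own illustration; the register version would apply vpermd to its own output instead). The permutation is composed with itself until the identity appears; the composition one step earlier is the inverse:

```c
#include <string.h>

/* Apply permutation p to v, VPERMD-style: out[i] = v[p[i]]. */
static void apply_perm(const int *p, const int *v, int *out, int n) {
    for (int i = 0; i < n; i++)
        out[i] = v[p[i]];
}

/* Invert perm by cycling: perm^k is the identity for some k (the
 * permutation's order), so perm^(k-1) is the inverse.  Returns k.
 * n must be <= 16 for the fixed-size scratch buffers. */
static int invert_by_cycling(const int *perm, int *inv, int n) {
    int cur[16], next[16], k = 1;
    for (int i = 0; i < n; i++)
        inv[i] = i;                       /* identity is its own inverse */
    memcpy(cur, perm, n * sizeof(int));   /* cur = perm^1 */
    for (;;) {
        int is_identity = 1;
        for (int i = 0; i < n; i++)
            if (cur[i] != i) { is_identity = 0; break; }
        if (is_identity)
            return k;                     /* inv now holds perm^(k-1) */
        memcpy(inv, cur, n * sizeof(int));
        apply_perm(perm, cur, next, n);   /* advance to the next power */
        memcpy(cur, next, n * sizeof(int));
        k++;
    }
}
```

For the 8-element example from the question, {2 1 0 5 3 4 6 7} has cycle structure (0 2)(3 5 4), so its order is lcm(2, 3) = 6: the loop runs 6 steps to reach the identity, illustrating why this approach is slow compared to a single scatter.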

Id answered 17/12, 2020 at 6:53 Comment(2)
Have you benchmarked this against anything? Scatter instructions are not very fast; IDK if that was a good suggestion I made 4 years ago. But they're not terrible, and this might be better than any other option. Certainly it's much worse than a shuffle if you did have the right kind of indices. – Gadoid
@PeterCordes: I added another answer with a benchmark. Apparently there's no performance advantage of using the scatter. – Id
