AVX512 log2 or pow instructions
Asked Answered
L

0

8

I need a AVX512 double pow(double, int n) function (I need it for a binomial distribution calculation which needs to be exact). In particular I would like this for Knights Landing which has AVX512ER. One way to get this is

x^n = exp2(log2(x)*n)

Knights Corner has the vlog2ps instruction (_mm512_log2_ps intrinsic) and the vexp223ps instruction (_mm512_exp223_ps intrinsic) so at least I could do float pow(float, float) with those two instructions.

However, with Knights Landing I don't find a log2 instruction. I do find a vexp2pd instruction (_mm512_exp2a23_pd intrinsic) in AVX512ER. I find it strange that Knights Corner has a log2 instruction but Knights Landing which is newer and better does not.

For now I have implemented pow(double, n) using repeated squaring but I think it would be more efficient if I had a log2 instruction.

//AVX2 but easy to convert to AVX512 with mask registers
static __m256d pown_AVX2(__m256d base, __m256i exp) {
  __m256d result = _mm256_set1_pd(1.0);
  int mask = _mm256_testz_si256(exp, exp);
  __m256i onei = _mm256_set1_epi64x(1);
  __m256d onef = _mm256_set1_pd(1.0);
  while(!mask) {
    __m256i t1 = _mm256_and_si256(exp, onei);
    __m256i t2 = _mm256_cmpeq_epi64(t1, _mm256_setzero_si256());
    __m256d t3 = _mm256_blendv_pd(base, onef, _mm256_castsi256_pd(t2));
    result = _mm256_mul_pd(result, t3);
    exp = _mm256_srli_epi64(exp, 1);
    base = _mm256_mul_pd(base,base);
    mask = _mm256_testz_si256(exp, exp);
  }
  return result;
}

Is there a more efficient algorithm to get double pow(double, int n) with AVX512 and AVX512ER than repeated squaring? Is there an easy method (e.g. with a few instructions) to get log2?


Here is the AVX512F version using repeated squaring

static  __m512d pown_AVX512(__m512d base, __m512i pexp) {
  __m512d result = _mm512_set1_pd(1.0);
  __m512i onei = _mm512_set1_epi32(1);
  __mmask8 mask;
  do {
    __m512i t1 = _mm512_and_epi32(pexp, onei);
    __mmask8 mask2 = _mm512_cmp_epi32_mask(onei, t1, 0);
    result = _mm512_mask_mul_pd(result, mask2, result, base);
    pexp = _mm512_srli_epi32(pexp, 1);
    base = _mm512_mul_pd(base,base);
    mask = _mm512_test_epi32_mask(pexp, pexp);
  } while(mask);
  return result;
}

The exponents are int32 not int64. Ideally I would use __m256i for the eight integers. However, this requires AVX512VL which extends the 512b operations to 256b and 128b but KNL does not have AVX512VL. Instead I use the 512b operations on 32-bit integers and I cast the 16b mask to 8b.

Lactoprotein answered 7/2, 2017 at 9:36 Comment(15)
x^n = log2(x)*2^n is not an identity: 4^3 = 64 != log2(4)*2^3 = 16Zobkiw
mmm I found the VGETEXPPD that computes, according to Intel, floor(log2(|x|)). It should be in the AVX2 Foundation class. Can that be useful?Zobkiw
@MargaretBloom, good fine. I actually saw those yesterday and forgot to mention them. I'm not sure if they are sufficient. I need exp2(log2(x)*n) not exp2(floor(log2(x)*n)) but maybe another step would fix it.Lactoprotein
If you pre-compute the bit-length of the largest n, you can use bsr to loop over an integer which eliminates the _mm256_testz_si256(exp, exp).Melpomene
If you reverse the direction of the iteration and pre-shift exp so that you're testing the sign bit, you can eliminate both the _mm256_and_si256(exp, onei) and the _mm256_cmpeq_epi64(t1, _mm256_setzero_si256()). But that puts the squaring on the critical path.Melpomene
@Mysticial, I understand your first comment about bsr I think. But how does that help? Is vtestpd sets the flag register. Even if I decremented an int to zero I would still need to test for zero each iteration. Maybe I'm not thinking straight right now but it seems to me I would be replacing vtestpd for dec and adding a max and bsr instruction before the loop. I am still thinking about your second comment. I will probably have to ask more tomorrow.Lactoprotein
If you call bsr on the largest exponent, you can get the exact number of iterations you need to run. That way your loop condition changes to an integer check instead of the more expensive _mm256_testz_si256(). Not sure if that's actually worth the savings though.Melpomene
If x is fixed, you can precompute the powers of x and eliminate all the squares. If the exponent n is also constant among all the SIMD lanes, you can then skip entire iterations that correspond to zero-bits in n. And it's a perfect place to use bsr or bsf which also lets you branch out immediately when it reaches zero.Melpomene
@Mysticial, neither x nor n are fixed. The range of n is limited to n<=100 though. We might get more benefit from sorting n first. I will try your bsr suggestion.Lactoprotein
@Melpomene how do I get the maximum n for bsr efficiently? Horizontal max is hideous.Lactoprotein
@Mysticial, I updated my answer with a AVX512 version which I have tested. If I could get maximum bit I could unroll the loop into a switch statement with 7 cases since n<=100 the maximum bit is 7.Lactoprotein
I wonder what the down vote was for. I'm probably doing something silly with AVX512 which could be done better.Lactoprotein
It will depend on whether you have enough iterations to offset the cost of the horizontal max. The AVX512 version doesn't lend itself to abusing the sign-bit for the mask. If you have a lot of iterations (very large exponent), I wonder if the windowing algorithm is suitable using a permute-based lookup table.Melpomene
@Zboson "Is there a more efficient algorithm to get double pow(double, int n) with AVX512 and AVX512ER than repeated squaring?" For small exponents like in your case (n<=100), you could probably use addition-chain exponentiation which will be slightly faster, but requires more memory. However, not sure if it's worth it.Everara
An easy solution is to use the Vector Class Library. It has optimized power functions with integer and floating point powers. github.com/vectorclass/version2Accouter

© 2022 - 2024 — McMap. All rights reserved.