I need a AVX512 double pow(double, int n)
function (I need it for a binomial distribution calculation which needs to be exact). In particular I would like this for Knights Landing which has AVX512ER. One way to get this is
x^n = exp2(log2(x)*n)
Knights Corner has the vlog2ps
instruction (_mm512_log2_ps
intrinsic) and the vexp223ps
instruction (_mm512_exp223_ps intrinsic
) so at least I could do float pow(float, float)
with those two instructions.
However, with Knights Landing I don't find a log2
instruction. I do find a vexp2pd
instruction (_mm512_exp2a23_pd
intrinsic) in AVX512ER. I find it strange that Knights Corner has a log2
instruction but Knights Landing which is newer and better does not.
For now I have implemented pow(double, n)
using repeated squaring but I think it would be more efficient if I had a log2
instruction.
//AVX2 but easy to convert to AVX512 with mask registers
static __m256d pown_AVX2(__m256d base, __m256i exp) {
__m256d result = _mm256_set1_pd(1.0);
int mask = _mm256_testz_si256(exp, exp);
__m256i onei = _mm256_set1_epi64x(1);
__m256d onef = _mm256_set1_pd(1.0);
while(!mask) {
__m256i t1 = _mm256_and_si256(exp, onei);
__m256i t2 = _mm256_cmpeq_epi64(t1, _mm256_setzero_si256());
__m256d t3 = _mm256_blendv_pd(base, onef, _mm256_castsi256_pd(t2));
result = _mm256_mul_pd(result, t3);
exp = _mm256_srli_epi64(exp, 1);
base = _mm256_mul_pd(base,base);
mask = _mm256_testz_si256(exp, exp);
}
return result;
}
Is there a more efficient algorithm to get double pow(double, int n)
with AVX512 and AVX512ER than repeated squaring? Is there an easy method (e.g. with a few instructions) to get log2
?
Here is the AVX512F version using repeated squaring
static __m512d pown_AVX512(__m512d base, __m512i pexp) {
__m512d result = _mm512_set1_pd(1.0);
__m512i onei = _mm512_set1_epi32(1);
__mmask8 mask;
do {
__m512i t1 = _mm512_and_epi32(pexp, onei);
__mmask8 mask2 = _mm512_cmp_epi32_mask(onei, t1, 0);
result = _mm512_mask_mul_pd(result, mask2, result, base);
pexp = _mm512_srli_epi32(pexp, 1);
base = _mm512_mul_pd(base,base);
mask = _mm512_test_epi32_mask(pexp, pexp);
} while(mask);
return result;
}
The exponents are int32 not int64. Ideally I would use __m256i
for the eight integers. However, this requires AVX512VL which extends the 512b operations to 256b and 128b but KNL does not have AVX512VL. Instead I use the 512b operations on 32-bit integers and I cast the 16b mask to 8b.
x^n = log2(x)*2^n
is not an identity: 4^3 = 64 != log2(4)*2^3 = 16 – ZobkiwVGETEXPPD
that computes, according to Intel, floor(log2(|x|)). It should be in the AVX2 Foundation class. Can that be useful? – Zobkiwexp2(log2(x)*n)
notexp2(floor(log2(x)*n))
but maybe another step would fix it. – Lactoproteinn
, you can usebsr
to loop over an integer which eliminates the_mm256_testz_si256(exp, exp)
. – Melpomeneexp
so that you're testing the sign bit, you can eliminate both the_mm256_and_si256(exp, onei)
and the_mm256_cmpeq_epi64(t1, _mm256_setzero_si256())
. But that puts the squaring on the critical path. – Melpomenebsr
I think. But how does that help? Isvtestpd
sets the flag register. Even if I decremented an int to zero I would still need to test for zero each iteration. Maybe I'm not thinking straight right now but it seems to me I would be replacingvtestpd
fordec
and adding amax
andbsr
instruction before the loop. I am still thinking about your second comment. I will probably have to ask more tomorrow. – Lactoproteinbsr
on the largest exponent, you can get the exact number of iterations you need to run. That way your loop condition changes to an integer check instead of the more expensive_mm256_testz_si256()
. Not sure if that's actually worth the savings though. – Melpomenex
is fixed, you can precompute the powers ofx
and eliminate all the squares. If the exponentn
is also constant among all the SIMD lanes, you can then skip entire iterations that correspond to zero-bits inn
. And it's a perfect place to usebsr
orbsf
which also lets you branch out immediately when it reaches zero. – Melpomenex
norn
are fixed. The range ofn
is limited ton<=100
though. We might get more benefit from sortingn
first. I will try yourbsr
suggestion. – Lactoproteinn
forbsr
efficiently? Horizontal max is hideous. – Lactoproteinn<=100
the maximum bit is 7. – Lactoprotein