How many clock cycles does AVX/SSE exponentiation cost on a modern x86_64 CPU?

I am asking about pow(x, y) = exp(y * log(x)).

That is, do the exp() and log() AVX x86_64 instructions each take a certain, known number of cycles?

Or can the number of cycles vary depending on the exponent? If so, is there a maximum number of cycles an exponentiation can cost?

Poet asked 19/7, 2015 at 14:10 Comment(7)
There's a maximum on any particular chip. There's no maximum enforced across all architectures that fall into the x86_64 category.Jumpy
There are no exp and log instructions; these are SVML functions.Hypochondriasis
There's an open source project which implements sin, cos, exp, and log functions using AVX. From there you could break those down into native instructions.Gibun
Do you know that the data is in the CPU cache? Otherwise all times are dominated by memory access.Molasses
@stark: It doesn't take much bandwidth to feed something as slow as exp + log. Doing that on an un-cached buffer will probably still be bottlenecked on the CPU, not on memory bandwidth. (HW prefetching means that actual load insns will find their data in L1 already.)Mirella
Throughput and latency aren't the same thing, with pipelined execution units. e.g. Haswell can issue 2 vector multiplies per clock, but the result of each one is only ready 5 cycles after it issues. (Pretty much everything but FP divide / sqrt is fully pipelined.) Anyway, the point being, vector exp and log functions might be short enough to fit in the ROB (re-order buffer), but have long enough dep chains that there's scope for more stuff to execute while the CPU works its way through the dependent chain of vector insns.Mirella
Thanks to register renaming, even another call to exp or log could have its execution happening in parallel with the first one, if the input data was ready. If not, eventually the ROB will fill, and new instructions can only issue at the rate the execution units can dispatch the uops in it. (So there's a limit how far ahead the CPU can look to find independent work to do, of ~100 instructions.)Mirella

The x86 SIMD instruction set (i.e. not x87), at least up to AVX2, does not include SIMD exp, log, or pow instructions, with the single exception of pow(x, 0.5), which is the square root.
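
For that one exception the hardware instruction is directly available as a real intrinsic (vsqrtps via _mm256_sqrt_ps); a minimal sketch:

#include <immintrin.h>

/* pow(x, 0.5) is the one case the hardware handles directly:
   vsqrtps computes eight single-precision square roots at once. */
__m256 pow_half(__m256 x)
{
    return _mm256_sqrt_ps(x);
}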

There are, however, SIMD math libraries built from SIMD instructions which provide these functions (among others). Intel's SVML includes:

__m256 _mm256_exp_ps(__m256)
__m256 _mm256_log_ps(__m256)
__m256 _mm256_pow_ps(__m256, __m256)

which Intel disingenuously calls intrinsics when they are in fact functions that compile to several instructions. SVML is closed source and expensive. However, by searching for svml after installing the Intel OpenCL runtime, I found some SVML files in the OpenCL directories, so I think you can get SVML indirectly through Intel's OpenCL runtime.
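
As a rough sketch (assuming a compiler that links against SVML, such as Intel's), the question's pow(x, y) = exp(y * log(x)) maps onto these functions like so:

#include <immintrin.h>

/* Sketch only: _mm256_exp_ps / _mm256_log_ps / _mm256_pow_ps are SVML
   functions, not single instructions, so each expands to many uops. */
__m256 pow_via_exp_log(__m256 x, __m256 y)
{
    return _mm256_exp_ps(_mm256_mul_ps(y, _mm256_log_ps(x)));
}

/* Or use the combined function directly: */
__m256 pow_direct(__m256 x, __m256 y)
{
    return _mm256_pow_ps(x, y);
}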

AMD also provides a SIMD math library called LibM, which is closed source but free, and which has its own SIMD math functions:

__m128 amd_vrs4_expf(__m128)
__m128 amd_vrs4_logf(__m128)
__m128 amd_vrs4_powf(__m128, __m128)
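
The same identity with LibM's four-wide single-precision functions might look like this (a sketch; the header name amdlibm.h is an assumption on my part):

#include <xmmintrin.h>
#include "amdlibm.h"   /* header name assumed; ships with AMD LibM */

/* pow(x, y) = exp(y * log(x)) with LibM's 4-wide float functions. */
__m128 powf4_via_exp_log(__m128 x, __m128 y)
{
    return amd_vrs4_expf(_mm_mul_ps(y, amd_vrs4_logf(x)));
}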

Agner Fog's Vector Class Library provides an interface to SVML and LibM. See the file vectormath_lib.h; from it you can figure out the corresponding SVML and LibM functions.

Agner also provides his own code for these functions, which he claims is competitive with the proprietary Intel and AMD versions. For Agner's version of the functions, look in vectormath_exp.h, e.g. at exp_f, log_f, and pow_template_f, and then look at the generated assembly.
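
With VCL, a minimal sketch looks like this (Vec8f is VCL's 8-float type; exp, log, and the pow template come from vectormath_exp.h):

#include "vectorclass.h"
#include "vectormath_exp.h"

// Agner's own implementation: exp(y * log(x)), eight floats at a time.
Vec8f pow_vcl(Vec8f x, Vec8f y)
{
    return exp(y * log(x));   // or: pow(x, y)
}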

You can use SVML, LibM, and Agner's own functions to time the exp and log functions. However, you should know that SVML and LibM don't play well on the other vendor's hardware. LibM, for example, is optimized for FMA4, which Intel does not have (Intel originally planned to support FMA4 and then switched to FMA3 suddenly, after AMD had already committed to FMA4). Intel appears to do something, ummm... well, I suggest you read about it.

So if you time SVML on AMD processors, or LibM on Intel processors, you will likely get very different performance results (unless you manage to replace Intel's CPU dispatch function). Unlike GPU instruction sets, the x86 instruction set is publicly documented, so you can build your own exp and log functions, and that is what Agner has done.
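
If you want to measure this yourself, here's a rough sketch of a latency test. Each iteration feeds the result back into the input, so calls cannot overlap (measuring throughput would instead use independent inputs). Scalar expf() here is just a placeholder; swap in whichever vector function you are testing:

//gcc ./bench.c -O2 -lm
#include <math.h>
#include <stdio.h>
#include <time.h>

/* Rough latency test: each call depends on the previous result, so
   iterations cannot overlap. expf() is a scalar stand-in; substitute
   the SVML/LibM/VCL function under test. */
int main(void)
{
    const int iters = 10000000;
    volatile float seed = 0.5f;   /* volatile: defeat constant folding */
    float x = seed;
    struct timespec t0, t1;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < iters; i++)
        x = expf(x * 1e-7f);      /* dependent chain */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    printf("%.1f ns per call (x = %f)\n", ns / iters, x);
    return 0;
}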


Update

Glibc 2.22 (which should come out soon) has a vector math library called libmvec. Apparently it's enabled starting at -O1, together with -ffast-math and -fopenmp. I'm not sure why fast-math and OpenMP are necessary (particularly in the example below, since associative math is not needed), but it's great to finally have a SIMD math library in the GNU C standard library.

//gcc ./cos.c -O1 -fopenmp -ffast-math -lm -mavx2
#include <math.h>

int N = 3200;
double b[3200];
double a[3200];

int main (void)
{
  int i;

  /* With -fopenmp and -ffast-math, GCC vectorizes this loop and calls
     libmvec's vector cosine (e.g. _ZGVdN4v_cos on an AVX2 target)
     instead of scalar cos(). */
  #pragma omp simd
  for (i = 0; i < N; i += 1)
  {
    b[i] = cos (a[i]);
  }

  return (0);
}
Auberta answered 20/7, 2015 at 11:39 Comment(1)
OpenMP is needed because libmvec supports the "SIMD constructs of OpenMP 4.0".Grimes
