I have an image processing algorithm that accumulates a*b+c*d
with AVX. The pseudo code is as follows:
// needs <immintrin.h> for the intrinsics and <opencv2/core.hpp> for the timer
float *a = new float[N];
float *b = new float[N];
float *c = new float[N];
float *d = new float[N];
// assign values to a, b, c and d
__m256 sum = _mm256_setzero_ps(); // the accumulator must start at zero
double start = cv::getTickCount();
for (int i = 0; i < N; i += 8) // assume that N is a multiple of 8
{
    __m256 am = _mm256_loadu_ps(a + i);
    __m256 bm = _mm256_loadu_ps(b + i);
    __m256 cm = _mm256_loadu_ps(c + i);
    __m256 dm = _mm256_loadu_ps(d + i);
    __m256 abm = _mm256_mul_ps(am, bm);
    __m256 cdm = _mm256_mul_ps(cm, dm);
    __m256 abcdm = _mm256_add_ps(abm, cdm);
    sum = _mm256_add_ps(sum, abcdm); // the only operation that depends on the previous iteration
}
double time1 = (cv::getTickCount() - start) / cv::getTickFrequency();
I then replaced _mm256_mul_ps and _mm256_add_ps in the code above with _mm256_fmadd_ps, as follows:
float *a = new float[N];
float *b = new float[N];
float *c = new float[N];
float *d = new float[N];
// assign values to a, b, c and d
__m256 sum = _mm256_setzero_ps(); // the accumulator must start at zero
double start = cv::getTickCount();
for (int i = 0; i < N; i += 8) // assume that N is a multiple of 8
{
    __m256 am = _mm256_loadu_ps(a + i);
    __m256 bm = _mm256_loadu_ps(b + i);
    __m256 cm = _mm256_loadu_ps(c + i);
    __m256 dm = _mm256_loadu_ps(d + i);
    sum = _mm256_fmadd_ps(am, bm, sum); // sum = am * bm + sum
    sum = _mm256_fmadd_ps(cm, dm, sum); // sum = cm * dm + sum, waits for the line above
}
double time2 = (cv::getTickCount() - start) / cv::getTickFrequency();
But the second version is slower than the first! The first version's execution time (time1) is 50 ms, while the second version's (time2) is 90 ms. Is _mm256_fmadd_ps slower than _mm256_mul_ps + _mm256_add_ps?
I use Ubuntu 16.04 and GCC 7.5.0, with compiler flags: -fopenmp -march=native -O3
You're not including new in your timing, are you? – And
In the first you have one _mm256_add_ps which depends on the result from the previous loop iteration, in the second you have two _mm256_fmadd_ps which depend on each other and the previous loop iteration. (There is probably a duplicate for this ...) – Bev
[...] sum). I decided it might be worth posting a summary answer instead of just the link. – Charlatan
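The dependency-chain point raised in the comments can be illustrated with a minimal sketch in which each product stream gets its own accumulator, so the two _mm256_fmadd_ps calls in one iteration no longer wait on each other. This is only an illustration under stated assumptions, not a measured fix for the timings above: the function name sum_products_two_chains is hypothetical, n is assumed to be a multiple of 8, and the GCC/Clang target attribute is used only so the snippet builds without -march=native. Results can differ slightly from the single-accumulator loop because floating-point addition is not associative.

```cpp
#include <immintrin.h>
#include <cstddef>

// Hypothetical helper (not from the question): returns the sum over i of
// a[i]*b[i] + c[i]*d[i]. Assumes n is a multiple of 8.
// The target attribute (GCC/Clang) enables AVX2/FMA for this function only.
__attribute__((target("avx2,fma")))
float sum_products_two_chains(const float *a, const float *b,
                              const float *c, const float *d, std::size_t n)
{
    // Two independent accumulators: each FMA depends only on its own
    // accumulator from the previous iteration, not on the other FMA.
    __m256 sum_ab = _mm256_setzero_ps();
    __m256 sum_cd = _mm256_setzero_ps();

    for (std::size_t i = 0; i < n; i += 8)
    {
        __m256 am = _mm256_loadu_ps(a + i);
        __m256 bm = _mm256_loadu_ps(b + i);
        __m256 cm = _mm256_loadu_ps(c + i);
        __m256 dm = _mm256_loadu_ps(d + i);
        sum_ab = _mm256_fmadd_ps(am, bm, sum_ab); // chain 1
        sum_cd = _mm256_fmadd_ps(cm, dm, sum_cd); // chain 2, independent of chain 1
    }

    // Combine the two chains once, after the loop.
    __m256 sum = _mm256_add_ps(sum_ab, sum_cd);

    // Horizontal sum of the 8 lanes via a small temporary buffer.
    float lanes[8];
    _mm256_storeu_ps(lanes, sum);
    float total = 0.0f;
    for (int k = 0; k < 8; ++k)
        total += lanes[k];
    return total;
}
```

Whether this closes the gap depends on whether the loop is bound by FMA latency or by memory bandwidth, so it is worth re-timing both versions with the allocation and initialization kept outside the measured region.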