FFTW plan creation using OpenMP

int m;
// assume:
// int numberOfColumns = 100;
// int numberOfRows = 100;

#pragma omp parallel for default(none) private(m) shared(numberOfColumns, numberOfRows) // num_threads(4)
for(m = 0; m < 36; m++){
   // create pointers
   double          *inputTest;
   fftw_complex    *outputTest;
   fftw_plan       testPlan;

   // preallocate vectors for FFTW
   outputTest = (fftw_complex*)fftw_malloc(sizeof(fftw_complex)*numberOfRows*numberOfColumns);
   inputTest  = (double *)fftw_malloc(sizeof(double)*numberOfRows*numberOfColumns);

   // confirm that preallocation worked
   if (inputTest == NULL || outputTest == NULL){
      logger_.log_error("\t\t FFTW memory not allocated on m = %i", m);
   }

   // EDIT: insert data into inputTest (copy it into the fftw_malloc'd buffer;
   // reassigning the pointer would leak the buffer and pass foreign memory to fftw_free)
   memcpy(inputTest, someDataSpecificToThisIteration(m), sizeof(double)*numberOfRows*numberOfColumns); // same size for all m

   // create FFTW plan
   #pragma omp critical (make_plan)
   {
      testPlan = fftw_plan_dft_r2c_2d(numberOfRows, numberOfColumns, inputTest, outputTest, FFTW_ESTIMATE);
   }

   // confirm that plan was created correctly
   if (testPlan == NULL){
      logger_.log_error("\t\t failed to create plan on m = %i", m);
   }

   // execute plan
   fftw_execute(testPlan);

   // clean up
   fftw_free(inputTest);
   fftw_free(outputTest);
   fftw_destroy_plan(testPlan);
}// end parallelized for loop

It's pretty much all written in the FFTW documentation about thread safety:

... but some care must be taken because the planner routines share data (e.g. wisdom and trigonometric tables) between calls and plans.

The upshot is that the only thread-safe (re-entrant) routine in FFTW is fftw_execute (and the new-array variants thereof). All other routines (e.g. the planner) should only be called from one thread at a time. So, for example, you can wrap a semaphore lock around any calls to the planner; even more simply, you can just create all of your plans from one thread. We do not think this should be an important restriction (FFTW is designed for the situation where the only performance-sensitive code is the actual execution of the transform), and the benefits of shared data between plans are great.

In a typical FFT application, plans are constructed seldom, so it doesn't really matter if you have to synchronise their creation. In your case you don't need to create a new plan at each iteration, unless the dimensions of the data change. Instead, create one plan per thread and reuse it for all of that thread's iterations:

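#pragma omp parallel default(none) private(m) shared(numberOfColumns, numberOfRows)
{
   // create pointers
   double          *inputTest;
   fftw_complex    *outputTest;
   fftw_plan       testPlan;

   // preallocate vectors for FFTW
   outputTest = (fftw_complex*)fftw_malloc(sizeof(fftw_complex)*numberOfRows*numberOfColumns);
   inputTest  = (double *)fftw_malloc(sizeof(double)*numberOfRows*numberOfColumns);

   // confirm that preallocation worked (m is not set yet here, so log the thread number)
   if (inputTest == NULL || outputTest == NULL){
      logger_.log_error("\t\t FFTW memory not allocated on thread %i", omp_get_thread_num());
   }

   // create FFTW plan: the planner is not thread-safe, so serialise it
   #pragma omp critical (make_plan)
   testPlan = fftw_plan_dft_r2c_2d(numberOfRows, numberOfColumns, inputTest, outputTest, FFTW_ESTIMATE);

   #pragma omp for
   for (m = 0; m < 36; m++) {
      // load this iteration's data into the buffer the plan was created with
      memcpy(inputTest, someDataSpecificToThisIteration(m), sizeof(double)*numberOfRows*numberOfColumns);

      // execute plan (thread-safe); use outputTest before the next iteration overwrites it
      fftw_execute(testPlan);
   }

   // clean up: fftw_destroy_plan is not thread-safe either, so reuse the same critical section
   #pragma omp critical (make_plan)
   fftw_destroy_plan(testPlan);
   fftw_free(inputTest);
   fftw_free(outputTest);
}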

Now each thread creates its plan only once, so the serialised planning cost is paid once per thread instead of once per iteration and is amortised over all of that thread's fftw_execute() calls. If running on a NUMA system (e.g. a multi-socket AMD64 or Intel Nehalem-or-later system), you should also enable thread binding for maximum performance, so that each thread stays close to the memory holding its buffers.
