FFTW plan creation using OpenMP
Asked Answered
P

1

7

I am trying to perform several FFT's in parallel. I am using FFTW and OpenMP. Each FFT is different, so I'm not relying on FFTW's build-in multithreading (which I know uses OpenMP).

int m;

// assume:
// int numberOfColumns = 100;
// int numberOfRows = 100;

#pragma omp parallel for default(none) private(m) shared(numberOfColumns, numberOfRows)//  num_threads(4)
    for(m = 0; m < 36; m++){

        // create pointers
        double          *inputTest;
        fftw_complex    *outputTest;
        fftw_plan       testPlan;

        // preallocate vectors for FFTW
         outputTest = (fftw_complex*)fftw_malloc(sizeof(fftw_complex)*numberOfRows*numberOfColumns);
         inputTest  = (double *)fftw_malloc(sizeof(double)*numberOfRows*numberOfColumns);

         // confirm that preallocation worked
         if (inputTest == NULL || outputTest == NULL){
             logger_.log_error("\t\t FFTW memory not allocated on m = %i", m);
         }

         // EDIT: insert data into inputTest
         inputTest = someDataSpecificToThisIteration(m); // same size for all m

        // create FFTW plan
        #pragma omp critical (make_plan)
        {
            testPlan = fftw_plan_dft_r2c_2d(numberOfRows, numberOfColumns, inputTest, outputTest, FFTW_ESTIMATE);
        }

         // confirm that plan was created correctly
         if (testPlan == NULL){
             logger_.log_error("\t\t failed to create plan on m = %i", m);
         }

        // execute plan
         fftw_execute(testPlan);

        // clean up
         fftw_free(inputTest);
         fftw_free(outputTest);
         fftw_destroy_plan(testPlan);

    }// end parallelized for loop

This all works fine. However, if I remove the critical construct from around the plan creation (fftw_plan_dft_r2c_2d) my code will fail. Can someone explain why? fftw_plan_dft_r2c_2d isn't really an "orphan", right? Is it because two threads might both try to hit the numberOfRows or numberOfColumns memory location at the same time?

Philippic answered 21/2, 2013 at 20:56 Comment(3)
You are NOT using fftw's multi-threading capabilities. You are actually making 36 single-threaded transforms in parallel.Cristacristabel
I know. I say that in my initial question Each FFT is different, so I'm not relying on FFTW's build-in multithreading . I WANT to make 36 single-threaded transformations in parallel.Philippic
Sorry, my mistake, I read exactly the opposite ;-)Cristacristabel
T
7

It's pretty much all written in the FFTW documentation about thread safety:

... but some care must be taken because the planner routines share data (e.g. wisdom and trigonometric tables) between calls and plans.

The upshot is that the only thread-safe (re-entrant) routine in FFTW is fftw_execute (and the new-array variants thereof). All other routines (e.g. the planner) should only be called from one thread at a time. So, for example, you can wrap a semaphore lock around any calls to the planner; even more simply, you can just create all of your plans from one thread. We do not think this should be an important restriction (FFTW is designed for the situation where the only performance-sensitive code is the actual execution of the transform), and the benefits of shared data between plans are great.

In a typical application of FFT plans are constructed seldom, so it doesn't really matter if you have to synchronise their creation. In your case you don't need to create a new plan at each iteration, unless the dimension of the data changes. You would rather do the following:

#pragma omp parallel default(none) private(m) shared(numberOfColumns, numberOfRows)
{
   // create pointers
   double          *inputTest;
   fftw_complex    *outputTest;
   fftw_plan       testPlan;

   // preallocate vectors for FFTW
   outputTest = (fftw_complex*)fftw_malloc(sizeof(fftw_complex)*numberOfRows*numberOfColumns);
   inputTest  = (double *)fftw_malloc(sizeof(double)*numberOfRows*numberOfColumns);

   // confirm that preallocation worked
   if (inputTest == NULL || outputTest == NULL){
      logger_.log_error("\t\t FFTW memory not allocated on m = %i", m);
   }

   // create FFTW plan
   #pragma omp critical (make_plan)
   testPlan = fftw_plan_dft_r2c_2d(numberOfRows, numberOfColumns, inputTest, outputTest, FFTW_ESTIMATE);

   #pragma omp for
   for (m = 0; m < 36; m++) {
      // execute plan
      fftw_execute(testPlan);
   }

   // clean up
   fftw_free(inputTest);
   fftw_free(outputTest);
   fftw_destroy_plan(testPlan);
}

Now the plans are created only once in each thread and the serialisation overhead would diminish with each execution of fftw_execute(). If running on a NUMA system (e.g. a multi-socket AMD64 or Intel (post-)Nehalem system), then you should enable thread binding in order to achieve maximum performance.

Tankersley answered 21/2, 2013 at 22:24 Comment(7)
I just read that part of the manual...I came back to answer my own question and saw yours. You get the check. You say "unless the dimension of the data changes" but what if the dimensions are the same but the values are different? I updated my original question to reflect this.Philippic
@tir38, that's why you execute the plan multiple times, isn't it? A single plan is OK as long as you reuse the input and output arrays. Just don't assign to inputTest like that since it is a pointer. You would rather have something like someDataSpecificToThisIteration(m, inputData) and have the output from the function put into inputData.Tankersley
sorry again. I meant someDataSpecificToThisIteration[m]. Its not a method call, but a pull from some generic array. So I'm just having inputData be the pointer to that data. Because I have effectively 36 pointers to 36 array entries, I need 36 plans, right?Philippic
@tir38, no, you don't. You just need to execute each plan a bit differently. See here.Tankersley
Excellent. That worked! I'm not sure its faster for my application but it works.Philippic
Unless all arrays fit together in the CPU(s) cache(s), you would be memory bound and won't see any great performance increase. But if using more than one thread gives you access to more memory bandwidth (e.g. on a multisocket system where each CPU has its own memory controller), then you may expect that performance would scale roughly with the bandwidth (e.g. with the number of CPU sockets).Tankersley
You can use the fftw_execute_dft_r2c. Here you can give new pointer to input and output. It is also thread-safe.Emplacement

© 2022 - 2024 — McMap. All rights reserved.