Scaling in inverse FFT by cuFFT

Asked 21/1, 2013 at 14:50 Answered 5/12, 2016 at 16:10

Whenever I'm plotting the values obtained by a programme using the cuFFT and comparing the results with that of Matlab, I'm getting the same shape of graphs and the values of maxima and minima are getting at the same points. However, the values resulting by the cuFFT are much greater than those resulting from Matlab. The Matlab code is

fs = 1000;                              % sample freq
D = [0:1:4]';                           % pulse delay times
t = 0 : 1/fs : 4000/fs;                 % signal evaluation time
w = 0.5;                                % width of each pulse
yp = pulstran(t,D,'rectpuls',w);
filt = conj(fliplr(yp));
xx = fft(yp,1024).*fft(filt,1024);
xx = (abs(ifft(xx)));

and the CUDA code with the same input is like:

cufftExecC2C(plan, (cufftComplex *)d_signal, (cufftComplex *)d_signal, CUFFT_FORWARD);
cufftExecC2C(plan, (cufftComplex *)d_filter_signal, (cufftComplex *)d_filter_signal,     CUFFT_FORWARD);
ComplexPointwiseMul<<<blocksPerGrid, threadsPerBlock>>>(d_signal, d_filter_signal, NX);
cufftExecC2C(plan, (cufftComplex *)d_signal, (cufftComplex *)d_signal, CUFFT_INVERSE);

The cuFFT performs also a 1024 points FFT with batch size of 2.

With the scaling factor of NX=1024, the values are not coming correct. Please tell what to do.

Peat answered 21/1, 2013 at 14:50 Comment(2)

I don't think there is any easy way to handle scaling directly inside cufft. Either write your own kernel or use thrust later to scale down the signal. – Orfurd 23/1, 2013 at 18:43

note that in the cufft sample code a division by the number of data elements is suggested, to return the original data, after the CUFFT_INVERSE operation. – Unexpressed 29/5, 2014 at 17:49

This is a late answer to remove this question from the unanswered list.

You are not giving enough information to diagnose your problem, since you are missing to specify the way you are setting up the cuFFT plan. You are even not specifying whether you have exactly the same shape for the Matlab's and cuFFT's signals (so you have just a scaling) or you have approximately the same shape. However, let me make the following two observations:

The yp vector has 4000 elements; opposite to thatm by fft(yp,1024), you are performing an FFT by truncating the signal to 1024 elements;
The inverse cuFFT does not perform the scaling by the number of vector elements.

For the sake of convenience (it could be useful to other users), I'm reporting below a simple FFT-IFFT scheme which includes also the scaling performed by using the CUDA Thrust library.

#include <cufft.h>
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>

/*********************/
/* SCALE BY CONSTANT */
/*********************/
class Scale_by_constant
{
    private:
        float c_;

    public:
        Scale_by_constant(float c) { c_ = c; };

        __host__ __device__ float2 operator()(float2 &a) const
        {
            float2 output;

            output.x = a.x / c_;
            output.y = a.y / c_;

            return output;
        }

};

int main(void){

    const int N=4;

    // --- Setting up input device vector    
    thrust::device_vector<float2> d_vec(N,make_cuComplex(1.f,2.f));

    cufftHandle plan;
    cufftPlan1d(&plan, N, CUFFT_C2C, 1);

    // --- Perform in-place direct Fourier transform
    cufftExecC2C(plan, thrust::raw_pointer_cast(d_vec.data()),thrust::raw_pointer_cast(d_vec.data()), CUFFT_FORWARD);

    // --- Perform in-place inverse Fourier transform
    cufftExecC2C(plan, thrust::raw_pointer_cast(d_vec.data()),thrust::raw_pointer_cast(d_vec.data()), CUFFT_INVERSE);

    thrust::transform(d_vec.begin(), d_vec.end(), d_vec.begin(), Scale_by_constant((float)(N)));

    // --- Setting up output host vector    
    thrust::host_vector<float2> h_vec(d_vec);

    for (int i=0; i<N; i++) printf("Element #%i; Real part = %f; Imaginary part: %f\n",i,h_vec[i].x,h_vec[i].y);

    getchar();
}

Servetnick answered 29/5, 2014 at 16:40 Comment(2)

I like the simplicity, but how do you think the performance compares to using an FFT callback? Does the Thrust version add a round trip through global memory? – Topgallant 15/11, 2016 at 19:42

@AndreasYankopolus I have added a further answer analyzing performance. – Servetnick 5/12, 2016 at 16:10

With the introduction of the cuFFT callback feature, the normalization required by the inverse FFT performed by the cuFFT can be embedded directly within the cufftExecC2C call by defining the normalization operation as a __device__ function.

Besides the cuFFT User Guide, for the cuFFT callback features, see

CUDA Pro Tip: Use cuFFT Callbacks for Custom Data Processing

Below is an example of implementing the IFFT normalization by cuFFT callback.

    #include <stdio.h>
    #include <assert.h>

    #include "cuda_runtime.h"
    #include "device_launch_parameters.h"

    #include <cufft.h>
    #include <cufftXt.h>

    /********************/
    /* CUDA ERROR CHECK */
    /********************/
    #define gpuErrchk(ans) { gpuAssert((ans), __FILE__, __LINE__); }
    inline void gpuAssert(cudaError_t code, char *file, int line, bool abort=true)
    {
        if (code != cudaSuccess)
        {
            fprintf(stderr,"GPUassert: %s %s %d\n", cudaGetErrorString(code), file, line);
            if (abort) exit(code);
        }
    }

    /*********************/
    /* CUFFT ERROR CHECK */
    /*********************/
    // See https://mcmap.net/q/1468318/-cufft-error-handling
    #ifdef _CUFFT_H_
    static const char *_cudaGetErrorEnum(cufftResult error)
    {
        switch (error)
        {
            case CUFFT_SUCCESS:
                return "CUFFT_SUCCESS";

            case CUFFT_INVALID_PLAN:
                return "CUFFT_INVALID_PLAN";

            case CUFFT_ALLOC_FAILED:
                return "CUFFT_ALLOC_FAILED";

            case CUFFT_INVALID_TYPE:
                 return "CUFFT_INVALID_TYPE";

            case CUFFT_INVALID_VALUE:
                return "CUFFT_INVALID_VALUE";

            case CUFFT_INTERNAL_ERROR:
                return "CUFFT_INTERNAL_ERROR";

            case CUFFT_EXEC_FAILED:
                return "CUFFT_EXEC_FAILED";

            case CUFFT_SETUP_FAILED:
                return "CUFFT_SETUP_FAILED";

            case CUFFT_INVALID_SIZE:
                return "CUFFT_INVALID_SIZE";

            case CUFFT_UNALIGNED_DATA:
                return "CUFFT_UNALIGNED_DATA";
        }

        return "<unknown>";
    }
    #endif

    #define cufftSafeCall(err)      __cufftSafeCall(err, __FILE__, __LINE__)
    inline void __cufftSafeCall(cufftResult err, const char *file, const int line)
    {
        if( CUFFT_SUCCESS != err) {
        fprintf(stderr, "CUFFT error in file '%s', line %d\n %s\nerror %d: %s\nterminating!\n",__FILE__, __LINE__,err, \
                                _cudaGetErrorEnum(err)); \
        cudaDeviceReset(); assert(0); \
        }
    }

    __device__ void IFFT_Scaling(void *dataOut, size_t offset, cufftComplex element, void *callerInfo, void *sharedPtr) {

        float *scaling_factor = (float*)callerInfo;

        float2 output;
        output.x = cuCrealf(element);
        output.y = cuCimagf(element);

        output.x = output.x / scaling_factor[0];
        output.y = output.y / scaling_factor[0];

        ((float2*)dataOut)[offset] = output;
}

    __device__ cufftCallbackStoreC d_storeCallbackPtr = IFFT_Scaling;

    /********/
    /* MAIN */
    /********/
    int main() {

        const int N = 16;

        cufftHandle plan;

        float2 *h_input             = (float2*)malloc(N*sizeof(float2));
        float2 *h_output1           = (float2*)malloc(N*sizeof(float2));
        float2 *h_output2           = (float2*)malloc(N*sizeof(float2));

        float2 *d_input;            gpuErrchk(cudaMalloc((void**)&d_input, N*sizeof(float2)));
        float2 *d_output1;          gpuErrchk(cudaMalloc((void**)&d_output1, N*sizeof(float2)));
        float2 *d_output2;          gpuErrchk(cudaMalloc((void**)&d_output2, N*sizeof(float2)));

        float *h_scaling_factor     = (float*)malloc(sizeof(float));
        h_scaling_factor[0] = 16.0f;
        float *d_scaling_factor;    gpuErrchk(cudaMalloc((void**)&d_scaling_factor, sizeof(float)));
        gpuErrchk(cudaMemcpy(d_scaling_factor, h_scaling_factor, sizeof(float), cudaMemcpyHostToDevice));

        for (int i=0; i<N; i++) {
            h_input[i].x = 1.0f;
            h_input[i].y = 0.f;
        }

        gpuErrchk(cudaMemcpy(d_input, h_input, N*sizeof(float2), cudaMemcpyHostToDevice));

        cufftSafeCall(cufftPlan1d(&plan, N, CUFFT_C2C, 1));

        cufftSafeCall(cufftExecC2C(plan, d_input, d_output1, CUFFT_FORWARD));
        gpuErrchk(cudaMemcpy(h_output1, d_output1, N*sizeof(float2), cudaMemcpyDeviceToHost));
        for (int i=0; i<N; i++) printf("Direct transform - %d - (%f, %f)\n", i, h_output1[i].x, h_output1[i].y);

        cufftCallbackStoreC h_storeCallbackPtr;
        gpuErrchk(cudaMemcpyFromSymbol(&h_storeCallbackPtr, d_storeCallbackPtr, sizeof(h_storeCallbackPtr)));

        cufftSafeCall(cufftXtSetCallback(plan, (void **)&h_storeCallbackPtr, CUFFT_CB_ST_COMPLEX, (void **)&d_scaling_factor));

        cufftSafeCall(cufftExecC2C(plan, d_output1, d_output2, CUFFT_INVERSE));
        gpuErrchk(cudaMemcpy(h_output2, d_output2, N*sizeof(float2), cudaMemcpyDeviceToHost));
        for (int i=0; i<N; i++) printf("Inverse transform - %d - (%f, %f)\n", i, h_output2[i].x, h_output2[i].y);

        cufftSafeCall(cufftDestroy(plan));

        gpuErrchk(cudaFree(d_input));
        gpuErrchk(cudaFree(d_output1));
        gpuErrchk(cudaFree(d_output2));

        return 0;
    }

EDIT

The "moment" the callback operation is performed is specified by CUFFT_CB_ST_COMPLEX in the call to cufftXtSetCallback. Notice that you can then have load and store callbacks with the same cuFFT plan.

Servetnick answered 7/10, 2014 at 21:37 Comment(1)

If you set the callback right after creating the plan, it looks like the scaling would occur on both the forward and inverse FFT, correct? So it's probably best to create a separate plans for the forward FFT (no scaling) and inverse FFT(with scaling). – Topgallant 15/11, 2016 at 22:38

PERFORMANCE

I'm adding a further answer to compare the callback performance with the non-callback version of the same code for this particular case of IFFT scaling. The code I'm using is

#include <stdio.h>
#include <assert.h>

#include "cuda_runtime.h"
#include "device_launch_parameters.h"

#include <cufft.h>
#include <cufftXt.h>

#include <thrust/device_vector.h>

#include "Utilities.cuh"
#include "TimingGPU.cuh"

//#define DISPLAY

/*******************************/
/* THRUST FUNCTOR IFFT SCALING */
/*******************************/
class Scale_by_constant
{
    private:
        float c_;

    public:
        Scale_by_constant(float c) { c_ = c; };

        __host__ __device__ float2 operator()(float2 &a) const
        {
            float2 output;

            output.x = a.x / c_;
            output.y = a.y / c_;

            return output;
        }

};

/**********************************/
/* IFFT SCALING CALLBACK FUNCTION */
/**********************************/
__device__ void IFFT_Scaling(void *dataOut, size_t offset, cufftComplex element, void *callerInfo, void *sharedPtr) {

    float *scaling_factor = (float*)callerInfo;

    float2 output;
    output.x = cuCrealf(element);
    output.y = cuCimagf(element);

    output.x = output.x / scaling_factor[0];
    output.y = output.y / scaling_factor[0];

    ((float2*)dataOut)[offset] = output;
}

__device__ cufftCallbackStoreC d_storeCallbackPtr = IFFT_Scaling;

/********/
/* MAIN */
/********/
int main() {

    const int N = 100000000;

    cufftHandle plan;           cufftSafeCall(cufftPlan1d(&plan, N, CUFFT_C2C, 1));

    TimingGPU timerGPU;

    float2 *h_input             = (float2*)malloc(N*sizeof(float2));
    float2 *h_output1           = (float2*)malloc(N*sizeof(float2));
    float2 *h_output2           = (float2*)malloc(N*sizeof(float2));

    float2 *d_input;            gpuErrchk(cudaMalloc((void**)&d_input, N*sizeof(float2)));
    float2 *d_output1;          gpuErrchk(cudaMalloc((void**)&d_output1, N*sizeof(float2)));
    float2 *d_output2;          gpuErrchk(cudaMalloc((void**)&d_output2, N*sizeof(float2)));

    // --- Callback function parameters
    float *h_scaling_factor     = (float*)malloc(sizeof(float));
    h_scaling_factor[0] = 16.0f;
    float *d_scaling_factor;    gpuErrchk(cudaMalloc((void**)&d_scaling_factor, sizeof(float)));
    gpuErrchk(cudaMemcpy(d_scaling_factor, h_scaling_factor, sizeof(float), cudaMemcpyHostToDevice));

    // --- Initializing the input on the host and moving it to the device
    for (int i = 0; i < N; i++) {
        h_input[i].x = 1.0f;
        h_input[i].y = 0.f;
    }
    gpuErrchk(cudaMemcpy(d_input, h_input, N * sizeof(float2), cudaMemcpyHostToDevice));

    // --- Execute direct FFT on the device and move the results to the host
    cufftSafeCall(cufftExecC2C(plan, d_input, d_output1, CUFFT_FORWARD));
#ifdef DISPLAY
    gpuErrchk(cudaMemcpy(h_output1, d_output1, N * sizeof(float2), cudaMemcpyDeviceToHost));
    for (int i=0; i<N; i++) printf("Direct transform - %d - (%f, %f)\n", i, h_output1[i].x, h_output1[i].y);
#endif

    // --- Execute inverse FFT with subsequent scaling on the device and move the results to the host
    timerGPU.StartCounter();
    cufftSafeCall(cufftExecC2C(plan, d_output1, d_output2, CUFFT_INVERSE));
    thrust::transform(thrust::device_pointer_cast(d_output2), thrust::device_pointer_cast(d_output2) + N, thrust::device_pointer_cast(d_output2), Scale_by_constant((float)(N)));
#ifdef DISPLAY
    gpuErrchk(cudaMemcpy(h_output2, d_output2, N * sizeof(float2), cudaMemcpyDeviceToHost));
    for (int i=0; i<N; i++) printf("Inverse transform - %d - (%f, %f)\n", i, h_output2[i].x, h_output2[i].y);
#endif
    printf("Timing NO callback %f\n", timerGPU.GetCounter());

    // --- Setup store callback
//    timerGPU.StartCounter();
    cufftCallbackStoreC h_storeCallbackPtr;
    gpuErrchk(cudaMemcpyFromSymbol(&h_storeCallbackPtr, d_storeCallbackPtr, sizeof(h_storeCallbackPtr)));
    cufftSafeCall(cufftXtSetCallback(plan, (void **)&h_storeCallbackPtr, CUFFT_CB_ST_COMPLEX, (void **)&d_scaling_factor));

    // --- Execute inverse callback FFT on the device and move the results to the host
    timerGPU.StartCounter();
    cufftSafeCall(cufftExecC2C(plan, d_output1, d_output2, CUFFT_INVERSE));
#ifdef DISPLAY
    gpuErrchk(cudaMemcpy(h_output2, d_output2, N * sizeof(float2), cudaMemcpyDeviceToHost));
    for (int i=0; i<N; i++) printf("Inverse transform - %d - (%f, %f)\n", i, h_output2[i].x, h_output2[i].y);
#endif
    printf("Timing callback %f\n", timerGPU.GetCounter());

    cufftSafeCall(cufftDestroy(plan));

    gpuErrchk(cudaFree(d_input));
    gpuErrchk(cudaFree(d_output1));
    gpuErrchk(cudaFree(d_output2));

    return 0;
}

For such large 1D arrays and simple processing (scaling), the timing on a Kepler K20c is the following

Non-callback 69.029762 ms
Callback     65.868607 ms

So, there is not much improvement. I expect that the improvement one sees is due to avoiding a separate kernel call in the non-callback case. For smaller 1D arrays, there is either no improvement or the non-callback case runs faster.

Servetnick answered 5/12, 2016 at 16:10 Comment(0)

Hot tags

Godot Unity Godot Help Programming Godot 4.X GUI GDScript 3D 2D Physics CSharp Godot 3.X VR XR Projects C++

Recommended topics

Hot tags