CUDA cufft 2D example

I am currently working on a program that has to implement a 2D FFT (for cross-correlation). I did a 1D FFT with CUDA which gave me the correct results; I am now trying to implement a 2D version. With few examples and little documentation online, I find it hard to work out what the error is.

So far I have been using only the cuFFT manual.

Anyway, I have created two 5x5 arrays and filled them with 1s. I have copied them to the GPU, done a forward FFT on each, multiplied them, and then done an inverse FFT on the result. This gives me a 5x5 array with values 650. I would expect to get a DC signal with the value 25 in only one slot of the 5x5 array. Instead I get 650 in the entire array.

Furthermore, I am not able to print out the value of the signal after it has been copied to GPU memory. Writing

cout << d_signal[1].x << endl;

gives me an access violation. I have done the same thing in other CUDA programs, where this has not been an issue. Does it have something to do with how the complex type works, or is it human error?

If anyone has any pointers to what is going wrong, I would greatly appreciate it. Here is the code:

   #include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <helper_functions.h>
#include <helper_cuda.h>

#include <ctime>
#include <time.h>
#include <stdio.h>
#include <iostream>
#include <math.h>
#include <cufft.h>
#include <fstream>

using namespace std;
typedef float2 Complex;





__global__ void ComplexMUL(Complex *a, Complex *b)
{
    int i = threadIdx.x;
    a[i].x = a[i].x * b[i].x - a[i].y*b[i].y;
    a[i].y = a[i].x * b[i].y + a[i].y*b[i].x;
}


int main()
{


    int N = 5;
    int SIZE = N*N;


    Complex *fg = new Complex[SIZE];
    for (int i = 0; i < SIZE; i++){
        fg[i].x = 1; 
        fg[i].y = 0;
    }
    Complex *fig = new Complex[SIZE];
    for (int i = 0; i < SIZE; i++){
        fig[i].x = 1;
        fig[i].y = 0;
    }
    for (int i = 0; i < 24; i=i+5)
    {
        cout << fg[i].x << " " << fg[i + 1].x << " " << fg[i + 2].x << " " << fg[i + 3].x << " " << fg[i + 4].x << endl;
    }
    cout << "----------------" << endl;
    for (int i = 0; i < 24; i = i + 5)
    {
        cout << fig[i].x << " " << fig[i + 1].x << " " << fig[i + 2].x << " " << fig[i + 3].x << " " << fig[i + 4].x << endl;
    }
    cout << "----------------" << endl;

    int mem_size = sizeof(Complex)* SIZE;


    cufftComplex *d_signal;
    checkCudaErrors(cudaMalloc((void **) &d_signal, mem_size)); 
    checkCudaErrors(cudaMemcpy(d_signal, fg, mem_size, cudaMemcpyHostToDevice));

    cufftComplex *d_filter_kernel;
    checkCudaErrors(cudaMalloc((void **)&d_filter_kernel, mem_size));
    checkCudaErrors(cudaMemcpy(d_filter_kernel, fig, mem_size, cudaMemcpyHostToDevice));

    // cout << d_signal[1].x << endl;
    // CUFFT plan
    cufftHandle plan;
    cufftPlan2d(&plan, N, N, CUFFT_C2C);

    // Transform signal and filter
    printf("Transforming signal cufftExecR2C\n");
    cufftExecC2C(plan, (cufftComplex *)d_signal, (cufftComplex *)d_signal, CUFFT_FORWARD);
    cufftExecC2C(plan, (cufftComplex *)d_filter_kernel, (cufftComplex *)d_filter_kernel, CUFFT_FORWARD);

    printf("Launching Complex multiplication<<< >>>\n");
    ComplexMUL <<< 32, 256 >>>(d_signal, d_filter_kernel);

    // Transform signal back
    printf("Transforming signal back cufftExecC2C\n");
    cufftExecC2C(plan, (cufftComplex *)d_signal, (cufftComplex *)d_signal, CUFFT_INVERSE);

    Complex *result = new Complex[SIZE];
    cudaMemcpy(result, d_signal, sizeof(Complex)*SIZE, cudaMemcpyDeviceToHost);

    for (int i = 0; i < SIZE; i=i+5)
    {
        cout << result[i].x << " " << result[i + 1].x << " " << result[i + 2].x << " " << result[i + 3].x << " " << result[i + 4].x << endl;
    }

    delete result, fg, fig;
    cufftDestroy(plan);
    //cufftDestroy(plan2);
    cudaFree(d_signal);
    cudaFree(d_filter_kernel);

}

The above code gives the following terminal output:

1 1 1 1 1
1 1 1 1 1
1 1 1 1 1
1 1 1 1 1
1 1 1 1 1
----------------
1 1 1 1 1
1 1 1 1 1
1 1 1 1 1
1 1 1 1 1
1 1 1 1 1
----------------
Transforming signal cufftExecR2C
Launching Complex multiplication<<< >>>
Transforming signal back cufftExecC2C

625 625 625 625 625
625 625 625 625 625
625 625 625 625 625
625 625 625 625 625
625 625 625 625 625
Clovis asked 27/4, 2016 at 12:6 Comment(2)
The code you have posted is incomplete and can't be compiled. Could you please fix that? It is very hard to tell you what might be wrong without compiling and running the code, and I can't do that right now. – Embody
Sure thing, I had some uncommented sections I did not want to include. I have removed them and edited everything into my post now. – Clovis

This gives me a 5x5 array with values 650: it actually reads 625, which is 5*5*5*5. The convolution algorithm you are using requires a supplemental division by N*N. Indeed, in cuFFT there is no normalization coefficient in the forward transform, so your convolution cannot be just the element-wise product of the two fields in the frequency domain. (Some would call it the mathematician's DFT rather than the physicist's DFT.)
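
A minimal host-side sketch of that extra normalization (assuming, as in the question's code, that the inverse transform has already been copied back into result):

// cuFFT scales the forward + inverse round trip by the element count (N*N here),
// so divide the copied-back result by N*N to get the expected convolution values
for (int i = 0; i < SIZE; i++)
{
    result[i].x /= (float)(N * N);
    result[i].y /= (float)(N * N);
}
// with the all-ones inputs, every entry then reads 25 instead of 625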

Furthermore, I am not able to print out the value of the signal after it has been copied to GPU memory: this is standard CUDA behavior. When memory is allocated on the device, the data lives in the device memory address space and cannot be accessed by the CPU without additional effort. Search for managed memory or zero-copy to have the data accessible from both sides of PCI Express (this is discussed in many other posts).
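
For illustration only, a rough sketch of the managed-memory route (assuming the device and CUDA version support cudaMallocManaged, and that the host waits for the GPU before reading):

cufftComplex *d_signal;
checkCudaErrors(cudaMallocManaged((void **)&d_signal, mem_size)); // visible to both host and device
memcpy(d_signal, fg, mem_size);                                   // plain host copy into the managed buffer
// ... run the cuFFT plans and the multiplication kernel as before ...
checkCudaErrors(cudaDeviceSynchronize());                         // wait for the GPU to finish
cout << d_signal[1].x << endl;                                    // now legal: managed memory is host-visible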

Iconoduly answered 27/4, 2016 at 13:16 Comment(1)
Thanks for your reply Florent, I appreciate it. This has helped me a lot! – Clovis

There are several problems here:

  1. You are launching far too many threads for the size of the input arrays in the multiplication kernel, so that should be failing with out-of-bounds memory errors. I am surprised you are not receiving any sort of runtime error.
  2. Your expected solution from the fft/fft - dot product - ifft sequence is, I believe, incorrect. The correct solution would be a 5x5 matrix with 25 in each entry.
  3. As clearly described in the cuFFT documentation, the library performs unnormalised FFTs:

    cuFFT performs un-normalized FFTs; that is, performing a forward FFT on an input data set followed by an inverse FFT on the resulting set yields data that is equal to the input, scaled by the number of elements. Scaling either transform by the reciprocal of the size of the data set is left for the user to perform as seen fit.

So by my reckoning, the correct output solution for your code should be a 5x5 matrix with 625 in each entry, which would be normalised to a 5x5 matrix with 25 in each entry, ie. the expected result. I don't understand how the problem at (1) isn't producing different results as the multiplication kernel should be failing.
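
To make (1) concrete, here is a sketch of a bounds-checked kernel and a matching launch configuration (my own variable names, not tested against the rest of the posted code):

__global__ void ComplexMUL(Complex *a, const Complex *b, int size)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= size) return;                  // surplus threads do nothing
    float ax = a[i].x;                      // keep the original real part for the second line
    a[i].x = ax * b[i].x - a[i].y * b[i].y;
    a[i].y = ax * b[i].y + a[i].y * b[i].x;
}

// launch just enough threads to cover the SIZE = N*N elements
int threads = 32;
int blocks = (SIZE + threads - 1) / threads;
ComplexMUL<<<blocks, threads>>>(d_signal, d_filter_kernel, SIZE);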

TL;DR: nothing to see here, move along...

Embody answered 27/4, 2016 at 12:6 Comment(4)
Accessing a valid memory region on the GPU, even if not allocated, does not necessarily issue an error outside of cuda-memcheck tests. The kernel does not necessarily fail for that small overflow. All your points remain valid though. – Iconoduly
@FlorentDUGUET: the input array is 25 double words. The kernel launch is using 256 threads per block. When I ran it (and yes, I did run it), it produced literally hundreds of invalid memory access errors in cuda-memcheck. – Embody
Thanks for your reply talonmies, I appreciate it. This has helped me a lot! – Clovis
@Embody I do believe that you had hundreds of invalid memory access errors (I said it does not necessarily issue an error outside of cuda-memcheck). Yet the kernel does not fail. I don't know whether it relates in any way to the current status of sharing a GPU across multiple VMs with CUDA, not just OpenGL. – Iconoduly

Just as an addition to the other things already mentioned: I think your complex multiplication kernel is not doing the right thing. You are overwriting a[i].x in the first line and then using the new value of a[i].x in the second line to calculate a[i].y. I think you need to save a backup of a[i].x before you overwrite it, something like:

float aReal_bk = a[i].x;
a[i].x = a[i].x * b[i].x - a[i].y * b[i].y;
a[i].y = aReal_bk * b[i].y + a[i].y * b[i].x;
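
Alternatively, cuComplex.h in the CUDA toolkit provides cuCmulf, which takes both operands by value and so avoids the aliasing problem entirely; a minimal sketch along those lines:

#include <cuComplex.h>

__global__ void ComplexMUL(cufftComplex *a, const cufftComplex *b)
{
    int i = threadIdx.x;
    a[i] = cuCmulf(a[i], b[i]); // both operands are read before a[i] is written
}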
Solange answered 26/9, 2021 at 6:6 Comment(0)
