FFTW vs Matlab FFT
I posted this on Matlab Central but didn't get any responses, so I figured I'd repost here.

I recently wrote a simple routine in Matlab that uses an FFT in a for-loop; the FFT dominates the calculations. I wrote the same routine in mex, just for experimentation purposes, and it calls the FFTW 3.3 library. It turns out that the Matlab routine runs faster than the mex routine for very large arrays (about twice as fast). The mex routine uses wisdom and performs the same FFT calculations. I also know Matlab uses FFTW, but is it possible their version is slightly more optimized? I even used the FFTW_EXHAUSTIVE flag and it's still about twice as slow for large arrays as the Matlab counterpart. Furthermore, I ensured the Matlab I used was single threaded with the "-singleCompThread" flag, and the mex file I used was not in debug mode. Just curious if this is the case, or if there are some optimizations Matlab is using under the hood that I don't know about. Thanks.

Here's the mex portion:

void class_cg_toeplitz::analysis() {
// This method computes CG iterations using FFTs
    // Check for wisdom
    if(fftw_import_wisdom_from_filename("cd.wis") == 0) {
        mexPrintf("wisdom not loaded.\n");
    } else {
        mexPrintf("wisdom loaded.\n");
    }

    // Set FFTW Plan - use interleaved FFTW
    fftw_plan plan_forward_d_buffer;    
    fftw_plan plan_forward_A_vec;       
    fftw_plan plan_backward_Ad_buffer;
    fftw_complex *A_vec_fft;
    fftw_complex *d_buffer_fft;
    A_vec_fft = fftw_alloc_complex(n);
    d_buffer_fft = fftw_alloc_complex(n);

    // CREATE MASTER PLAN - Do this on an empty vector as creating a plan 
    // with FFTW_MEASURE will erase the contents; 
    // Use d_buffer
    // This is somewhat dangerous because Ad_buffer is a vector; but it does not
    // get resized so &Ad_buffer[0] should work
    plan_forward_d_buffer = fftw_plan_dft_r2c_1d(d_buffer.size(),&d_buffer[0],d_buffer_fft,FFTW_EXHAUSTIVE);
    plan_forward_A_vec = fftw_plan_dft_r2c_1d(A_vec.height,A_vec.value,A_vec_fft,FFTW_WISDOM_ONLY);
    // A_vec_fft.*d_buffer_fft will overwrite d_buffer_fft
    plan_backward_Ad_buffer = fftw_plan_dft_c2r_1d(Ad_buffer.size(),d_buffer_fft,&Ad_buffer[0],FFTW_EXHAUSTIVE);

    // Get A_vec_fft
    fftw_execute(plan_forward_A_vec);

    // Find initial direction - this is the initial residual
    for (int i=0;i<n;i++) {
        d_buffer[i] = b.value[i];
        r_buffer[i] = b.value[i];
    }    

    // Start CG iterations
    norm_ro = norm(r_buffer);
    double fft_reduction = (double)Ad_buffer.size(); // Must divide by size of vector because inverse FFT does not do this
    while (norm(r_buffer)/norm_ro > relativeresidual_cutoff) {        
        // Find Ad - use fft
        fftw_execute(plan_forward_d_buffer);    
        // Get A_vec_fft.*fft(d) - A_vec_fft is only real, but d_buffer_fft
        // has complex elements; Overwrite d_buffer_fft        
        for (int i=0;i<n;i++) {
            d_buffer_fft[i][0] = d_buffer_fft[i][0]*A_vec_fft[i][0]/fft_reduction;
            d_buffer_fft[i][1] = d_buffer_fft[i][1]*A_vec_fft[i][0]/fft_reduction;
        }        
        fftw_execute(plan_backward_Ad_buffer); 

        // Calculate r'*r
        rtr_buffer = 0;
        for (int i=0;i<n;i++) {
            rtr_buffer = rtr_buffer + r_buffer[i]*r_buffer[i];
        }    

        // Calculate alpha
        alpha = 0;
        for (int i=0;i<n;i++) {
            alpha = alpha + d_buffer[i]*Ad_buffer[i];
        }    
        alpha = rtr_buffer/alpha;

        // Calculate new x
        for (int i=0;i<n;i++) {
            x[i] = x[i] + alpha*d_buffer[i];
        }   

        // Calculate new residual
        for (int i=0;i<n;i++) {
            r_buffer[i] = r_buffer[i] - alpha*Ad_buffer[i];
        }   

        // Calculate beta
        beta = 0;
        for (int i=0;i<n;i++) {
            beta = beta + r_buffer[i]*r_buffer[i];
        }  
        beta = beta/rtr_buffer;

        // Calculate new direction vector
        for (int i=0;i<n;i++) {
            d_buffer[i] = r_buffer[i] + beta*d_buffer[i];
        }  

        *total_counter = *total_counter+1;
        if(*total_counter >= iteration_cutoff) {
            // Set total_counter to -1, this indicates failure
            *total_counter = -1;
            break;
        }
    }

    // Store Wisdom
    fftw_export_wisdom_to_filename("cd.wis");

    // Free fft alloc'd memory and plans
    fftw_destroy_plan(plan_forward_d_buffer);
    fftw_destroy_plan(plan_forward_A_vec);
    fftw_destroy_plan(plan_backward_Ad_buffer);
    fftw_free(A_vec_fft);
    fftw_free(d_buffer_fft);
};

Here's the matlab portion:

% Take FFT of A_vec.
A_vec_fft = fft(A_vec); % Take fft once

% Find initial direction - this is the initial residual 
x = zeros(n,1); % search direction
r = zeros(n,1); % residual
d = zeros(n+(n-2),1); % search direction; pad to allow FFT
for i = 1:n
    d(i) = b(i); 
    r(i) = b(i); 
end

% Enter CG iterations
total_counter = 0;
rtr_buffer = 0;
alpha = 0;
beta = 0;
Ad_buffer = zeros(n+(n-2),1); % This holds the product of A*d - calculate this once per iteration and using FFT; only 1:n is used
norm_ro = norm(r);

while(norm(r)/norm_ro > 10^-6)
    % Find Ad - use fft
    Ad_buffer = ifft(A_vec_fft.*fft(d)); 

    % Calculate rtr_buffer
    rtr_buffer = r'*r;

    % Calculate alpha    
    alpha = rtr_buffer/(d(1:n)'*Ad_buffer(1:n));

    % Calculate new x
    x = x + alpha*d(1:n);

    % Calculate new residual
    r = r - alpha*Ad_buffer(1:n);

    % Calculate beta
    beta = r'*r/(rtr_buffer);

    % Calculate new direction vector
    d(1:n) = r + beta*d(1:n);      

    % Update counter
    total_counter = total_counter+1; 
end

In terms of time, for N = 50000 and b = 1:n it takes about 10.5 seconds with mex and 4.4 seconds with Matlab. I'm using R2011b. Thanks

Photoelectrotype answered 8/3, 2013 at 19:5 Comment(15)
What are the dimensions of your data, and what are the absolute times? – Rub
Are they both in-place FFTs? – Nusku
You could run your Matlab code with the profiler turned on to get more detailed information on the time spent in each function (in percent); this could give a hint where Matlab is optimized – Cenogenesis
@Damage I ran the profiler on the matlab portion; almost all of it is spent on the FFT. I also ran valgrind on the mex and basically all of it is spent on the FFT as well. Of the 4.4 seconds for the matlab portion, the profiler says 4 seconds are spent on the FFT in matlab. For the mex portion, valgrind says 84.99% is spent on fftw_execute. – Photoelectrotype
Interesting. I just checked: Matlab also has an fftw command which allows you to control the optimization parameters used internally for the FFTW lib (-> help fftw). With this command you can also get the wisdom database Matlab has been using for its computations. It would be interesting what results you get when you feed Matlab's wisdom database to your C++ program and vice versa... – Cenogenesis
I also discovered that Matlab uses a non-deterministic number of iterations to converge. For n = 1000, it takes between 83-85 iterations... @damage Also, my Matlab FFTW version and the version I installed are different (3.3.3 vs 3.2.2), so the wisdoms aren't compatible from what I tried. – Photoelectrotype
In Matlab's bin/<PLATFORM> you can find the file 'fftw.spec', which specifies different libraries for different CPUs - so I'd say the libraries are specially optimized. – Menchaca
Are you using the same residual criteria for both? I see 10e-6 in the matlab script but only relativeresidual_cutoff in the mex, whose definition isn't shown. – Rockbottom
I think you need to build your fftw based on a multi-threaded LAPACK. – Cricket
@aircooled I checked bin/<platform> and there weren't any files named fftw.spec. @CaptainMurphy Both cutoff criteria are 10e-6. @iampat I wanted a direct comparison with single-threaded performance. Matlab still outperforms mex even with single-threaded performance. – Photoelectrotype
This may be trivial and irrelevant, but I do notice 2 calls to fftw_execute() in the mex code but only 1 in the matlab. I assume there's something obvious I'm missing, but I thought I'd comment. – Transmundane
@DanNissenbaum In matlab: Ad_buffer = ifft(A_vec_fft.*fft(d)); This line is two calls. The first is a forward fft and the second is the inverse. That's what's being done in the mex version. – Photoelectrotype
Do both datasets have a length equal to a power of 2? Because that will speed up the FFT calculation. – Unite
@jucestain: you might be interested to know that fft just got more performance improvements in the latest MATLAB version R2013a (on CPUs that support the AVX instruction set): mathworks.com/help/matlab/release-notes.html#btsiwqu-1 – Walkling
@Walkling Very interesting. I'm reading about assembly now, so hopefully I'll be able to understand the AVX stuff in the near future. Maybe this also suggests Matlab implements improvements themselves outside of FFTW. – Photoelectrotype

A few observations rather than a definite answer since I do not know any of the specifics of the MATLAB FFT implementation:

  • Based on the code you have, I can see two explanations for the speed difference:
    • the speed difference is explained by differences in levels of optimization of the FFT
    • the while loop in MATLAB is executed a significantly smaller number of times

I will assume you have already looked into the second issue and that the numbers of iterations are comparable. (If they aren't, this is most likely due to accuracy issues and worth further investigation.)

Now, regarding FFT speed comparison:

  • Yes, in theory FFTW is faster than other high-level FFT implementations, but that is only relevant as long as you compare apples to apples: here you are comparing implementations at a level further down, at the assembly level, where not only the selection of the algorithm but its actual optimization for a specific processor, by software developers with varying skills, comes into play
  • I have optimized or reviewed optimized FFTs in assembly on many processors over the years (I was in the benchmarking industry), and great algorithms are only part of the story. There are considerations that are very specific to the architecture you are coding for (accounting for latencies, scheduling of instructions, optimization of register usage, arrangement of data in memory, accounting for branch taken/not-taken latencies, etc.) that make differences as important as the selection of the algorithm.
  • With N=500000, we are also talking about large memory buffers: yet another door for more optimizations that can quickly get pretty specific to the platform you run your code on: how well you manage to avoid cache misses won't be dictated by the algorithm so much as by how the data flows and what optimizations a software developer may have used to bring data in and out of memory efficiently.
  • Though I do not know the details of the MATLAB FFT implementation, I am pretty sure that an army of DSP engineers has been (and still is) honing its optimization, as it is key to so many designs. This could very well mean that MATLAB had the right combination of developers to produce a much faster FFT.
Ow answered 13/3, 2013 at 2:9 Comment(8)
Lolo, the crux of the issue is that MATLAB implements FFTW. Strangely enough, the number of iterations until convergence in the MATLAB routine appears to be non-deterministic. For N = 1000 it takes 83-85, whereas the mex version is constant (85 for N = 1000, if I recall correctly). At this point I've sort of just concluded Matlab must be doing something "under the hood" that I'm not aware of... Either that or my mex implementation is slower because I missed an optimization somewhere. Not sure. – Photoelectrotype
@jucestain Everything you say points to the same conclusion to me: 83-85 vs. 85 means that the FFT performance alone explains the difference, and so does your 90% vs. 84.99% profiling data. The MATLAB implementation is simply better optimized, which is plausible with an algorithm like that, with so many opportunities for optimization at every stage. I wouldn't qualify that as "under the hood" tricks, but just time well spent in creating an FFT implementation that is optimized at a better level than the MEX counterpart you use. I do not think you are missing anything in your mex implementation. – Ow
@jucestain As Lolo is saying, nothing is happening "under the hood"; Matlab has a better optimized implementation from the MKL. Check my answer; it answers your question... – Noto
@Noto Are you sure MATLAB uses MKL for their fft (if this is the case I'll accept your answer)? I thought MATLAB uses FFTW. They don't explicitly state what they use in their documentation for fft, but they have citations to FFTW.org and to the FFTW paper. They also have a function "fftw" that allows you to mess with wisdom. However, it's possible they use MKL for their fft, perhaps for people with Intel processors (I'm using an i7). I know FFTW uses codelets for different sized FFTs. Maybe it's possible that Matlab's codelets are more optimized than the ones provided by fftw.org. – Photoelectrotype
The plot thickens: software.intel.com/en-us/articles/…. – Photoelectrotype
@jucestain Yes, I'm pretty sure; will check on Monday to be absolutely sure (I only have Matlab at my office)... And the article from Intel that you provide just states that Intel MKL supports the FFTW interfaces, but the underlying implementation is specific to Intel, and they know how their own processors work very well, so they optimize very efficiently. Better than the FFTW developers can. I really think that explains the performance difference. – Noto
Just an FYI, I would have awarded the full bounty but this answer is still just speculation... – Photoelectrotype
@jucestain I did more research and it seems like my answer is completely wrong... :) My research skills are sadly too limited to understand everything with 100% certainty, but it seems like Matlab has its own implementation of fftw in libmwfftw.dll, which does not seem to rely on libfftw3.dll (which is packed with Matlab anyway...). This might seem strange, but as far as I can tell, I see no dependency on libfftw3.dll... So Lolo might be right, or damage is right, and it's the Matlab wisdom which speeds up the code. Have you tried to deactivate the Matlab wisdom with fftw('wisdom', [])? – Noto

This is a classic performance gain thanks to low-level and architecture-specific optimization.

Matlab uses the FFT from the Intel MKL (Math Kernel Library) binary (mkl.dll). These are routines optimized (at the assembly level) by Intel for Intel processors. Even on AMDs it seems to give nice performance boosts.

FFTW seems like a normal C library that is not as optimized. Hence the performance gain from using the MKL.

Noto answered 14/3, 2013 at 9:42 Comment(4)
MATLAB ships with its own build of the open-source FFTW library, compiled with multithreading support and SSE/AVX vectorized instructions. Calling version('-fftw') shows FFTW-3.3.3-sse2-avx. There are two shared libraries in the MATLAB bin folder that export the FFTW API interface: libmwfftw3.dll and libmwfftw3f.dll (in addition to a third lib, libmwmfl_fft.dll, built on top of the previous two and intended to abstract the use of FFTW plans). So even though MATLAB uses Intel MKL as its optimized BLAS/LAPACK implementation, it is not calling the FFTW interface from MKL as far as I can tell. – Walkling
@Walkling Thanks for the clarification! BTW, just to know, how did you find out that these two binaries export the FFTW API interface? And do you know what the difference between the two binaries is? In my R2010a I only have one libmwfftw.dll library... – Noto
I simply use Dependency Walker to get a list of the functions exported by any DLL (you'll see familiar functions like fftw_plan_dft_1d, fftw_execute, etc.). The first DLL corresponds to the double-precision version of FFTW, the second one is the single-precision version (I have the latest MATLAB R2014a). I forgot to say that there are also two other DLL files implementing the distributed-memory parallel version of FFTW, which uses MPI (look for libmwfftw3_mpi.dll and libmwfftw3f_mpi.dll) – Walkling
Also, if you have the PCT toolbox, fft can run on the GPU, which is implemented using the cuFFT library (look for the cufft*.dll file) – Walkling

I have found the following comment on the MathWorks website [1]:

Note on large powers of 2: For FFT dimensions that are powers of 2, between 2^14 and 2^22, MATLAB software uses special preloaded information in its internal database to optimize the FFT computation. No tuning is performed when the dimension of the FFT is a power of 2, unless you clear the database using the command fftw('wisdom', []).

Although it relates to powers of 2, it may hint that MATLAB employs its own 'special wisdom' when using FFTW for certain (large) array sizes. Consider: 2^16 = 65536.

[1] R2013b Documentation available from http://www.mathworks.de/de/help/matlab/ref/fftw.html (accessed on 29 Oct 2013)

Reverential answered 29/10, 2013 at 12:5 Comment(0)

EDIT: @wakjah's reply to this answer is accurate: FFTW does support split real and imaginary memory storage via its Guru interface. My claim about hacking is thus not accurate, but it can very well apply if FFTW's Guru interface is not used - which is the case by default, so beware still!

First, sorry for being a year late. I'm not convinced that the speed increase you see comes from MKL or other optimizations. There is something quite fundamentally different between FFTW and Matlab, and that is how complex data is stored in memory.

In Matlab, the real and imaginary parts of a complex vector X are separate arrays Xre[i] and Xim[i] (linear in memory, efficient when operating on either of them separately).

In FFTW, the real and imaginary parts are interleaved as double[2] by default, i.e. X[i][0] is the real part and X[i][1] is the imaginary part.

Thus, to use the FFTW library in mex files one cannot use the Matlab array directly, but must allocate new memory first, then pack the input from Matlab into FFTW format, and then unpack the output from FFTW into Matlab format. i.e.

fftw_complex *X = (fftw_complex*) fftw_malloc(sizeof(fftw_complex) * N);
fftw_complex *Y = (fftw_complex*) fftw_malloc(sizeof(fftw_complex) * N);

then

for (size_t i=0; i<N; ++i) {
    X[i][0] = Xre[i];
    X[i][1] = Xim[i];
}

then

for (size_t i=0; i<N; ++i) {
    Yre[i] = Y[i][0];
    Yim[i] = Y[i][1];
}

Hence, this requires 2x memory allocations + 4x reads + 4x writes -- all of size N. This does take a toll speed-wise on large problems.

I have a hunch that Mathworks may have hacked the FFTW3 code to allow it to read input vectors directly in the Matlab format, which avoids all of the above.

In this scenario, one can allocate only X and use X for Y as well, running FFTW in-place (fftw_plan_*(N, X, X, ...) instead of fftw_plan_*(N, X, Y, ...)), since the output will be unpacked into the Yre and Yim Matlab vectors anyway, unless the application requires/benefits from keeping X and Y separate.

EDIT: Looking at the memory consumption in real time when running Matlab's fft2() and my code based on the fftw3 library, it shows that Matlab allocates only one additional complex array (the output), whereas my code needs two such arrays (the fftw_complex* buffer plus the Matlab output). An in-place conversion between the Matlab and FFTW formats is not possible because Matlab's real and imaginary arrays are not consecutive in memory. This suggests that Mathworks hacked the fftw3 library to read/write the data using the Matlab format.

One other optimization for multiple calls is to allocate memory persistently (using mexMakeMemoryPersistent()). I'm not sure if the Matlab implementation does this as well.

Cheers.

p.s. As a side note, the Matlab complex data storage format is more efficient for operating on the real or imaginary vectors separately. With FFTW's format you'd have to do stride-2 memory reads.

Arlindaarline answered 17/10, 2014 at 4:29 Comment(2)
Except that the FFTW Guru interface supports split real and imaginary arrays - i.e., the same as the MATLAB format - no hacking required. – Pharmacopsychosis
@Pharmacopsychosis I stand corrected, +1 and thanks! I edited my answer to reflect your reply. – Arlindaarline
