Converting an FFT to a spectrogram

Asked 5/11, 2009 at 11:33 Answered 17/12, 2011 at 0:30

I have an audio file and I am iterating through the file and taking 512 samples at each step and then passing them through an FFT.

I have the data out as a block 514 floats long (Using IPP's ippsFFTFwd_RToCCS_32f_I) with real and imaginary components interleaved.

My problem is what do I do with these complex numbers once I have them? At the moment I'm doing this for each value:

const float realValue   = buffer[(y * 2) + 0];
const float imagValue   = buffer[(y * 2) + 1];
const float value       = sqrt( (realValue * realValue) + (imagValue * imagValue) );

This gives something slightly usable, but I'd rather have some way of getting the values out in the range 0 to 1. The problem with the above is that the peaks end up coming back as around 9 or more. This means things get viciously saturated, and then there are other parts of the spectrogram that barely show up despite the fact that they appear to be quite strong when I run the audio through Audition's spectrogram. I fully admit I'm not 100% sure what the data returned by the FFT is (other than that it represents the frequency values of the 512-sample block I'm passing in). In particular, I don't really understand what exactly the complex number represents.

Any advice and help would be much appreciated!

Edit: Just to clarify. My big problem is that the FFT values returned are meaningless without some idea of what the scale is. Can someone point me towards working out that scale?

Edit2: I get really nice looking results by doing the following:

size_t count2   = 0;
size_t max2     = kFFTSize + 2;
while( count2 < max2 )
{
    const float realValue   = buffer[(count2) + 0];
    const float imagValue   = buffer[(count2) + 1];
    const float value   = (log10f( sqrtf( (realValue * realValue) + (imagValue * imagValue) ) * rcpVerticalZoom ) + 1.0f) * 0.5f;
    buffer[count2 >> 1] = value;
    count2 += 2;
}

To my eye this even looks better than most other spectrogram implementations I have looked at.

Is there anything MAJORLY wrong with what I'm doing?

Ike answered 5/11, 2009 at 11:33 Comment(2)

You're doing the right thing in getting the magnitude of the complex number. You just need to find out the scale of these (complex) numbers (0-1, 0-255, ..?), see the docs of your FFT function for that. If the range is too big for your liking, taking a log() of the magnitude should help, as suggested below. – Kayseri 5/11, 2009 at 14:21

Probably not important to your usage, but you could also normalize the frequency domain values (ie. the values you get from the FFT) by dividing them by the FFT width. (ie. the wider your FFT is, the larger the values in the various frequency buckets will be) – Snooker 7/1, 2010 at 21:4

The usual thing to do to get all of an FFT visible is to take the logarithm of the magnitude.

So, the position in the output buffer tells you what frequency was detected. The magnitude (L2 norm) of the complex number tells you how strong the detected frequency was, and the phase (arctangent) gives you information that is a lot more important in image space than audio space. Because the FFT is discrete, the frequencies run from 0 to the Nyquist frequency. In images, the first term (DC) is usually the largest, and so a good candidate for use in normalization if that is your aim. I don't know if that is also true for audio (I doubt it).
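As a rough illustration of the magnitude and phase described above, here is a minimal sketch. It assumes the buffer holds interleaved (re, im) pairs as in the question; `BinPolar` and `binToPolar` are made-up names for this example, not part of any library.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

struct BinPolar {
    float magnitude; // strength of the frequency component
    float phase;     // angle in radians, in [-pi, pi]
};

// Assumes `buffer` holds interleaved pairs: re0, im0, re1, im1, ...
BinPolar binToPolar(const float* buffer, std::size_t bin)
{
    assert(buffer != nullptr);
    const float re = buffer[bin * 2 + 0];
    const float im = buffer[bin * 2 + 1];
    // hypot computes sqrt(re*re + im*im); atan2 gives the phase angle.
    return { std::hypot(re, im), std::atan2(im, re) };
}
```

For a spectrogram you would normally keep only `magnitude` and ignore `phase`.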

Ottavia answered 5/11, 2009 at 14:28 Comment(6)

Interesting response. Just note that in audio, there is normally no DC value (it would destroy your speakers if let through your amp), it's purely AC. – Kayseri 5/11, 2009 at 14:29

Anyway, looking for the maximum value is a pretty short operation (compared to the FFT). – District 5/11, 2009 at 14:31

ditto on using log scale (and finding the maximum) – Lard 5/11, 2009 at 14:36

@Kayseri I'm glad to hear my intuition isn't completely haywire. – Ottavia 5/11, 2009 at 14:39

In audio, the phase is important for being able to get back from the spectrum to the original signal, ie. that's why you are not able to reconstruct the original signal from a spectrum only. But that's not what you typically do with spectrum :) – Comfort 5/11, 2009 at 15:12

Well log10( sqrt( real^2 + imag^2 ) ) definitely gives nicer looking results ... – Ike 5/11, 2009 at 15:15

For each window of 512 samples, you compute the magnitude of the FFT as you did. Each value represents the magnitude of the corresponding frequency present in the signal.

mag
 /\
 |
 |      !         !
 |      !    !    !
 +--!---!----!----!---!--> freq
 0          Fs/2      Fs

Now we need to figure out the frequencies.

Since the input signal is real-valued, the FFT magnitude is symmetric around the middle (the Nyquist component), with the first term being the DC component. Given the signal's sampling frequency Fs, the Nyquist frequency is Fs/2, and the frequency corresponding to index k is k*Fs/512.

So for each window of length 512, we get the magnitudes at the specified frequencies. The set of those over consecutive windows forms the spectrogram.
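The index-to-frequency mapping above can be sketched as follows. The function name `binFrequencyHz` is hypothetical, and the sample rate is whatever your audio file uses (44100 Hz below is only an example value).

```cpp
#include <cassert>
#include <cstddef>

// Centre frequency in Hz of bin k for an fftSize-point transform: k * Fs / N.
// Only bins 0..N/2 are unique for real input, hence the assertion.
double binFrequencyHz(std::size_t k, double sampleRateHz, std::size_t fftSize)
{
    assert(fftSize > 0 && k <= fftSize / 2);
    return static_cast<double>(k) * sampleRateHz / static_cast<double>(fftSize);
}
```

With a 512-point FFT at 44100 Hz, `binFrequencyHz(1, 44100.0, 512)` is the spacing between bins and `binFrequencyHz(256, 44100.0, 512)` is the Nyquist frequency.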

Phatic answered 5/11, 2009 at 15:9 Comment(0)

Just so people know I've done a LOT of work on this whole problem. The main thing I've discovered is that the FFT requires normalisation after doing it.

To do this you average all the values of your window vector together to get a value somewhat less than 1 (or 1 if you are using a rectangular window). You then multiply that number by the number of frequency bins you have after the FFT.

Finally you divide the actual number returned by the FFT by the normalisation number. Your amplitude values should now be in the 0 to 1 range (or -Inf to 0 once you take the log). Log, etc, as you please. You will still be working with a known range.
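A rough sketch of one common way to compute such a normalisation number, assuming the forward FFT applies no scaling of its own (check your library's flags, since some can divide by N for you). Under that assumption a full-scale sine centred on a bin comes out with magnitude sum(window)/2, so dividing by that value maps it to 1. The Hann window and the function names here are illustrative choices, not taken from the post.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Periodic Hann window of length n (an illustrative choice of window).
std::vector<float> makeHannWindow(std::size_t n)
{
    assert(n > 0);
    const double twoPi = 6.283185307179586;
    std::vector<float> w(n);
    for (std::size_t i = 0; i < n; ++i)
        w[i] = static_cast<float>(0.5 - 0.5 * std::cos(twoPi * i / n));
    return w;
}

// Magnitude that an amplitude-1 sine centred on a bin produces in an
// unscaled forward FFT of the windowed block: sum(w) / 2 = mean(w) * N / 2.
float fullScaleMagnitude(const std::vector<float>& window)
{
    double sum = 0.0;
    for (float w : window)
        sum += w;
    return static_cast<float>(sum / 2.0);
}
```

You would multiply each block by the window before the FFT, then divide every bin magnitude by `fullScaleMagnitude(window)`.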

Ike answered 17/12, 2011 at 0:30 Comment(0)

There are a few things that I think you will find helpful.

The forward FT will tend to give larger numbers in the output than in the input. You can think of it as all of the intensity at a certain frequency being displayed at one place rather than being distributed through the dataset. Does this matter? Probably not because you can always scale the data to fit your needs. I once wrote an integer based FFT/IFFT pair and each pass required rescaling to prevent integer overflow.

The real data that are your input are converted into something that is almost complex. As it turns out buffer[0] and buffer[n/2] are real and independent. There is a good discussion of it here.

The input data are sound intensity values taken over time, equally spaced. They are said to be, appropriately enough, in the time domain. The output of the FT is said to be in the frequency domain because the horizontal axis is frequency. The vertical scale remains intensity. Although it isn't obvious from the input data, there is phase information in the input as well. Although all of the sound is sinusoidal, there is nothing that fixes the phases of the sine waves. This phase information appears in the frequency domain as the phases of the individual complex numbers, but often we don't care about it (and often we do too!). It just depends upon what you are doing. The calculation

const float value = sqrt((realValue * realValue) + (imagValue * imagValue));

retrieves the intensity information but discards the phase information. Taking the logarithm essentially just dampens the big peaks.
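If you want the log-scaled value to land in a known 0 to 1 range, one possible approach is to pick a reference magnitude and a dynamic range in decibels and clamp to them. This is only a sketch: `magnitudeToUnit`, `reference` and `dynamicRangeDb` are names invented for this example, and the clamp exists so that a magnitude of zero never reaches the logarithm.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// Maps a linear magnitude to [0, 1] on a decibel scale.
// `reference` is the magnitude displayed as 1.0; anything more than
// `dynamicRangeDb` below it is displayed as 0.0.
float magnitudeToUnit(float magnitude, float reference, float dynamicRangeDb)
{
    assert(reference > 0.0f && dynamicRangeDb > 0.0f);
    // Smallest magnitude we distinguish; keeps log10 away from zero.
    const float floorMag = reference * std::pow(10.0f, -dynamicRangeDb / 20.0f);
    const float clamped = std::max(magnitude, floorMag);
    const float db = 20.0f * std::log10(clamped / reference);
    const float unit = 1.0f + db / dynamicRangeDb;
    return std::min(std::max(unit, 0.0f), 1.0f);
}
```

For example, with `reference = 1.0f` and `dynamicRangeDb = 80.0f`, a magnitude of 0.01 (that is, -40 dB) maps to about 0.5.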

Hope this is helpful.

Spheroidal answered 5/11, 2009 at 16:19 Comment(1)

Well, so how would I use it without discarding the phase information? And how is phase applicable to a spectrogram? – Ike 5/11, 2009 at 17:20

If you are getting strange results then one thing to check is the documentation for the FFT library to see how the output is packed. Some routines use a packed format where real/imaginary values are interleaved, or they may begin at the N/2 element and wrap around.

For a sanity check I would suggest creating sample data with known characteristics, e.g. Fs/2, Fs/4 (Fs = sample frequency), and comparing the output of the FFT routine with what you'd expect. Try creating both a sine and a cosine at the same frequency, as these should have the same magnitude in the spectrum but different phases (i.e. the realValue/imagValue will differ, but the sum of squares should be the same).
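A minimal sketch of such a sanity check, using a direct (slow) DFT of a single bin as the reference to compare an FFT library against. The helper names `dftBin` and `makeTone` are made up for this example. Note that a sine sampled at exactly Fs/2 is zero at every sample, so Fs/4 is the safer test frequency for the sine/cosine comparison.

```cpp
#include <cassert>
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

// Direct evaluation of one DFT bin (unscaled, like most forward FFTs).
std::complex<double> dftBin(const std::vector<double>& x, std::size_t k)
{
    assert(!x.empty());
    const double twoPi = 6.283185307179586;
    const double n = static_cast<double>(x.size());
    std::complex<double> acc(0.0, 0.0);
    for (std::size_t i = 0; i < x.size(); ++i) {
        const double angle = -twoPi * static_cast<double>(k) * static_cast<double>(i) / n;
        acc += x[i] * std::complex<double>(std::cos(angle), std::sin(angle));
    }
    return acc;
}

// Test tone with a whole number of cycles per block, so that all of its
// energy lands in a single bin (bin index == cycles).
std::vector<double> makeTone(std::size_t length, std::size_t cycles, bool useCosine)
{
    const double twoPi = 6.283185307179586;
    std::vector<double> x(length);
    for (std::size_t i = 0; i < length; ++i) {
        const double phase = twoPi * static_cast<double>(cycles)
                           * static_cast<double>(i) / static_cast<double>(length);
        x[i] = useCosine ? std::cos(phase) : std::sin(phase);
    }
    return x;
}
```

With `makeTone(512, 128, true)` and `makeTone(512, 128, false)` (both at Fs/4), `std::abs(dftBin(x, 128))` should match for the two tones while `std::arg` differs by a quarter turn.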

If you're intending on using the FFT though then you really need to know how it works mathematically, otherwise you're likely to encounter other strange problems such as aliasing.

Rozamond answered 5/11, 2009 at 13:56 Comment(1)

Well I have checked the profile. My issue is that the numbers I get back from the FFT are meaningless without any idea of what the scale represents. I will update my original question. – Ike 5/11, 2009 at 14:14
