Understanding the shape of spectrograms and n_mels

I am going through these two librosa docs: melspectrogram and stft.

I am working on datasets of audio of variable lengths, but I don't quite get the shapes. For example:

import librosa
import torch

(waveform, sample_rate) = librosa.load('audio_file')
spectrogram = librosa.feature.melspectrogram(y=waveform, sr=sample_rate)
# recent librosa versions require keyword arguments here
dur = librosa.get_duration(y=waveform, sr=sample_rate)
spectrogram = torch.from_numpy(spectrogram)
print(spectrogram.shape)
print(sample_rate)
print(dur)

Output:

torch.Size([128, 150])
22050
3.48

What I get are the following points:

The sample rate means that you get N samples each second, in this case 22050 samples each second.
The window length is the stretch of audio over which each FFT is calculated.
The STFT is the calculation of FFTs over small time windows of the audio.
The shape of the output is (n_mels, t), with t = duration/window_of_fft.

I am trying to understand or calculate:

What is n_fft? I mean what exactly is it doing to the audio wave? I read in the documentation the following:

n_fft : int > 0 [scalar]

length of the windowed signal after padding with zeros. The number of rows in the STFT matrix D is (1 + n_fft/2). The default value, n_fft=2048 samples, corresponds to a physical duration of 93 milliseconds at a sample rate of 22050 Hz, i.e. the default sample rate in librosa.

This means that each window takes 2048 samples, i.e. 1/22050 * 2048 = 93[ms]. So is an FFT being calculated for every 93[ms] of the audio?

So, this means that the window size and window is for filtering the signal in this frame?
In the example above, I understand I am getting 128 number of Mel spectrograms but what exactly does that mean?
And what is hop_length? Reading the docs, I understand that it is how far the window shifts from one FFT window to the next, right? If this value is 512 and n_fft is also 512, what does that mean? Does it mean that it will take a window of 23[ms], calculate the FFT for this window, and skip the next 23[ms]?
How can I specify that I want to overlap from one FFT window to another?

Please help, I have watched many videos of calculating spectrograms but I just can't seem to see it in real life.

The essential parameter for understanding the output dimensions of spectrograms is not necessarily the length of the FFT used (n_fft), but the distance between consecutive FFTs, i.e., the hop_length.

When computing an STFT, you compute the FFT for a number of short segments. These segments have the length n_fft. Usually these segments overlap (in order to avoid information loss), so the distance between two segments is often not n_fft, but something like n_fft/2. The name for this distance is hop_length. It is also defined in samples.
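To make the overlap concrete, here is a minimal sketch in plain Python that lists which samples each segment covers. The helper name frame_spans and the toy sizes (20 samples, n_fft of 8) are made up for illustration, and the sketch does no padding, whereas librosa pads the signal by default.

```python
def frame_spans(n_samples, n_fft, hop_length):
    """Return (start, end) sample indices of each full segment, without padding."""
    spans = []
    start = 0
    # keep only segments that fit entirely inside the signal
    while start + n_fft <= n_samples:
        spans.append((start, start + n_fft))
        start += hop_length
    return spans


# hop_length = n_fft / 2: each segment shares half of its samples with the next one
overlapping = frame_spans(20, n_fft=8, hop_length=4)

# hop_length = n_fft: segments sit back to back, no overlap and nothing is skipped
back_to_back = frame_spans(20, n_fft=8, hop_length=8)

print(overlapping)
print(back_to_back)
```

So the overlap between neighbouring segments is n_fft - hop_length samples. To get overlap in librosa, pass a hop_length smaller than n_fft, for example librosa.feature.melspectrogram(y=waveform, sr=sample_rate, n_fft=2048, hop_length=512).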

So when you have 1000 audio samples and the hop_length is 100, you get about 10 feature frames (the exact count depends on padding: if n_fft is greater than hop_length, the signal may need to be padded so that the frames at the edges are complete).

In your example, you are using the default hop_length of 512. So for audio sampled at 22050 Hz, you get a feature frame rate of

frame_rate = sample_rate/hop_length = 22050 Hz/512 ≈ 43 Hz

Again, padding may change this a little.
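The same arithmetic, written out as a small sketch that converts the sample counts into times. The helper samples_to_ms is a made-up name for illustration. The values are the defaults from your example (n_fft=2048, hop_length=512, 22050 Hz).

```python
def samples_to_ms(n_samples, sample_rate):
    """Convert a length in samples to milliseconds."""
    return 1000.0 * n_samples / sample_rate


sample_rate = 22050
n_fft = 2048
hop_length = 512

window_ms = samples_to_ms(n_fft, sample_rate)    # length of each FFT segment, about 93 ms
hop_ms = samples_to_ms(hop_length, sample_rate)  # step between segments, about 23 ms
frame_rate = sample_rate / hop_length            # feature frames per second, about 43
overlap_fraction = 1 - hop_length / n_fft        # share of each segment reused by the next

print(window_ms, hop_ms, frame_rate, overlap_fraction)
```

In other words, each FFT looks at about 93 ms of audio, but a new one starts every 23 ms, so neighbouring segments overlap by 75% with these settings.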

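So for 10s of audio at 22050 Hz, you get a spectrogram array with dimensions of roughly (128, 430), where 128 is the number of Mel bins and 430 the number of feature frames (in this case, Mel spectra, each one a vector of 128 Mel-band energies). With librosa's default centered padding the exact frame count is 1 + floor(220500/512) = 431.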
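If you want to predict the number of frames before computing anything, the following sketch may help. The function expected_frames is a hypothetical helper, not part of librosa. It assumes librosa's default behaviour (center=True, which pads n_fft // 2 samples on each side), and for center=False it assumes that only segments lying fully inside the signal are kept.

```python
def expected_frames(n_samples, hop_length=512, n_fft=2048, center=True):
    """Estimate the number of STFT frames for a signal of n_samples samples."""
    if center:
        # padding n_fft // 2 on both sides gives one frame per hop, plus one
        return 1 + n_samples // hop_length
    if n_samples < n_fft:
        # no full segment fits (this sketch returns 0 in that case)
        return 0
    return 1 + (n_samples - n_fft) // hop_length


sample_rate = 22050

# 10 s of audio with the default settings
print(expected_frames(10 * sample_rate))

# the roughly 3.48 s clip from the question
print(expected_frames(int(3.48 * sample_rate)))
```

For the clip in the question this gives 150, which matches the torch.Size([128, 150]) you printed. The first dimension is simply n_mels, which defaults to 128 in librosa.feature.melspectrogram.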
