I am going through these two librosa docs: melspectrogram
and stft
.
I am working on datasets of audio of variable lengths, but I don't quite get the shapes. For example:
(waveform, sample_rate) = librosa.load('audio_file')
spectrogram = librosa.feature.melspectrogram(y=waveform, sr=sample_rate)
dur = librosa.get_duration(waveform)
spectrogram = torch.from_numpy(spectrogram)
print(spectrogram.shape)
print(sample_rate)
print(dur)
Output:
torch.Size([128, 150])
22050
3.48
What I get are the following points:
- Sample rate is that you get N samples each second, in this case 22050 samples each second.
- The window length is the FFT calculated for that period of length of the audio.
- STFT is calculation os FFT in small windows of time of audio.
- The shape of the output is (n_mels, t). t = duration/window_of_fft.
I am trying to understand or calculate:
What is n_fft? I mean what exactly is it doing to the audio wave? I read in the documentation the following:
n_fft : int > 0 [scalar]
length of the windowed signal after padding with zeros. The number of rows in the STFT matrix D is (1 + n_fft/2). The default value, n_fft=2048 samples, corresponds to a physical duration of 93 milliseconds at a sample rate of 22050 Hz, i.e. the default sample rate in librosa.
This means that in each window 2048 samples are taken which means that --> 1/22050 * 2048 = 93[ms]. FFT is being calculated for every 93[ms] of the audio?
So, this means that the window size and window is for filtering the signal in this frame?
In the example above, I understand I am getting 128 number of Mel spectrograms but what exactly does that mean?
And what is hop_length? Reading the docs, I understand that it is how to shift the window from one fft window to the next right? If this value is 512 and n_fft = also 512, what does that mean? Does this mean that it will take a window of 23[ms], calculate FFT for this window and skip the next 23[ms]?
How can I specify that I want to overlap from one FFT window to another?
Please help, I have watched many videos of calculating spectrograms but I just can't seem to see it in real life.