I would like to point to this question and answer in particular: How do I obtain the frequencies of each value in an FFT? Together with the documentation for the STFT from librosa, it tells us that the horizontal axis is time and the vertical axis is frequency. Each column in the spectrogram is the FFT of a short slice of the signal, taken with a window of n_fft=256 samples centred on that time point.
We also know that there is a hop length, which tells us how many audio samples to skip before computing the next FFT. By default this is n_fft / 4, so every 256 / 4 = 64 samples in your audio we compute a new FFT of n_fft=256 points centred at that sample. If you want the exact time each window is centred at, that is simply i / Fs, with i being the sample index of the window centre, which is a multiple of 64.
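As a quick illustration of the i / Fs relationship, here is a minimal sketch that lists the centre sample and centre time of the first few windows. The sampling frequency of 16384 Hz and the frame count of 5 are assumed values chosen only for the illustration; substitute your own.

```python
import numpy as np

Fs = 16384                 # assumed sampling frequency in Hz (illustrative)
n_fft = 256                # FFT window length in samples
hop_length = n_fft // 4    # default hop: 64 samples

n_frames = 5               # hypothetical number of frames to inspect

# Sample index at which each window is centred: 0, 64, 128, ...
centre_samples = np.arange(n_frames) * hop_length

# Convert sample indices to seconds with i / Fs
centre_times = centre_samples / Fs

for frame, (i, t) in enumerate(zip(centre_samples, centre_times)):
    print(f"frame {frame}: centred at sample {i}, t = {t:.6f} s")
```

As far as I know, librosa.frames_to_time performs the same conversion when given matching sr and hop_length values.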
Now, for each FFT window: for real signals the spectrum is conjugate-symmetric, so we only consider the non-negative-frequency half of the FFT. The documentation confirms this: the number of rows, and hence the number of frequency components, is 1 + n_fft / 2, where the extra 1 accounts for the DC component. Given that, and following the post above, the relationship from bin number to frequency is i * Fs / n_fft, with i being the bin number, Fs the sampling frequency and n_fft=256 the number of points in the FFT window. Since we are only looking at the half spectrum, instead of i spanning from 0 to n_fft - 1, it spans from 0 to n_fft / 2 inclusive, which gives 1 + n_fft / 2 bins. The bins beyond n_fft / 2 would simply be the mirrored version of the half spectrum, so we do not consider frequency components beyond Fs / 2 Hz.
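To see the symmetry argument concretely, the sketch below compares NumPy's full FFT of a real-valued frame with its one-sided FFT. The random test frame is purely an assumption for illustration; any real signal of length n_fft should behave the same way.

```python
import numpy as np

n_fft = 256
rng = np.random.default_rng(0)
x = rng.standard_normal(n_fft)   # arbitrary real-valued test frame (illustrative)

full = np.fft.fft(x)    # all n_fft bins, indices 0 .. n_fft - 1
half = np.fft.rfft(x)   # one-sided spectrum, indices 0 .. n_fft / 2

# For a real input, bin n_fft - k is the complex conjugate of bin k,
# so the upper half carries no extra information.
k = np.arange(1, n_fft // 2)
mirrored_ok = np.allclose(full[n_fft - k], np.conj(full[k]))

print(len(full), len(half), mirrored_ok)
```

The one-sided result has 1 + n_fft / 2 entries, which matches the number of rows in the spectrogram.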
If you wanted to generate a NumPy array of these frequencies, you could just do:
import numpy as np
freqs = np.arange(0, 1 + n_fft / 2) * Fs / n_fft
freqs is an array that maps each bin number in the FFT to its corresponding frequency. As an illustrative example, suppose our sampling frequency is 16384 Hz and n_fft = 256. Therefore:
In [1]: import numpy as np
In [2]: Fs = 16384
In [3]: n_fft = 256
In [4]: np.arange(0, 1 + n_fft / 2) * Fs / n_fft
Out[4]:
array([ 0., 64., 128., 192., 256., 320., 384., 448., 512.,
576., 640., 704., 768., 832., 896., 960., 1024., 1088.,
1152., 1216., 1280., 1344., 1408., 1472., 1536., 1600., 1664.,
1728., 1792., 1856., 1920., 1984., 2048., 2112., 2176., 2240.,
2304., 2368., 2432., 2496., 2560., 2624., 2688., 2752., 2816.,
2880., 2944., 3008., 3072., 3136., 3200., 3264., 3328., 3392.,
3456., 3520., 3584., 3648., 3712., 3776., 3840., 3904., 3968.,
4032., 4096., 4160., 4224., 4288., 4352., 4416., 4480., 4544.,
4608., 4672., 4736., 4800., 4864., 4928., 4992., 5056., 5120.,
5184., 5248., 5312., 5376., 5440., 5504., 5568., 5632., 5696.,
5760., 5824., 5888., 5952., 6016., 6080., 6144., 6208., 6272.,
6336., 6400., 6464., 6528., 6592., 6656., 6720., 6784., 6848.,
6912., 6976., 7040., 7104., 7168., 7232., 7296., 7360., 7424.,
7488., 7552., 7616., 7680., 7744., 7808., 7872., 7936., 8000.,
8064., 8128., 8192.])
In [5]: freqs = _; len(freqs)
Out[5]: 129
We can see that we have generated a 1 + n_fft / 2 = 129 element array that gives the frequency of each bin number.
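If you would rather not build the array by hand, the following sketch cross-checks the manual formula against np.fft.rfftfreq, which returns the bin frequencies of a one-sided FFT. The values of Fs and n_fft are the ones from the example above.

```python
import numpy as np

Fs = 16384     # sampling frequency in Hz, as in the example above
n_fft = 256    # FFT window length

# Manual mapping from bin number to frequency: i * Fs / n_fft
freqs = np.arange(0, 1 + n_fft // 2) * Fs / n_fft

# NumPy's built-in equivalent; d is the sample spacing in seconds
freqs_np = np.fft.rfftfreq(n_fft, d=1.0 / Fs)

print(len(freqs), freqs[1], freqs[-1])
print(np.allclose(freqs, freqs_np))
```

To my knowledge librosa.fft_frequencies(sr=Fs, n_fft=n_fft) returns the same array, but check it against your installed version.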
A word of caution
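Take note that librosa.display.specshow has a default sampling rate of 22050 Hz, so if you don't set the sampling rate (sr) to the sampling frequency of your audio signal, the vertical and horizontal axes will not be correct. Make sure you specify the sr argument to match the sampling frequency of the incoming audio. The same applies to hop_length, which defaults to 512 in specshow: with the setup above you should pass hop_length=64 as well, otherwise the time axis will be scaled incorrectly.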