Why is the mel feature of my 8kHz wav file extracted differently with sr = 16kHz and sr = 44.1kHz?

I'm currently extracting mel features from my baby cry sound dataset and the wav files' sampling rate is 8kHz, 16bit, mono and about 7 sec.

[Figure: Mel-Spectrogram when sr = 16000] [Figure: Mel-Spectrogram when sr = 44100]

But as you can see, the values of the mel-spectrogram change whenever I extract features with a different sampling rate sr. Since the wav file's sampling rate is 8kHz, I expected the frequency values (in Hz) to stay the same for any sr of 16kHz or higher.

I converted the wav file's sampling rate from 8kHz to 44.1kHz and extracted the features again, but nothing changed.

This is my code:

import librosa
import librosa.display
import matplotlib.pyplot as plt
import numpy as np

sr = 44100  # or 16000
frame_length = 0.1   # analysis window length in seconds
frame_stride = 0.01  # hop between windows in seconds
path = '...'
train = []

# Load up to 5 s of audio, resampled to the requested rate sr
j, sr = librosa.load(path + '001.wav', sr=sr, duration=5.0)
input_nfft = int(round(sr * frame_length))
input_stride = int(round(sr * frame_stride))
mel = librosa.feature.melspectrogram(y=j, n_mels=128, n_fft=input_nfft, hop_length=input_stride, sr=sr)
train.append(mel)

plt.figure(figsize=(10,4))  
librosa.display.specshow(librosa.power_to_db(train[0], ref=np.max), y_axis='mel', sr=sr, hop_length=input_stride, x_axis='time')  
plt.colorbar(format='%+2.0f dB')  
plt.title('Mel-Spectrogram')  
plt.tight_layout()  
plt.show()  

The values on the y-axis should be the same whether sr = 44100 or 16000, and I don't understand why they differ.

Coldhearted answered 16/7, 2019 at 9:0 Comment(1)
This is where I got the dataset: github.com/gveres/donateacry-corpus/tree/master/… - Coldhearted

When you are asking librosa to create a mel spectrogram, you are asking it to perform two steps:

Fourier Transform-based Spectrum

First you are asking it to create an FFT-based spectrogram over the possible range. To understand what the possible range is, you must understand the Nyquist-Shannon theorem, which (roughly) states that you cannot represent frequencies over sr/2 Hz (sr = sample rate), when you sample a signal at sr Hz. So the possible frequency range for a signal sampled at 44.1kHz is 0 to 22.05 kHz.

librosa produces a regular linearly spaced spectrogram as intermediate result. The frequency range is 0 to sr/2 Hz.
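To make the 0 to sr/2 range concrete, here is a minimal sketch in plain Python that lists the bin frequencies of a real-input FFT for the two sampling rates from the question. The helper `fft_bin_frequencies` is a hypothetical name introduced for illustration, not a librosa function, and the 0.1 s frame length is taken from the question's code.

```python
def fft_bin_frequencies(sr, n_fft):
    """Frequencies (Hz) of the non-negative FFT bins of a real signal.

    Hypothetical helper for illustration: bin k sits at k * sr / n_fft.
    """
    return [k * sr / n_fft for k in range(n_fft // 2 + 1)]


frame_length = 0.1  # seconds, as in the question's code

for sr in (16000, 44100):
    n_fft = int(round(sr * frame_length))
    freqs = fft_bin_frequencies(sr, n_fft)
    # The last bin is the highest frequency the spectrogram can represent.
    print(f"sr={sr}: {len(freqs)} bins, spacing {freqs[1]} Hz, top bin {freqs[-1]} Hz")
```

Because n_fft is derived from a fixed frame length in seconds, the spacing between bins is the same for both rates. Only the top of the range, sr/2, changes.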

Mel Spectrogram

Unlike a regular FT-based spectrogram, a mel spectrogram does not have a linear frequency scale but an (almost) logarithmic one. To map the FT-based spectrogram to this scale, all available data is mapped to a fixed number of (almost) logarithmically spaced bins. The number of bins is specified as n_mels, i.e. the number of mel bands.

Putting it Together

So for n_mels = 128, if you have a signal that is sampled at 44.1 kHz, you can represent a range from 0 to 22.05 kHz. This range is mapped onto 128 logarithmically spaced bands. If your signal is sampled at 16 kHz, you can represent a range from 0 to 8 kHz. This range is also mapped onto 128 logarithmically spaced bands, i.e., the range from 0-8 kHz is split into 128 parts, and not the range from 0-22.05 kHz. The same band index therefore covers different frequencies in the two cases, which must lead to different results.
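The effect can be sketched numerically. The snippet below is a rough illustration, not librosa's implementation: it uses the HTK-style mel formula (an assumption; librosa defaults to the Slaney variant, so its exact numbers will differ), and `hz_to_mel`, `mel_to_hz` and `mel_points` are hypothetical helpers written here for the example.

```python
import math


def hz_to_mel(f_hz):
    # HTK-style mel formula (assumption; librosa's default is the Slaney variant).
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(m):
    # Inverse of hz_to_mel above.
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def mel_points(n_mels, fmin, fmax):
    """n_mels frequencies (Hz), evenly spaced on the mel scale from fmin to fmax."""
    lo, hi = hz_to_mel(fmin), hz_to_mel(fmax)
    return [mel_to_hz(lo + i * (hi - lo) / (n_mels - 1)) for i in range(n_mels)]


n_mels = 128
for sr in (16000, 44100):
    points = mel_points(n_mels, 0.0, sr / 2)  # default range: 0 to sr/2
    above_4k = sum(1 for f in points if f > 4000.0)
    print(f"sr={sr}: point 64 sits near {points[64]:.0f} Hz, "
          f"{above_4k} of {n_mels} points lie above 4 kHz")
```

For an 8 kHz recording, everything above 4 kHz carries no original content, so the higher the sampling rate, the more of the 128 bands are spent on an empty part of the spectrum.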

The Solution

If you want to make sure that the frequency range that is mapped to n_mels mel bands is always the same, regardless of sampling rate, you must specify the keyword parameters fmin and fmax (see the documentation of librosa.feature.melspectrogram).

E.g.:

fmin = 0.
fmax = 4000.  # Nyquist frequency of your original 8 kHz recording
mel = librosa.feature.melspectrogram(y=j, n_mels=128,
                                     n_fft=input_nfft,
                                     hop_length=input_stride,
                                     sr=sr,
                                     fmin=fmin,
                                     fmax=fmax)
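As a quick sanity check of this idea, the sketch below compares mel-spaced frequency points with and without a fixed range. It is plain Python rather than librosa: the HTK-style formula is an assumption (librosa's default mel scale is the Slaney variant), and `mel_points` is a hypothetical helper that only mimics the "fmax falls back to sr/2" default.

```python
import math


def hz_to_mel(f_hz):
    # HTK-style mel formula (assumption; not librosa's default variant).
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def mel_points(n_mels, sr, fmin=0.0, fmax=None):
    """Evenly spaced mel-scale points in Hz; fmax falls back to sr / 2."""
    if fmax is None:
        fmax = sr / 2
    if fmax > sr / 2:
        raise ValueError("fmax must not exceed the Nyquist frequency sr / 2")
    lo, hi = hz_to_mel(fmin), hz_to_mel(fmax)
    return [mel_to_hz(lo + i * (hi - lo) / (n_mels - 1)) for i in range(n_mels)]


# With an explicit range, the points no longer depend on the sampling rate.
fixed_same = mel_points(128, 16000, fmax=4000.0) == mel_points(128, 44100, fmax=4000.0)
# With the default range (0 to sr/2), they do.
default_same = mel_points(128, 16000) == mel_points(128, 44100)
print("fixed range identical:", fixed_same)
print("default range identical:", default_same)
```

The same applies to plotting: as far as I know, librosa.display.specshow also accepts fmin and fmax, and with y_axis='mel' it otherwise labels the axis from 0 to sr/2, so the same values probably need to be passed there for the axes of two plots to line up.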
Zone answered 16/7, 2019 at 10:28 Comment(3)
If my answer answered your question, please accept it. Thank you. - Zone
Thanks for this answer. I stumbled across it while facing the same issue and tried your solution, but I still see a frequency shift from one sr to the other. My srs are 48k and 28k, with fmin=40 and fmax=14000. I plot both using specshow. The first plot has a y-range up to 24kHz, while the second goes up to 14kHz, and the frequency axes do not align. I have made sure that the sr is at least 2x my frequency range in each case. Just checking to see if you have any other ideas about what might cause it. - Iridissa
I'd suggest posting a minimal reproducible example as a new question. Perhaps it's just a minor coding issue. - Zone
