How to handle dynamic input size for audio spectrograms used in CNNs?

A lot of articles use CNNs to extract audio features. The input data is a spectrogram with two dimensions, time and frequency.

When creating an audio spectrogram, you need to specify the exact size of both dimensions, but they are usually not fixed. The size of the frequency dimension can be set through the window size, but what about the time dimension? Audio clips have different lengths, while the input size of a CNN has to be fixed.

In my datasets, the audio length ranges from 1 s to 8 s. Padding or cutting always impacts the results too much.
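
For concreteness, this is what I mean: the frequency dimension stays fixed while the number of time frames grows with the clip length. A quick illustration with dummy clips (using librosa, which the answers below also use):

import numpy as np
import librosa

# Dummy 1 s and 8 s clips: the number of frequency bins is fixed by n_fft,
# but the number of time frames scales with the clip length.
sr = 22050
for seconds in (1, 8):
    y = np.zeros(sr * seconds)
    spec = librosa.stft(y, n_fft=2048, hop_length=1024)
    print(spec.shape)  # (1025, 22) for 1 s vs (1025, 173) for 8 s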

So I want to know more about how to handle this.

Diamagnet asked 5/5, 2016 at 7:40

CNNs are computed on a frame-window basis: you take, say, the 30 surrounding frames and train the CNN to classify the center frame. In this case you need frame labels, which you can get from another speech recognition toolkit.
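
A minimal sketch of that windowing step (my illustration, not from the original answer; it assumes frame_labels comes from an alignment produced by another toolkit):

import numpy as np

def frame_windows(spectrogram, frame_labels, context=15):
    # Pad the time axis so every frame has a full 30-frame (2 * context) window.
    n_freq, n_frames = spectrogram.shape
    padded = np.pad(spectrogram, ((0, 0), (context, context)), mode='edge')
    # One fixed-size training example per center frame:
    # examples has shape (n_frames, n_freq, 2 * context)
    examples = np.stack([padded[:, t:t + 2 * context] for t in range(n_frames)])
    return examples, np.asarray(frame_labels)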

If you want pure neural network decoding, you'd better train a recurrent neural network (RNN); RNNs accept arbitrary-length inputs. To increase the accuracy of an RNN, you should also add a CTC layer, which lets the network learn the alignment between input frames and output labels without explicit frame labels.
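
For illustration, here is a minimal PyTorch sketch of that idea (my own code, not from the answer; the sizes, the GRU choice, and the 32-symbol output with blank index 0 are all assumptions):

import torch
import torch.nn as nn

class CTCModel(nn.Module):
    def __init__(self, n_mels=128, hidden=256, n_symbols=32):
        super().__init__()
        # A bidirectional GRU accepts sequences of any length
        self.rnn = nn.GRU(n_mels, hidden, batch_first=True, bidirectional=True)
        self.fc = nn.Linear(2 * hidden, n_symbols)

    def forward(self, x):            # x: (batch, time, n_mels), time may vary
        out, _ = self.rnn(x)
        return self.fc(out)          # (batch, time, n_symbols)

model = CTCModel()
ctc_loss = nn.CTCLoss(blank=0)       # CTC learns the alignment itself

x = torch.randn(1, 300, 128)                 # one clip, 300 spectrogram frames
log_probs = model(x).log_softmax(-1)         # CTC expects log-probabilities
targets = torch.tensor([[5, 2, 9]])          # dummy label sequence, no frame alignment needed
loss = ctc_loss(log_probs.transpose(0, 1),   # CTCLoss wants (time, batch, classes)
                targets,
                torch.tensor([300]),         # input length
                torch.tensor([3]))           # target length
loss.backward()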

If you are interested in the subject, you can try https://github.com/srvk/eesen, a toolkit designed for end-to-end speech recognition with recurrent neural networks.

Also related: Applying neural network to MFCCs for variable-length speech segments

Eller answered 6/5, 2016 at 13:51
Comment from Cowage: What do you mean by "30 surrounding frames"?

OK, I finally found a paper that talks about this. In the paper they say:

All audio clips were standardized by padding/clipping to a 4 second duration

So yes, the padding/clipping that you say impacts your results too much is exactly what they do in papers, from what I can see.

An example of this kind of application is UrbanSoundDataset. It's a dataset of audio clips of different lengths, so any paper that uses it (with a non-RNN network) is forced to use this or some other approach to convert the sounds to fixed-length vectors/matrices. I recommend the papers Deep Convolutional Neural Networks and Data Augmentation for Environmental Sound Classification and Environmental Sound Classification with Convolutional Neural Networks. The latter has its code open-sourced, and you can see it also pads/clips audio to 4 seconds in the function _load_audio in this notebook.

How to pad/clip audio

from pydub import AudioSegment

duration_ms = 4000                                      # the target length you want
path = 'your-wav-file.wav'                              # the clip to pad/clip
audio = AudioSegment.silent(duration=duration_ms)       # silent base of the target length
audio = audio.overlay(AudioSegment.from_wav(path))      # overlay pads short clips and truncates long ones
raw = audio.split_to_mono()[0].get_array_of_samples()   # I only keep the left channel
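
If you then want to feed this into the librosa step below, you can convert pydub's integer samples to floats first (my glue code, assuming 16-bit WAV input):

import numpy as np

# pydub returns integer samples; scale 16-bit values to floats in [-1, 1]
# (check audio.sample_width if your files are not 16-bit).
y = np.array(raw, dtype=np.float32) / 32768.0
sr = audio.frame_rate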

Mel-spectrogram

The standard for this kind of application is to use a mel spectrogram. You could use the Python library Essentia and follow this example, or use librosa like this:

import librosa

# Attention: I do not pad/clip the audio in this example
y, sr = librosa.load('your-wav-file.wav')
mel_spect = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=2048, hop_length=1024)
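
If you want to combine the padding/clipping and the mel spectrogram in one pass with librosa alone, a sketch could look like this (my own helper, not from the answer; fixed_length_melspec and its defaults are assumptions):

import librosa

def fixed_length_melspec(path, sr=22050, duration_s=4.0, n_fft=2048, hop_length=1024):
    # Load, zero-pad or truncate to a fixed duration, then compute the mel spectrogram
    y, _ = librosa.load(path, sr=sr)
    y = librosa.util.fix_length(y, size=int(sr * duration_s))
    return librosa.feature.melspectrogram(y=y, sr=sr, n_fft=n_fft, hop_length=hop_length)

spec = fixed_length_melspec('your-wav-file.wav')
print(spec.shape)   # fixed shape, e.g. (128, 87) with the defaults above
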
Cowage answered 22/9, 2020 at 15:26
