Ok, finally I found a paper that talks about it. In the paper they say:
All audio clips were standardized by padding/clipping to a 4 second duration
So yes, the padding/clipping you say impacts your performance is exactly what papers do, as far as I can see.
An example of this kind of application is the UrbanSoundDataset. It is a dataset of audio clips of different lengths, so any paper that uses it (with a non-RNN network) is forced to use this or some other approach that converts the sounds to vectors/matrices of the same length. I recommend the paper Deep Convolutional Neural Networks and Data Augmentation for Environmental Sound Classification or ENVIRONMENTAL SOUND CLASSIFICATION WITH CONVOLUTIONAL NEURAL NETWORKS. The latter has its code open-sourced, and you can see that it also pads/clips audio to 4 seconds in the function _load_audio in this notebook.
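If you already have the waveform as an array, the same standardization is a few lines of NumPy. This is just a minimal sketch of the idea (the pad_or_clip helper, the trailing zero-padding, and the 4-second target taken from the quote above are my assumptions), shown as an alternative to the pydub approach below:
import librosa
import numpy as np

def pad_or_clip(y, sr, target_seconds=4.0):
    # pad with trailing zeros, or clip, so y lasts exactly target_seconds
    target_len = int(sr * target_seconds)
    if len(y) < target_len:
        return np.pad(y, (0, target_len - len(y)))  # zero-pad at the end
    return y[:target_len]  # drop the excess samples

y, sr = librosa.load('your-wav-file.wav')
y = pad_or_clip(y, sr)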
How to clip audio
from pydub import AudioSegment

path = 'your-wav-file.wav'  # path to the input clip
duration_ms = 4000  # the target length in milliseconds (4 seconds, per the papers above)
audio = AudioSegment.silent(duration=duration_ms)  # silent segment of the target length
audio = audio.overlay(AudioSegment.from_wav(path))  # pads short clips with silence, truncates long ones
raw = audio.split_to_mono()[0].get_array_of_samples()  # I only keep the left channel
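From there, raw is a plain array.array of integer samples. Converting it to a float NumPy array (a step I am adding here, assuming 16-bit PCM audio) makes it easy to feed into librosa:
import numpy as np

sr = audio.frame_rate  # sample rate, needed by librosa.feature.melspectrogram
samples = np.array(raw).astype(np.float32) / 32768.0  # assumes 16-bit PCM; scales to [-1, 1]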
Mel-spectrogram
The standard is to use a mel-spectrogram for this kind of application. You could use the Python library Essentia and follow this example, or use librosa like this:
import librosa

# Note: this example does not cut/pad the audio first
y, sr = librosa.load('your-wav-file.wav')
mel_spect = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=2048, hop_length=1024)
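CNNs are usually trained on the log-scaled version of the mel-spectrogram rather than the raw power values; a minimal sketch of that extra step (the ref=np.max choice is my assumption, not from the papers above):
import numpy as np

# convert the power spectrogram to decibels, referenced to the loudest bin
log_mel_spect = librosa.power_to_db(mel_spect, ref=np.max)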