As part of a fun at-home research project, I am trying to find a way to reduce/convert a song to a humming-like audio signal (the underlying melody that we humans perceive when we listen to a song). Before I proceed any further in describing my attempt at this problem, I would like to mention that I am totally new to audio analysis, though I have a lot of experience with analyzing images and videos.
After googling a bit, I found a bunch of melody extraction algorithms. Given a polyphonic audio signal of a song (e.g., a .wav file), they output a pitch track --- at each point in time they estimate the dominant pitch (coming from a singer's voice or some melody-generating instrument) and track it over time.
I read a few papers, and they seem to compute a short-time Fourier transform of the song and then do some analysis on the spectrogram to extract and track the dominant pitch. Melody extraction is only a component of the system I am trying to develop, so I don't mind using any algorithm, as long as it does a decent job on my audio files and the code is available. Since I am new to this, I would be happy to hear any suggestions on which algorithms are known to work well and where I can find their code.
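Just to make that concrete, a spectrogram of the kind these papers analyze can be plotted in MATLAB with something like the following (requires the Signal Processing Toolbox; 'song.wav' is a placeholder file name):

% read the song and plot its spectrogram/short-time Fourier transform
[x, Fs] = audioread('song.wav');
x = mean(x, 2);  % downmix stereo to mono
spectrogram(x, hann(2048), 1536, 2048, Fs, 'yaxis');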
I found two algorithms and chose Melodia, as its results on different music genres looked quite impressive. Please check this link to hear its results --- the humming you hear for each piece of music is essentially what I am interested in.
"It is the generation of this humming for any arbitrary song, that I want your help with in this question".
The algorithm (available as a Vamp plugin) outputs a pitch track --- [time_stamp, pitch/frequency] --- an Nx2 matrix where the first column is the time-stamp (in seconds) and the second column is the dominant pitch detected at that time-stamp. Shown below is a visualization of the pitch track obtained from the algorithm, overlaid in purple on a song's time-domain signal (above) and its spectrogram/short-time Fourier transform. Negative values of pitch/frequency represent the algorithm's dominant-pitch estimate for unvoiced/non-melodic segments. So all pitch estimates >= 0 correspond to the melody; the rest are not important to me.
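For concreteness, here is how I load such a pitch track into MATLAB, assuming the plugin's output has been exported to a CSV file (e.g., from Sonic Visualiser; the file name 'pitchtrack.csv' is just a placeholder):

melody = csvread('pitchtrack.csv');  % Nx2 matrix: [time_s, freq_Hz]
isVoiced = melody(:,2) >= 0;         % negative frequencies mark unvoiced segments
melody(~isVoiced, 2) = 0;            % treat unvoiced segments as silence (0 Hz)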
Now I want to convert this pitch track back into a humming-like audio signal, just like the authors have on their website.
Below is a MATLAB function that I wrote to do this:
function [melSignal] = melody2audio(melody, varargin)
% melSignal = melody2audio(melody)
% melSignal = melody2audio(melody, 'Fs', 44100)
% melSignal = melody2audio(melody, 'Fs', 44100, 'synthtype', 'sine')
%
% Convert a melody/pitch-track to a time-domain signal.
%
% Inputs:
%
% melody - [time-stamp, dominant-frequency]
%          an Nx2 matrix with time-stamps (in seconds) in the
%          first column and the detected dominant frequency (in Hz)
%          at the corresponding time-stamp in the second column.
%
% synthtype - string choosing the synthesis method,
%             passed to the synth function in synth.m;
%             current choices are: 'fm', 'sine' or 'saw'.
%             default = 'fm'
%
% Fs - sampling frequency in Hz
%      default = 44.1e3
%
% Output:
%
% melSignal - time-domain representation of the melody.
%             When you play this, you are supposed to hear
%             a humming of the input melody/pitch-track.

p = inputParser;
p.addRequired('melody', @isnumeric);
p.addParamValue('Fs', 44100, @(x) isnumeric(x) && isscalar(x));
p.addParamValue('synthtype', 'fm', @(x) ismember(x, {'fm', 'sine', 'saw'}));
p.addParamValue('amp', 60/127, @(x) isnumeric(x) && isscalar(x));
p.parse(melody, varargin{:});

% get parameter values
Fs        = p.Results.Fs;
synthtype = p.Results.synthtype;
amp       = p.Results.amp;

% generate melody
numTimePoints = size(melody, 1);
endtime   = melody(end, 1);
melSignal = zeros(1, ceil(endtime*Fs));

h = waitbar(0, 'Generating Melody Audio');
for i = 1:numTimePoints
    % frequency (clamp negative, i.e. unvoiced, estimates to 0 Hz)
    freq = max(0, melody(i, 2));

    % start sample and duration: each note spans from the previous
    % time-stamp up to the current one
    if i > 1
        n1  = floor(melody(i-1, 1)*Fs) + 1;
        dur = melody(i, 1) - melody(i-1, 1);
    else
        n1  = 1;
        dur = melody(i, 1);
    end

    % synthesize/generate a short signal of the given frequency
    sig = synth(freq, dur, amp, Fs, synthtype);

    % add the note into the whole signal, guarding against
    % rounding the last note past the end of the output buffer
    N = min(length(sig), length(melSignal) - n1 + 1);
    melSignal(n1:n1+N-1) = melSignal(n1:n1+N-1) + reshape(sig(1:N), 1, []);

    % update status
    waitbar(i/numTimePoints, h);
end
close(h);
end
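A quick usage sketch (the parameter values here are just the ones I use; soundsc auto-scales for playback, and the audiowrite line normalizes to avoid clipping):

melSignal = melody2audio(melody, 'Fs', 44100, 'synthtype', 'sine');
soundsc(melSignal, 44100);  % play the synthesized humming
audiowrite('humming.wav', 0.99*melSignal/max(abs(melSignal)), 44100);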
The underlying logic of this code is the following: at each time-stamp, I synthesize a short-lived wave (say, a sine wave) with frequency equal to the detected dominant pitch/frequency at that time-stamp, lasting from the previous time-stamp in the input melody matrix up to the current one. I only wonder if I am doing this right.
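Since synth.m lives in a separate file, here is a minimal sine-only stand-in so the function above is self-contained (the real synth also implements the 'fm' and 'saw' types):

function sig = synth(freq, dur, amp, Fs, ~)
% minimal stand-in: a constant-amplitude sine tone of the given
% frequency and duration (the synthesis-type argument is ignored)
t = 0:1/Fs:dur - 1/Fs;
sig = amp * sin(2*pi*freq*t);
end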
Then I take the audio signal I get from this function and play it together with the original song (melody on the left channel, original song on the right channel). Though the generated audio signal seems to segment the melody-generating sources (voice/lead instrument) fairly well --- it's active where the voice is and zero everywhere else --- the signal itself is far from the humming that the authors show on their website (I get something like beep beep beeeeep beep beeep beeeeeeeep). Specifically, below is a visualization showing the time-domain signal of the input song at the bottom and the time-domain signal of the melody generated using my function.
One main issue is this: though I am given the frequency of the wave to generate at each time-stamp, as well as its duration, I don't know how to set the amplitude of the wave. For now, I set the amplitude to a flat/constant value, and I suspect this is where the problem is.
Does anyone have any suggestions on this? I welcome suggestions in any programming language (preferably MATLAB, Python, or C++), but I guess my question here is more general --- how do I generate the wave at each time-stamp?
A few ideas/fixes I have in mind (rough sketches follow the list):
- Set the amplitude by taking an averaged/max estimate of the amplitude from the time-domain signal of the original song.
- Totally change my approach: compute the spectrogram/short-time Fourier transform of the song's audio signal, zero out (hard) or attenuate (softly) all frequencies except the ones in my pitch track (or close to it), and then compute the inverse short-time Fourier transform to get the time-domain signal back.
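For idea 1, something like the following could work (a rough sketch; it assumes the original mono audio and its sampling rate are in variables song and Fs, and the 50 ms RMS window is an arbitrary choice of mine):

% short-time RMS envelope of the original song
winLen = round(0.05*Fs);                           % 50 ms window (assumed)
env = sqrt(movmean(song(:).^2, winLen));           % per-sample RMS envelope
idx = min(max(round(melody(:,1)*Fs), 1), numel(env));
noteAmp = env(idx);                                % one amplitude per time-stamp
% then, inside melody2audio's loop:
%   sig = synth(freq, dur, noteAmp(i), Fs, synthtype);

And a rough sketch of idea 2, assuming MATLAB R2019a or newer (for stft/istft), mono audio in song/Fs, and a 50 Hz tolerance around the pitch track (again an arbitrary choice):

win = hann(2048, 'periodic');
hop = 512;
[S, F, T] = stft(song, Fs, 'Window', win, 'OverlapLength', 2048-hop);
% nearest pitch estimate for each STFT frame (0 Hz where unvoiced)
f0 = interp1(melody(:,1), max(melody(:,2), 0), T, 'nearest', 0);
mask = abs(abs(F) - f0(:).') < 50;  % keep only bins near the pitch track
S(~mask) = 0;                       % zero out everything else
melSignal = real(istft(S, Fs, 'Window', win, 'OverlapLength', 2048-hop));

Note that keeping only the bins around the fundamental discards the harmonics, so this would sound closer to a pure tone than to the original voice; the mask could also be repeated at integer multiples of the pitch track to retain harmonics.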