How to convert a pitch track from a melody extraction algorithm to a humming-like audio signal

As part of a fun at-home research project, I am trying to find a way to reduce/convert a song to a humming-like audio signal (the underlying melody that we humans perceive when we listen to a song). Before I proceed any further in describing my attempt on this problem, I would like to mention that I am totally new to audio analysis, though I have a lot of experience with analyzing images and videos.

After googling a bit, I found a bunch of melody extraction algorithms. Given a polyphonic audio signal of a song (e.g., a .wav file), they output a pitch track --- at each point in time they estimate the dominant pitch (coming from a singer's voice or some melody-generating instrument) and track it over time.

I read a few papers, and they seem to compute a short-time Fourier transform of the song, and then do some analysis on the spectrogram to get and track the dominant pitch. Melody extraction is only a component in the system I am trying to develop, so I don't mind using any algorithm, as long as it does a decent job on my audio files and the code is available. Since I am new to this, I would be happy to hear any suggestions on which algorithms are known to work well and where I can find their code.

I found two algorithms:

  1. YAAPT pitch tracking
  2. Melodia

I chose Melodia as the results on different music genres looked quite impressive. Please check this to see its results. The humming that you hear for each piece of music is essentially what I am interested in.

"It is the generation of this humming for any arbitrary song, that I want your help with in this question".

The algorithm (available as a Vamp plugin) outputs a pitch track --- [time_stamp, pitch/frequency] --- an Nx2 matrix whose first column is the time-stamp (in seconds) and whose second column is the dominant pitch detected at the corresponding time-stamp. Shown below is a visualization of the pitch track obtained from the algorithm, overlaid in purple on a song's time-domain signal (above) and its spectrogram/short-time Fourier transform. Negative values of pitch/frequency represent the algorithm's dominant pitch estimate for unvoiced/non-melodic segments. So all pitch estimates >= 0 correspond to the melody; the rest are not important to me.

Pitch-track overlay with a song's waveform and spectrogram

Now I want to convert this pitch track back to a humming-like audio signal -- just like the authors have it on their website.

Below is a MATLAB function that I wrote to do this:

function [melSignal] = melody2audio(melody, varargin)
% melSignal = melody2audio(melody, 'Fs', Fs, 'synthtype', synthtype, 'amp', amp)
% melSignal = melody2audio(melody, 'Fs', Fs)
% melSignal = melody2audio(melody)
%
% Convert a melody/pitch-track to a time-domain signal
%
% Inputs:
%
%     melody - [time-stamp, dominant-frequency]
%           an Nx2 matrix with the time-stamp (in seconds) in the
%           first column and the detected dominant frequency (in Hz)
%           at the corresponding time-stamp in the second column.
%
%     'synthtype' - string choosing the synthesis method,
%           passed to the synth function in synth.m;
%           current choices are: 'fm', 'sine' or 'saw'
%           default = 'fm'
%
%     'Fs' - sampling frequency in Hz
%           default = 44.1e3
%
%     'amp' - constant amplitude used for every note
%           default = 60/127
%
%   Output:
%   
%     melSignal -- time-domain representation of the 
%                  melody. When you play this, you 
%                  are supposed to hear a humming
%                  of the input melody/pitch-track
% 

    p = inputParser;
    p.addRequired('melody', @isnumeric);
    p.addParamValue('Fs', 44100, @(x) isnumeric(x) && isscalar(x));
    p.addParamValue('synthtype', 'fm', @(x) ismember(x, {'fm', 'sine', 'saw'}));
    p.addParamValue('amp', 60/127,  @(x) isnumeric(x) && isscalar(x));
    p.parse(melody, varargin{:});

    parameters = p.Results;

    % get parameter values
    Fs = parameters.Fs;
    synthtype = parameters.synthtype;
    amp = parameters.amp;

    % generate melody
    numTimePoints = size(melody,1);
    endtime = melody(end,1);
    melSignal = zeros(1, ceil(endtime*Fs));

    h = waitbar(0, 'Generating Melody Audio' );

    for i = 1:numTimePoints

        % frequency
        freq = max(0, melody(i,2));

        % duration
        if i > 1
            n1 = floor(melody(i-1,1)*Fs)+1;
            dur = melody(i,1) - melody(i-1,1);
        else
            n1 = 1;
            dur = melody(i,1);            
        end

        % synthesize/generate signal of given freq
        sig = synth(freq, dur, amp, Fs, synthtype);

        N = length(sig);

        % augment note to whole signal
        melSignal(n1:n1+N-1) = melSignal(n1:n1+N-1) + reshape(sig,1,[]);

        % update status
        waitbar(i/numTimePoints, h);

    end

    close(h);

end

The underlying logic behind this code is the following: at each time-stamp, I synthesize a short-lived wave (say a sine wave) with frequency equal to the detected dominant pitch at that time-stamp, lasting for the gap between consecutive time-stamps in the input melody matrix. I only wonder if I am doing this right.

Then I take the audio signal I get from this function and play it alongside the original song (melody on the left channel, original song on the right channel). Though the generated audio signal seems to segment the melody-generating sources (voice/lead instrument) fairly well -- it's active where the voice is and zero everywhere else -- the signal itself is far from the humming (I get something like beep beep beeeeep beep beeep beeeeeeeep) that the authors show on their website. Specifically, below is a visualization showing the time-domain signal of the input song at the bottom and the time-domain signal of the melody generated using my function.

Time-domain signals of the generated melody and of the input song

One main issue is -- though I am given the frequency of the wave to generate at each time-stamp and also its duration, I don't know how to set the amplitude of the wave. For now, I set the amplitude to a constant value, and I suspect this is where the problem is.

Does anyone have any suggestions on this? I welcome suggestions in any programming language (preferably MATLAB, Python, or C++), but I guess my question here is more general --- how should I generate the wave at each time-stamp?

A few ideas/fixes in my mind:

  1. Set the amplitude by getting an averaged/max estimate of the amplitude from the time-domain signal of the original song.
  2. Totally change my approach --- compute the spectrogram/short-time Fourier transform of the song's audio signal, zero out (hard cut-off) or attenuate (soft cut-off) all frequencies except the ones in, or close to, my pitch-track, and then compute the inverse short-time Fourier transform to get the time-domain signal back (a rough sketch of this idea follows this list).
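
A rough sketch of idea 2 (Python/NumPy, assuming a mono signal x at sampling rate fs and the Nx2 melody matrix described above; the function name and the 50 Hz tolerance are placeholders I picked for illustration):

import numpy as np
from scipy.signal import stft, istft

def mask_around_pitch(x, fs, melody, tol_hz=50.0, nperseg=2048):
    melody = np.asarray(melody, dtype=float)
    f, t, Z = stft(x, fs=fs, nperseg=nperseg)
    # pitch estimate at every STFT frame time (<= 0 means unvoiced)
    pitch = np.interp(t, melody[:, 0], melody[:, 1], left=0.0, right=0.0)
    mask = np.zeros(Z.shape)
    for j, f0 in enumerate(pitch):
        if f0 > 0:
            mask[np.abs(f - f0) <= tol_hz, j] = 1.0   # keep only bins near the pitch
    # back to the time domain with everything else zeroed out
    _, x_masked = istft(Z * mask, fs=fs, nperseg=nperseg)
    return x_masked
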
Melanoma answered 17/3, 2013 at 17:23 Comment(2)
You could create a MIDI file that has the same pitches/pitch bends/durations/times as your melody, pick a nice instrument for it and render it in a program/library of your choice. Alternatively, you could give each note an amplitude envelope that starts strong (either building up from 0 quickly or starting strong), diminishes to a quiet held amount and trails off at the end. This is called an ADSR envelope (a quick sketch follows these comments).Doncaster
Somehow my new side table as a home project isn't that impressive. -1 for making me feel like Igor the caveman (just kidding).Nubble
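
For reference, a minimal sketch of the ADSR-envelope idea from the comment above (Python/NumPy; the attack/decay/sustain/release values are arbitrary choices):

import numpy as np

def adsr_envelope(n, fs, attack=0.01, decay=0.05, sustain=0.6, release=0.05):
    # piecewise-linear attack-decay-sustain-release envelope, n samples long
    a, d, r = int(attack * fs), int(decay * fs), int(release * fs)
    s = max(n - a - d - r, 0)
    env = np.concatenate([
        np.linspace(0, 1, a, endpoint=False),        # attack: ramp up
        np.linspace(1, sustain, d, endpoint=False),  # decay: drop to the sustain level
        np.full(s, sustain),                         # sustain: hold
        np.linspace(sustain, 0, r),                  # release: trail off
    ])
    return env[:n]

# e.g. shape one synthesized note of n samples at pitch f0:
# note = np.sin(2 * np.pi * f0 * np.arange(n) / fs) * adsr_envelope(n, fs)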

Though I don't have access to your synth() function, based on the parameters it takes I'd say your problem is that you're not handling the phase.

That is - it is not enough to concatenate waveform snippets together, you must ensure that they have continuous phase. Otherwise, you're creating a discontinuity in the waveform every time you concatenate two waveform snippets. If this is the case, my guess is that you're hearing the same frequency all the time and that it sounds more like a sawtooth than a sinusoid - am I right?

The solution is to set the starting phase of snippet n to the end phase of snippet n-1. Here's an example of how you would concatenate two waveforms with different frequencies without creating a phase discontinuity:

fs = 44100; % sampling frequency
f1 = 440;   % example frequency for the first snippet
f2 = 660;   % example frequency for the second snippet

% synthesize a cosine waveform with frequency f1 and starting phase p1
p1 = 0;
dur1 = 1;
t1 = 0:1/fs:dur1;

x1 = 0.5*cos(2*pi*f1*t1 + p1);

% Compute the phase at the end of the first waveform
p2 = mod(2*pi*f1*dur1 + p1, 2*pi);

dur2 = 1;
t2 = 0:1/fs:dur2;
x2 = 0.5*cos(2*pi*f2*t2 + p2); % use p2 so that the phase is continuous!

x3 = [x1 x2]; % this should give you a waveform without any discontinuities

Note that whilst this gives you a continuous waveform, the frequency transition is instantaneous. If you want the frequency to gradually change from time_n to time_n+1 then you would have to use something more complex like McAulay-Quatieri interpolation. But in any case, if your snippets are short enough this should sound good enough.
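
To apply this to a whole pitch track you can simply carry the phase forward from each snippet to the next. A minimal sketch (Python/NumPy rather than MATLAB, purely for brevity; it assumes the Nx2 melody matrix from your question and leaves unvoiced segments silent):

import numpy as np

def pitch_track_to_audio(melody, fs=44100, amp=0.5):
    melody = np.asarray(melody, dtype=float)
    out = np.zeros(int(np.ceil(melody[-1, 0] * fs)))
    phase = 0.0      # running phase, carried from snippet to snippet
    prev_t = 0.0
    for t, f0 in melody:
        n0, n1 = int(round(prev_t * fs)), int(round(t * fs))
        if n1 > n0 and f0 > 0:
            ph = phase + 2 * np.pi * f0 * np.arange(n1 - n0) / fs
            out[n0:n1] = amp * np.cos(ph)
            phase = (ph[-1] + 2 * np.pi * f0 / fs) % (2 * np.pi)   # end phase becomes the next start phase
        prev_t = t
    return out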

Regarding other comments, if I understand correctly your goal is just to be able to hear the frequency sequence, not for it to sound like the original source. In this case, the amplitude is not that important and you can keep it fixed.

If you wanted to make it sound like the original source that's a whole different story and probably beyond the scope of this discussion.

Hope this answers your question!

Fernandes answered 18/3, 2013 at 16:35 Comment(4)
@justin Thanks a lot for the solution. This was indeed the problem, and after the fix it sounds way better. But I feel I need to plug in a better amplitude to be more realistic, though that's slightly out of scope for my question in this post. I read that the perceived amplitude depends on the frequency (the higher the frequency, the higher the perceived amplitude). I wonder if I can find some mathematical model for this dependence, so I can alter the amplitude based on the dominant pitch/frequency. Maybe that will sound even better.Melanoma
Also, can you elaborate on the McAulay-Quatieri interpolation or point me to a simpler article -- I still hear some chirpiness at the point where the signal transitions from voiced to unvoiced, even after applying some smoothing.Melanoma
Changing the amplitude of the sinusoid will only make a minor difference to the perceived realisticness (is that a word?) of the synthesized signal - it will still basically sound like a single sinusoid. If you want it to sound like the original source then you have two options: either obtain a synthesizer for your source (voice/instrument) and use the f0 sequence to guide the synthesis, or use a source separation algorithm instead of an f0 estimation algorithm to directly separate the signal of the lead source (at least attempt to do so, this is still an open research problem).Fernandes
For source separation you could try Melodyne (there's a free trial version I think), or the code by J.-L. Durrieu if you're looking for something open source: durrieu.ch/research/jstsp2010.html. McAulay-Quatieri interpolation is only useful for interpolating between two non-zero frequency and amplitude values (a1,f1) --> (a2,f2); it won't help you at the boundaries between voiced and unvoiced segments. For that you should just smooth the attack. In any case, since you're using a single sinusoid (purely tonal) the attack will never sound completely "natural".Fernandes

If I understand correctly, you seem to already have an accurate representation of the pitch but your problem is that what you generate just doesn't "sound good enough".

Starting with your second approach: filtering out anything but the pitch isn't going to lead to anything good. By removing everything but a few frequency bins corresponding to your local pitch estimates, you will lose the texture of the input signal, which is what makes it sound good. In fact, if you took that to an extreme and removed everything but the one bin corresponding to the pitch and took an inverse FFT, you would get exactly a sinusoid, which is what you are doing currently. If you wanted to do this anyway, I recommend you perform all of it by applying a filter to your time-domain signal rather than going in and out of the frequency domain, which is more expensive and cumbersome. The filter would have a narrow passband around the frequency you want to keep, and that would allow for a sound with better texture as well.
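
A crude per-frame sketch of that filtering (Python/SciPy, assuming a mono signal x at rate fs and the Nx2 melody matrix; the 60 Hz bandwidth is arbitrary, and filtering each frame independently introduces its own boundary artifacts):

import numpy as np
from scipy.signal import butter, sosfilt

def bandpass_follow_pitch(x, fs, melody, bw_hz=60.0):
    melody = np.asarray(melody, dtype=float)
    out = np.zeros(len(x))
    prev_t = 0.0
    for t, f0 in melody:
        n0, n1 = int(prev_t * fs), min(int(t * fs), len(x))
        if n1 > n0 and f0 > 0:
            lo = max(f0 - bw_hz / 2, 1.0) / (fs / 2)          # normalized band edges
            hi = min(f0 + bw_hz / 2, fs / 2 - 1.0) / (fs / 2)
            sos = butter(2, [lo, hi], btype='bandpass', output='sos')
            out[n0:n1] = sosfilt(sos, x[n0:n1])               # keep only a narrow band around the pitch
        prev_t = t
    return out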

However, if you already have pitch and duration estimates that you are happy with but want to improve the sound rendering, I suggest that you replace your sine waves--which will always sound like a silly beep-beep no matter how much you massage them--with actual humming (or violin or flute or whatever you like) samples, one for each frequency in the scale. If memory is a concern, or if the songs you represent do not fall into a well-tempered scale (think Middle Eastern songs, for example), instead of having a humming sample for each note of the scale you could keep humming samples for only a few frequencies. You would then derive the humming sound at any frequency by doing a sample-rate conversion from one of these samples. Having a few samples to pick from would let you choose the one that leads to the "best" ratio with the frequency you need to produce, since the complexity of the sample-rate conversion depends on that ratio. Obviously, adding a sample-rate conversion is more work and more computationally demanding than just having a bank of samples to pick from.
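
As a rough sketch of the sample-rate-conversion step (Python/SciPy): given one recorded humming sample hum whose pitch f_ref you already know, resampling by f_ref/f_target shifts it to the pitch you need (and shortens or lengthens it as a side effect, so you would also trim or loop it to the required duration). All names here are just for illustration.

import numpy as np
from scipy.signal import resample

def hum_at_pitch(hum, f_ref, f_target):
    n_new = int(round(len(hum) * f_ref / f_target))
    return resample(hum, n_new)   # fewer samples played back at the same rate -> higher pitch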

Using a bank of real samples will make a big difference in the quality of what you render. It will also allow you to have realistic attacks for each new note you play.

Then yes, like you suggest, you may want to also play with the amplitude by following the instantaneous amplitude of the input signal to produce a more nuanced rendering of the song.
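
For instance, a minimal sketch of that (Python/NumPy, assuming the original mono signal x and your synthesized melSignal share the same sampling rate; the frame and hop sizes are arbitrary): compute a frame-wise RMS envelope of the original and multiply it into the synthesized signal.

import numpy as np

def apply_input_envelope(x, melSignal, frame=2048, hop=512):
    # frame-wise RMS envelope of the original signal x
    starts = np.arange(0, len(x) - frame, hop)
    rms = np.array([np.sqrt(np.mean(x[i:i + frame] ** 2)) for i in starts])
    # interpolate to one envelope value per output sample and normalize
    env = np.interp(np.arange(len(melSignal)), starts + frame // 2, rms)
    return melSignal * env / (env.max() + 1e-12)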

Last, I would also play with the duration estimates you have so that you get smoother transitions from one sound to the next. Judging from your rendition of your audio file, which I enjoyed very much (beep beep beeeeep beep beeep beeeeeeeep), and the graph that you display, it looks like you have many interruptions inserted into the rendering of your song. You could avoid this by extending the duration estimates to get rid of any silence that is shorter than, say, 0.1 seconds. That way you would preserve the real silences from the original song but avoid cutting off each note of your song.
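
A sketch of that clean-up applied directly to the pitch track (Python/NumPy, assuming the Nx2 melody matrix from the question; the 0.1-second threshold is just the ballpark value mentioned above): any unvoiced run shorter than the threshold is filled by holding the previous pitch.

import numpy as np

def close_short_gaps(melody, max_gap=0.1):
    out = np.array(melody, dtype=float)
    i = 0
    while i < len(out):
        if out[i, 1] <= 0:                                  # start of an unvoiced run
            j = i
            while j < len(out) and out[j, 1] <= 0:
                j += 1
            gap = out[min(j, len(out) - 1), 0] - out[i, 0]  # duration of the run
            if 0 < i and j < len(out) and gap < max_gap:
                out[i:j, 1] = out[i - 1, 1]                 # hold the previous pitch through the gap
            i = j
        else:
            i += 1
    return out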

Darra answered 18/3, 2013 at 4:42 Comment(4)
Thanks a lot for your critique and suggestions. The issue was what Salamon pointed out: I was not making sure that consecutive sine waves continue in phase, thereby introducing a sharp discontinuity between them, so the signal wasn't sounding right. After the fix it sounds much better.Melanoma
On a tangential note about replacing the sine wave with a more realistic signal by using a bank of samples from a real instrument --- is such a bank of samples available, or are there mathematical models for any instruments? Apparently, what defines the uniqueness of an instrument's sound is its overtones or harmonics --- integer multiples of the fundamental frequency. I read that it's the ratio between the amplitudes of these harmonics and the fundamental frequency f that characterizes an instrument's sound. The pitch track, however, seems to be a continuum of frequencies rather than discrete notes.Melanoma
@Melanoma Yes and no. Yes, these ratios you mention tend to be similar for instruments of the same type, but they still vary with the specific instrument, within each note, and with how hard or softly a note is played. And yes, energy tends to accumulate around integer multiples of the fundamental frequency, but it is by no means easily reduced to them: this illustration speaks for itself: kozco.com/tech/audacity/piano_G1.jpg. So if no bank of samples is at hand, adding harmonics will definitely sound better but will remain far from sounding natural.Darra
@Melanoma And understood about your first comment: sharp discontinuities are killers. I didn't (and still don't) see it as the issue described in your original post. For the record, many sound editing tools address that problem with another approach: instead of adjusting the phase, which is usually impossible except for near-sinusoid signals, they use cross-fading (quickly ramping down the amplitude of the previous signal while ramping up the new one).Darra

You have at least 2 problems.

First, as you surmised, your analysis has thrown away all the amplitude information of the melody portion of the original spectrum. You will need an algorithm that captures that information (and not just the amplitude of the entire signal for polyphonic input, or that of just the FFT pitch bin for any natural musical sounds). This is a non-trivial problem, somewhere between melodic pitch extraction and blind source separation.

Second, sound has timbre, including overtones and envelopes, even at a constant frequency. Your synthesis method is only creating a single sinewave, while humming probably creates a bunch of more interesting overtones, including a lot of higher frequencies than just the pitch. For a slightly more natural sound, you could try analyzing the spectrum of yourself humming a single pitch and try to recreate all those dozens of overtone sine waves, instead of just one, each at the appropriate relative amplitude, when synthesizing each frequency time-stamp in your analysis. You could also look at the amplitude envelope over time of yourself humming one short note, and use that envelope to modulate the amplitude of your synthesizer.
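
As a starting point, a minimal sketch of the overtone idea (Python/NumPy; the relative harmonic amplitudes below are made-up placeholders that you would replace with ratios measured from a recording of your own humming):

import numpy as np

def harmonic_note(f0, dur, fs=44100, harmonic_amps=(1.0, 0.5, 0.25, 0.12, 0.06)):
    t = np.arange(int(dur * fs)) / fs
    note = np.zeros_like(t)
    for k, a in enumerate(harmonic_amps, start=1):
        if k * f0 < fs / 2:                           # keep partials below Nyquist
            note += a * np.sin(2 * np.pi * k * f0 * t)
    return note / max(sum(harmonic_amps), 1e-12)      # rough peak normalization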

Granular answered 18/3, 2013 at 15:33 Comment(2)
The suggestion to analyze the spectrum of myself/a human humming a single pitch was interesting. But I believe I can probably only do this for notes in a couple of octaves and analyze them. Then I will need a mathematical model which will allow me to get (by interpolating or extrapolating) the signal at any continuous frequency, which is what the pitch track of a melody extraction algorithm seems to give me. Does such a mathematical model exist? I am new to this field, so it would be helpful if you could point me to such literature.Melanoma
The literature on vocal formants may cover some of the models you might want to try.Granular

Use libfmp.c8 to sonify the values:

import IPython.display as ipd
import numpy as np
import pandas as pd
import vamp
import libfmp.b
import libfmp.c8

# audio: mono input signal, sr: its sampling rate, params: optional Melodia parameters
data = vamp.collect(audio, sr, "mtg-melodia:melodia", parameters=params)
hop, melody = data['vector']
timestamps = np.arange(0, len(melody)) * float(hop)
melody_pos = melody.copy()
melody_pos[melody <= 0] = 0   # zero out negative (unvoiced) values
df = pd.DataFrame({'time': timestamps, 'frequency': pd.Series(melody_pos)})
traj = df.values
x_traj_mono = libfmp.c8.sonify_trajectory_with_sinusoid(traj, len(audio), sr, smooth_len=50, amplitude=0.8)
ipd.display(ipd.Audio(x_traj_mono + audio, rate=sr))  # sonified trajectory mixed with the original audio
Dickens answered 7/1, 2022 at 19:9 Comment(1)
Your answer could be improved with additional supporting information. Please edit to add further details, such as citations or documentation, so that others can confirm that your answer is correct. You can find more information on how to write good answers in the help center.Umbrageous
