python, matlab, signal-processing, pitch-tracking, audio-analysis

How to convert a pitch track from a melody extraction algorithm to a humming-like audio signal


As part of a fun at-home research project, I am trying to find a way to reduce/convert a song to a humming-like audio signal (the underlying melody that we humans perceive when we listen to a song). Before I proceed any further in describing my attempt at this problem, I would like to mention that I am totally new to audio analysis, though I have a lot of experience with analyzing images and videos.

After googling a bit, I found a bunch of melody extraction algorithms. Given a polyphonic audio signal of a song (e.g., a .wav file), they output a pitch track --- at each point in time they estimate the dominant pitch (coming from a singer's voice or some melody-generating instrument) and track the dominant pitch over time.

I read a few papers, and they seem to compute a short-time Fourier transform of the song and then do some analysis on the spectrogram to get and track the dominant pitch. Melody extraction is only a component in the system I am trying to develop, so I don't mind using any algorithm that's available, as long as it does a decent job on my audio files and the code is available. Since I am new to this, I would be happy to hear any suggestions on which algorithms are known to work well and where I can find their code.
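
For reference, the kind of spectrogram these papers operate on can be computed with the spectrogram function from MATLAB's Signal Processing Toolbox. Below is a minimal sketch with arbitrary window settings; song.wav is a hypothetical file name:

[x, Fs] = audioread('song.wav');   % hypothetical input file
x = mean(x, 2);                    % mix down to mono

win = round(0.046*Fs);             % ~46 ms analysis window
[S, F, T] = spectrogram(x, hann(win), round(0.75*win), 4*win, Fs);

imagesc(T, F, 20*log10(abs(S) + eps)); axis xy;
xlabel('Time (s)'); ylabel('Frequency (Hz)');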

I found two algorithms:

  1. YAAPT pitch tracking
  2. Melodia

I chose Melodia, as its results on different music genres looked quite impressive. Please check this to see its results. The humming that you hear for each piece of music is essentially what I am interested in.

"It is the generation of this humming for any arbitrary song, that I want your help with in this question".

The algorithm (available as a vamp plugin) outputs a pitch track --- [time_stamp, pitch/frequency] --- an Nx2 matrix where the first column is the time-stamp (in seconds) and the second column is the dominant pitch detected at the corresponding time-stamp. Shown below is a visualization of the pitch track obtained from the algorithm, overlaid in purple on a song's time-domain signal (above) and its spectrogram/short-time Fourier transform (below). Negative values of pitch/frequency represent the algorithm's dominant pitch estimate for un-voiced/non-melodic segments. So all pitch estimates >= 0 correspond to the melody; the rest are not important to me.

Pitch-track overlay with a song's waveform and spectrogram
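
As a concrete illustration, here is a minimal sketch of loading such a track and masking out the non-melodic estimates, assuming the plugin output has been exported to a two-column CSV file (melody.csv is a hypothetical name):

% load the [time_stamp, frequency] pairs exported from the vamp plugin
melody = csvread('melody.csv');    % hypothetical file name

% negative frequencies mark un-voiced/non-melodic segments;
% clamp them to zero so they synthesize silence later on
melody(:,2) = max(melody(:,2), 0);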

Now I want to convert this pitch track back to a humming-like audio signal -- just like the authors have on their website.

Below is a MATLAB function that I wrote to do this:

function [melSignal] = melody2audio(melody, varargin)
% melSignal = melody2audio(melody, 'Fs', Fs, 'synthtype', synthtype)
% melSignal = melody2audio(melody, 'Fs', Fs)
% melSignal = melody2audio(melody)
%
% Convert melody/pitch-track to a time-domain signal
%
% Inputs:
%
%     melody - [time-stamp, dominant-frequency] 
%           an Nx2 matrix with time-stamp in the 
%           first column and the detected dominant 
%           frequency at corresponding time-stamp
%           in the second column. 
% 
%     synthtype - string to choose synthesis method
%      passed to synth function in synth.m
%      current choices are: 'fm', 'sine' or 'saw'
%      default='fm'
% 
%     Fs - sampling frequency in Hz 
%       default = 44.1e3
%
%     amp - amplitude of the synthesized wave
%       default = 60/127
%
%   Output:
%   
%     melSignal -- time-domain representation of the 
%                  melody. When you play this, you 
%                  are supposed to hear a humming
%                  of the input melody/pitch-track
% 

    p = inputParser;
    p.addRequired('melody', @isnumeric);
    p.addParamValue('Fs', 44100, @(x) isnumeric(x) && isscalar(x));
    p.addParamValue('synthtype', 'fm', @(x) ismember(x, {'fm', 'sine', 'saw'}));
    p.addParamValue('amp', 60/127,  @(x) isnumeric(x) && isscalar(x));
    p.parse(melody, varargin{:});

    parameters = p.Results;

    % get parameter values
    Fs = parameters.Fs;
    synthtype = parameters.synthtype;
    amp = parameters.amp;

    % generate melody
    numTimePoints = size(melody,1);
    endtime = melody(end,1);
    melSignal = zeros(1, ceil(endtime*Fs));

    h = waitbar(0, 'Generating Melody Audio' );

    for i = 1:numTimePoints

        % frequency
        freq = max(0, melody(i,2));

        % duration
        if i > 1
            n1 = floor(melody(i-1,1)*Fs)+1;
            dur = melody(i,1) - melody(i-1,1);
        else
            n1 = 1;
            dur = melody(i,1);            
        end

        % synthesize/generate signal of given freq
        sig = synth(freq, dur, amp, Fs, synthtype);

        N = length(sig);

        % guard: rounding can make a snippet run a sample or two
        % past the pre-allocated signal length
        N = min(N, length(melSignal) - n1 + 1);

        % augment note to whole signal
        melSignal(n1:n1+N-1) = melSignal(n1:n1+N-1) + reshape(sig(1:N),1,[]);

        % update status
        waitbar(i/numTimePoints, h);

    end

    close(h);

end
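
A hypothetical usage, assuming melody already holds the Nx2 pitch track loaded earlier:

melSignal = melody2audio(melody, 'Fs', 44100, 'synthtype', 'sine');
soundsc(melSignal, 44100);                                     % listen to the result
audiowrite('hum.wav', melSignal./max(abs(melSignal)), 44100);  % hypothetical output file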

The underlying logic behind this code is the following: at each time-stamp, I synthesize a short-lived wave (say, a sine wave) with frequency equal to the detected dominant pitch/frequency at that time-stamp, spanning the gap between the previous time-stamp in the input melody matrix and the current one. I wonder if I am doing this right.

Then I take the audio signal I get from this function and play it alongside the original song (melody on the left channel and original song on the right channel). Though the generated audio signal seems to segment the melody-generating sources (voice/lead instrument) fairly well --- it's active where the voice is and zero everywhere else --- the signal itself is far from being a humming (I get something like beep beep beeeeep beep beeep beeeeeeeep) like the one the authors show on their website. Specifically, below is a visualization showing the time-domain signal of the input song at the bottom and the time-domain signal of the melody generated using my function.

Time-domain signal of the generated melody (top) and the input song (bottom)

One main issue is this: though I am given the frequency of the wave to generate at each time-stamp and also the duration, I don't know how to set the amplitude of the wave. For now, I set the amplitude to a flat/constant value, and I suspect this is where the problem is.

Does anyone have any suggestions on this? I welcome suggestions in any programming language (preferably MATLAB, Python, C++), but I guess my question here is more general --- how to generate the wave at each time-stamp?

A few ideas/fixes in my mind:

  1. Set the amplitude by getting an averaged/max estimate of the amplitude from the time-domain signal of the original song (see the sketch after this list).
  2. Totally change my approach --- compute the spectrogram/short-time Fourier transform of the song's audio signal, zero out (hard cut-off) or softly attenuate all frequencies except the ones in my pitch track (or close to it), and then compute the inverse short-time Fourier transform to get the time-domain signal.
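
For idea 1, here is a minimal sketch of estimating the amplitude from the local RMS energy of the original song around each time-stamp; song (the original mono signal) and t (the snippet's time-stamp in seconds) are hypothetical names, while freq, dur, Fs, and synthtype are the variables from melody2audio above:

winLen = round(0.05*Fs);                        % ~50 ms analysis window
n      = max(1, round(t*Fs));
idx    = n : min(n + winLen - 1, length(song));
amp    = sqrt(mean(song(idx).^2));              % local RMS as amplitude
sig    = synth(freq, dur, amp, Fs, synthtype);  % same synth call as before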

Solution

  • Though I don't have access to your synth() function, based on the parameters it takes, I'd say your problem is that you're not handling the phase.

    That is, it is not enough to concatenate waveform snippets together; you must ensure that they have continuous phase. Otherwise, you're creating a discontinuity in the waveform every time you concatenate two snippets. If this is the case, my guess is that you're hearing the same frequency all the time and that it sounds more like a sawtooth than a sinusoid --- am I right?

    The solution is to set the starting phase of snippet n to the end phase of snippet n-1. Here's an example of how you would concatenate two waveforms with different frequencies without creating a phase discontinuity:

    fs = 44100; % sampling frequency
    
    f1 = 440;   % frequency of the first snippet (example value)
    f2 = 880;   % frequency of the second snippet (example value)
    
    % synthesize a cosine waveform with frequency f1 and starting additional phase p1
    p1 = 0;
    dur1 = 1;
    t1 = 0:1/fs:dur1;
    
    x1 = 0.5*cos(2*pi*f1*t1 + p1);
    
    % compute the phase at the end of the waveform
    p2 = mod(2*pi*f1*dur1 + p1, 2*pi);
    
    dur2 = 1;
    t2 = 0:1/fs:dur2;
    x2 = 0.5*cos(2*pi*f2*t2 + p2); % use p2 so that the phase is continuous!
    
    x3 = [x1 x2]; % this should give you a waveform without any discontinuities
    

    Note that whilst this gives you a continuous waveform, the frequency transition is instantaneous. If you want the frequency to change gradually from time_n to time_n+1, then you would have to use something more complex, like McAulay-Quatieri interpolation. But in any case, if your snippets are short enough, this should sound good enough.
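
    As a lighter-weight alternative to full McAulay-Quatieri interpolation, you could sweep the frequency linearly across each snippet and integrate the phase sample by sample, so that both frequency and phase stay continuous. A sketch, with hypothetical start/end frequencies f1 and f2:

    fs  = 44100;
    f1  = 220; f2 = 330;          % hypothetical start/end frequencies
    dur = 0.5;
    n   = round(dur*fs);
    
    f   = linspace(f1, f2, n);    % linear frequency ramp
    phi = 2*pi*cumsum(f)/fs;      % integrate instantaneous frequency into phase
    x   = 0.5*cos(phi);           % chirp with no phase discontinuity
    
    % carry mod(phi(end), 2*pi) over as the starting phase of the next snippet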

    Regarding other comments, if I understand correctly, your goal is just to be able to hear the frequency sequence, not for it to sound like the original source. In this case, the amplitude is not that important and you can keep it fixed.

    If you wanted to make it sound like the original source that's a whole different story and probably beyond the scope of this discussion.

    Hope this answers your question!