python, signal-processing, fft, linear-interpolation, pitch-detection

Change the melody of human speech using FFT and polynomial interpolation


I'm trying to do the following:

  1. Extract the melody of me asking a question (the word "Hey?" recorded to a WAV file) so I get a melody pattern that I can apply to any other recorded/synthesized speech (basically, how F0 changes over time).
  2. Use polynomial interpolation (Lagrange?) so I get a function that describes the melody (approximately of course).
  3. Apply the function to another recorded voice sample (e.g. the word "Hey." so it's transformed into the question "Hey?", or transform the end of a sentence to sound like a question [e.g. "Is it ok." => "Is it ok?"]). Voila, that's it - see the sketch right below.
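
To make the plan concrete, here's a minimal sketch of steps 1 and 2 (pyin and polyfit are real librosa/numpy calls; the function names and the fit degree are placeholders of mine, and step 3 is the open question):

import librosa
import numpy as np

def extract_f0(wav_path):
    # Step 1: estimate the F0-in-time contour; unvoiced frames come back as NaN.
    y, sr = librosa.load(wav_path)
    f0, voiced_flag, voiced_probs = librosa.pyin(
        y, fmin=librosa.note_to_hz('C2'), fmax=librosa.note_to_hz('C7'))
    return f0

def fit_melody(f0):
    # Step 2: approximate the contour with a polynomial over the voiced frames.
    frames = np.arange(len(f0))
    voiced = np.isfinite(f0)
    return np.poly1d(np.polyfit(frames[voiced], f0[voiced], 3))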

What have I done? Where am I? Firstly, I dived into the math behind the FFT and signal processing (the basics). I want to do it programmatically, so I decided to use Python.

I performed the FFT on the entire "Hey?" voice sample and got data in the frequency domain (please don't mind the y-axis units, I haven't normalized them):

[image: FFT of the whole "Hey?" sample]

So far so good. Then I decided to divide my signal into chunks to get clearer frequency information - peaks and so on. This was a blind shot, me trying to grasp the idea of manipulating the frequency and analyzing the audio data. It got me nowhere, however - at least not in the direction I want.

[image: FFT of a single chunk of the sample]

Now, if I took those peaks, got an interpolated function from them, applied that function to another voice sample (a part of a voice sample that is also FFTed, of course) and performed the inverse FFT, I wouldn't get what I want, right? I would only change the magnitudes, so it wouldn't affect the melody itself (I think so).
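
That's easy to sanity-check with a toy tone (my example, not from my project): scaling the FFT magnitudes and inverting changes the loudness, not the pitch:

import numpy as np

fs = 22050
t = np.arange(fs) / fs
tone = np.sin(2 * np.pi * 220 * t)  # 220 Hz sine, 1 second

spectrum = np.fft.rfft(tone)
spectrum *= 3.0                      # scale the magnitudes only
tone_scaled = np.fft.irfft(spectrum, n=len(tone))

freqs = np.fft.rfftfreq(len(tone), d=1.0 / fs)
print(freqs[np.argmax(np.abs(np.fft.rfft(tone)))])         # 220.0
print(freqs[np.argmax(np.abs(np.fft.rfft(tone_scaled)))])  # still 220.0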

Then I used the spectrogram (stft) and pyin methods from librosa to extract the real F0-in-time - the melody of asking the question "Hey?". And as we would expect, we can clearly see an increase in frequency:

[image: spectrogram of "Hey?" with the F0-in-time course]

And a non-question statement looks like this - let's say it's more or less constant:

[image: spectrogram of a non-question statement]

The same applies to a longer speech sample:

[image: spectrogram of a monotonous speech sample]

Now, I assume that I have the blocks to build my algorithm/process, but I still don't know how to assemble them because there are some blanks in my understanding of what's going on under the hood.

I reckon that I need to find a way to map the F0-in-time curve from the spectrogram to the "pure" FFT data, get an interpolated function from it, and then apply that function to another voice sample.
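
One concrete direction for that mapping (a sketch of mine; f0 is the pyin output from the spectrogram code below, and sr/n_fft are assumed to match the STFT settings): convert each F0 value in Hz to the nearest FFT bin index:

import numpy as np
import librosa

sr, n_fft = 44100, 2048  # assumed to match the STFT settings
bin_freqs = librosa.fft_frequencies(sr=sr, n_fft=n_fft)

# Nearest FFT bin for each voiced frame; unvoiced (NaN) frames stay NaN.
f0_bins = np.array([np.argmin(np.abs(bin_freqs - v)) if np.isfinite(v) else np.nan
                    for v in f0])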

Is there any elegant (inelegant would be OK too) way to do this? I need to be pointed in the right direction because I can feel I'm close, but I'm basically stuck.

The code behind the above charts is taken straight from the librosa docs and other Stack Overflow questions; it's just a draft/POC, so please don't comment on style, if you could :)

fft in chunks:

import numpy as np
import matplotlib.pyplot as plt
from scipy.io import wavfile
import os

file = os.path.join("dir", "hej_n_nat.wav")
fs, signal = wavfile.read(file)

CHUNK = 1024

# Magnitude spectrum of the first chunk (assumes a mono WAV)
afft = np.abs(np.fft.fft(signal[0:CHUNK]))

# Frequency of each FFT bin, so the x-axis lines up with afft
freqs = np.fft.fftfreq(CHUNK, d=1.0 / fs)

# Plot spectral analysis
plt.plot(freqs[0:250], afft[0:250])
plt.show()

spectrogram:

import librosa.display
import numpy as np
import matplotlib.pyplot as plt
import os

file = os.path.join("/path/to/dir", "hej_n_nat.wav")
y, sr = librosa.load(file, sr=44100)
f0, voiced_flag, voiced_probs = librosa.pyin(y, fmin=librosa.note_to_hz('C2'), fmax=librosa.note_to_hz('C7'))


times = librosa.times_like(f0)
D = librosa.amplitude_to_db(np.abs(librosa.stft(y)), ref=np.max)

fig, ax = plt.subplots()
img = librosa.display.specshow(D, x_axis='time', y_axis='log', ax=ax)
ax.set(title='pYIN fundamental frequency estimation')
fig.colorbar(img, ax=ax, format="%+2.f dB")
ax.plot(times, f0, label='f0', color='cyan', linewidth=2)
ax.legend(loc='upper right')
plt.show()

Hints, questions and comments much appreciated.


Solution

  • The problem was that I didn't know how to modify the fundamental frequency (F0) - and by modifying it I mean modifying F0 and its harmonics as well.

    The spectrograms in question show, for each point in time, the power (dB) at each frequency. Since I know which time bin holds which frequency of the melody (the green line below)...

    [image: spectrogram with the F0 melody drawn as a green line]

    ...I need to compute a function that represents that green line, so I can apply it to other speech samples.

    So I need to use some interpolation method which takes the sampled F0 points as parameters.

    One needs to remember that for the polynomial to pass exactly through n points, its degree has to be n - 1. The example unfortunately doesn't do that (it uses a low-degree least-squares fit instead), but the effect is somewhat OK for a prototype.
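
    A toy comparison showing the difference (made-up F0 values of mine):

    import numpy as np

    pts_t = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
    pts_f0 = np.array([100.0, 110.0, 140.0, 200.0, 300.0])  # made-up F0 values

    exact = np.poly1d(np.polyfit(pts_t, pts_f0, deg=len(pts_t) - 1))  # degree 4, hits every point
    approx = np.poly1d(np.polyfit(pts_t, pts_f0, deg=3))              # degree 3, only approximates

    print(exact(pts_t) - pts_f0)   # ~zeros
    print(approx(pts_t) - pts_f0)  # small residuals

    The actual pattern-calculation code: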

    def _get_bin_nr(val, bins):
        # Find the index of the FFT frequency bin that contains `val` (in Hz);
        # NaN input matches no bin and falls through to return NaN.
        the_bin_no = np.nan
        for b in range(0, bins.size - 1):
            if bins[b] <= val < bins[b + 1]:
                the_bin_no = b
            elif val > bins[bins.size - 1]:
                the_bin_no = bins.size - 1
        return the_bin_no
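
    # Side note (not part of the project): np.digitize can do this lookup without
    # the Python loop, for values that fall inside the bin range:
    #   the_bin_no = np.digitize(val, bins) - 1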
    
    def calculate_pattern_poly_coeff(file_name):
        # ROOT_DIR, sr and n_fft are module-level constants in the project.
        y_source, sr_source = librosa.load(os.path.join(ROOT_DIR, file_name), sr=sr)
        f0_source, voiced_flag, voiced_probs = librosa.pyin(y_source, fmin=librosa.note_to_hz('C2'),
                                                            fmax=librosa.note_to_hz('C7'), pad_mode='constant',
                                                            center=True, frame_length=4096, hop_length=512, sr=sr_source)
        all_freq_bins = librosa.core.fft_frequencies(sr=sr, n_fft=n_fft)
        # Map each voiced F0 value to its bin index; NaN (unvoiced) frames are dropped.
        f0_freq_bins = list(filter(lambda x: np.isfinite(x), map(lambda val: _get_bin_nr(val, all_freq_bins), f0_source)))

        return np.polynomial.polynomial.polyfit(np.arange(0, len(f0_freq_bins), 1), f0_freq_bins, 3)

    def calculate_pattern_poly_func(coefficients):
        # np.polynomial.polynomial.polyfit returns coefficients lowest-degree first,
        # while np.poly1d expects highest-degree first, hence the [::-1].
        return np.poly1d(coefficients[::-1])
    

    The method calculate_pattern_poly_coeff calculates the polynomial coefficients of the melody pattern.
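
    Usage could look like this (a sketch; 'hey_question.wav' is a hypothetical file name, and this is presumably how the _question_pattern function used further below gets built):

    coeffs = calculate_pattern_poly_coeff('hey_question.wav')  # hypothetical file
    _question_pattern = calculate_pattern_poly_func(coeffs)

    print(_question_pattern(10))  # approximate F0 bin at time frame 10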

    Using Python's poly1d I can compute a function which can modify the speech. How to do that? I just need to move all values up or down vertically at a certain point in time. For instance, if I want to move all frequencies at time bin 0.75 seconds up by 3 bins, it means that the frequency will be increased and the melody at that point will sound higher.

    Code:

    def transform(sentence_audio_sample, mode=None, show_spectrograms=False, frames_from_end_to_transform=12):
        # Cut out silence at both ends.
        y_trimmed, idx = librosa.effects.trim(sentence_audio_sample, top_db=60, frame_length=256, hop_length=64)

        # hop_length is a module-level constant in the project.
        stft_original = librosa.stft(y_trimmed, hop_length=hop_length, pad_mode='constant', center=True)

        stft_original_roll = stft_original.copy()
        rolled = stft_original_roll.copy()

        source_frames_count = np.shape(stft_original_roll)[1]
        sentence_ending_first_frame = source_frames_count - frames_from_end_to_transform
        sentence_len = np.shape(stft_original_roll)[1]

        # Shift the frequency bins of the last frames according to the melody pattern.
        for i in range(sentence_ending_first_frame + 1, sentence_len):
            if mode == 'question':
                by = int(_question_pattern(i) / 500)
            elif mode == 'exclamation':
                by = int(_exclamation_pattern(i) / 500)
            else:
                by = 0
            rolled = _roll_column(rolled, i, by)

        # Back to the time domain.
        transformed_data = librosa.istft(rolled, hop_length=hop_length, center=True)
        return transformed_data
    

    def _roll_column(two_d_array, column, shift):
        two_d_array[:, column] = np.roll(two_d_array[:, column], shift)
        return two_d_array

    In this case I am simply rolling frequencies up or down at a given time bin.

    This needs to be polished, as it doesn't take into consideration the actual state of the transformed sample. It just rolls it up/down according to the factor calculated with the polynomial function computed earlier.
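
    One more caveat (a toy example of mine): np.roll is circular, so bins shifted past the top of a column wrap around to the bottom:

    import numpy as np

    column = np.array([1, 2, 3, 4, 5])
    print(np.roll(column, 2))  # [4 5 1 2 3] - the last two values wrap to the front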

    You can check the full code of my project on GitHub; the "audio" package contains the pattern calculator and the audio transform algorithm described above.

    Feel free to ask if something's unclear :)