python audio librosa audio-processing

How to get complete fundamental frequency (f0) extraction with the Python library function librosa.pyin?


I am running librosa.pyin on a speech audio clip, and it doesn't seem to be extracting all the fundamentals (f0) from the first part of the recording.

librosa documentation: https://librosa.org/doc/main/generated/librosa.pyin.html

sr: 22050

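import librosa

# Assumes a local copy of the audio file linked below
y, sr = librosa.load('quick_fox.wav', sr=22050)

fmin = librosa.note_to_hz('C0')
fmax = librosa.note_to_hz('C7')

f0, voiced_flag, voiced_probs = librosa.pyin(y,
                                             fmin=fmin,
                                             fmax=fmax,
                                             pad_mode='constant',
                                             n_thresholds=10,
                                             max_transition_rate=100,
                                             sr=sr)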

Raw audio:

[image: raw audio]

Spectrogram with fundamental tones, onsets, and onset strength. No fundamental tones are extracted for the first part of the recording.

link to audio file: https://jasonmhead.com/wp-content/uploads/2022/12/quick_fox.wav

o_env = librosa.onset.onset_strength(y=y, sr=sr)  # onset strength envelope
times = librosa.times_like(o_env, sr=sr)
onset_frames = librosa.onset.onset_detect(onset_envelope=o_env, sr=sr)

[image: spectrogram with fundamental tones, onsets, and onset strength]

Another view with power spectrogram:

[image: power spectrogram]

I tried compressing the audio, but that didn't seem to work.

Any suggestions on what parameters I can adjust, or audio pre-processing that can be done to have fundamental tones extracted from all words?

What kinds of things affect the success of fundamental tone extraction?


Solution

  • TL;DR: It seems to be mostly a matter of parameter tweaking.

    Here are some results I got while playing with the example (the image is best opened in a separate tab): [some graphs]. The bottom plot shows a rough phonetic transcription of the example file. Some conclusions I drew for myself:

    1. Some words or parts of words are difficult to hear: they have low energy, and heard in isolation they don't sound like a word. They only become recognizable together with the nearby segments ("the" is very short and sounds more like "z").
    2. Some words are divided into parts (e.g. "fo"-"x").
    3. I don't really know what the F0 frequency should be when someone pronounces "x". I'm not even sure that there is any difference in pronunciation between people (otherwise, how would cats all over the world know that we are calling them?).
    4. A two-second clip is a pretty short amount of audio.

    Some experiments:

    Those are my thoughts. Hope it helps.