I am running librosa.pyin on a speech audio clip, and it doesn't seem to extract any fundamental frequencies (f0) from the first part of the recording.
librosa documentation: https://librosa.org/doc/main/generated/librosa.pyin.html
import librosa

y, sr = librosa.load('quick_fox.wav', sr=22050)  # sr: 22050

fmin = librosa.note_to_hz('C0')
fmax = librosa.note_to_hz('C7')

f0, voiced_flag, voiced_probs = librosa.pyin(y,
                                             fmin=fmin,
                                             fmax=fmax,
                                             pad_mode='constant',
                                             n_thresholds=10,
                                             max_transition_rate=100,
                                             sr=sr)
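For reference, here is a quick check of how many frames come back unvoiced; this is a minimal sketch that assumes the variables (`f0`, `voiced_flag`, `sr`) from the call above and the default `frame_length`/`hop_length`:

```python
import librosa

# pyin leaves f0 as NaN for frames it decides are unvoiced (fill_na defaults to NaN),
# so counting unvoiced frames shows how much of the clip gets no fundamental.
f0_times = librosa.times_like(f0, sr=sr)   # frame times (default hop_length = 512)
unvoiced = ~voiced_flag

print(f"unvoiced frames: {unvoiced.sum()} / {len(f0)}")

# How much of, say, the first second came back with no f0
first_second = f0_times < 1.0
print(f"unvoiced in the first second: {unvoiced[first_second].mean():.0%}")
```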
Raw audio:
Spectrogram with fundamental tones, onsets, and onset strength; the first part doesn't have any fundamental tones extracted.
link to audio file: https://jasonmhead.com/wp-content/uploads/2022/12/quick_fox.wav
o_env = librosa.onset.onset_strength(y=y, sr=sr)
times = librosa.times_like(o_env, sr=sr)
onset_frames = librosa.onset.onset_detect(onset_envelope=o_env, sr=sr)
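For completeness, a plot like the one described above can be produced roughly as follows. This is a sketch assuming matplotlib and the variables from the snippets above (`y`, `sr`, `f0`, `o_env`, `times`, `onset_frames`), not necessarily the exact code behind the screenshots:

```python
import numpy as np
import librosa
import librosa.display
import matplotlib.pyplot as plt

# Power spectrogram in dB
D = librosa.amplitude_to_db(np.abs(librosa.stft(y)), ref=np.max)

fig, ax = plt.subplots(nrows=2, sharex=True, figsize=(12, 6))

# Spectrogram with the extracted f0 contour and onset markers on top
librosa.display.specshow(D, sr=sr, x_axis='time', y_axis='log', ax=ax[0])
ax[0].plot(librosa.times_like(f0, sr=sr), f0, color='cyan', linewidth=2, label='f0 (pyin)')
ax[0].vlines(times[onset_frames], 0, sr / 2, color='r', alpha=0.6, linestyle='--', label='onsets')
ax[0].legend(loc='upper right')

# Onset strength envelope with the same onset markers
ax[1].plot(times, o_env, label='onset strength')
ax[1].vlines(times[onset_frames], 0, o_env.max(), color='r', alpha=0.6, linestyle='--')
ax[1].legend()

plt.show()
```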
Another view with power spectrogram:
I tried compressing the audio, but that didn't seem to help.
Any suggestions on which parameters I can adjust, or what audio pre-processing I can apply, so that fundamental tones are extracted from all the words?
What type of things affect fundamental tone extraction success?
TL;DR It seems like it's mostly a matter of parameter tweaking.
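To illustrate the kind of tweaking meant here (an illustrative sketch only; the exact values behind the plots below are not reproduced here), narrowing fmin/fmax to a typical speech range and enlarging the analysis frame changes which frames pyin treats as voiced:

```python
import librosa

y, sr = librosa.load('quick_fox.wav')  # the linked example file

# A speech-oriented search range (roughly C2-C5, ~65-523 Hz) and a longer frame
f0, voiced_flag, voiced_probs = librosa.pyin(
    y,
    sr=sr,
    fmin=librosa.note_to_hz('C2'),
    fmax=librosa.note_to_hz('C5'),
    frame_length=4096,          # longer frames help with low fundamentals
    n_thresholds=100,           # back to the library default
    max_transition_rate=35.92,  # back to the library default
)
```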
Here are some results I got while playing with the example (it's better to open the image in a separate tab). The bottom plot shows a rough phonetic transcription of the example file. Some conclusions I've drawn for myself:
Some experiments:
Those are my thoughts. Hope it helps.