python signal-processing librosa pitch-tracking onset-detection

Recorded audio of one note produces multiple onset times

I am using the Librosa library for pitch and onset detection. Specifically, I am using onset_detect and piptrack.

This is my code:

import librosa
from scipy import signal

def detect_pitch(y, sr, onset_offset=5, fmin=75, fmax=1400):
  y = highpass_filter(y, sr)

  onset_frames = librosa.onset.onset_detect(y=y, sr=sr)
  pitches, magnitudes = librosa.piptrack(y=y, sr=sr, fmin=fmin, fmax=fmax)

  notes = []

  for i in range(0, len(onset_frames)):
    onset = onset_frames[i] + onset_offset
    index = magnitudes[:, onset].argmax()
    pitch = pitches[index, onset]
    if (pitch != 0):
      notes.append(librosa.hz_to_note(pitch))

  return notes

def highpass_filter(y, sr):
  filter_stop_freq = 70  # Hz
  filter_pass_freq = 100  # Hz
  filter_order = 1001

  # High-pass filter
  nyquist_rate = sr / 2.
  desired = (0, 0, 1, 1)
  bands = (0, filter_stop_freq, filter_pass_freq, nyquist_rate)
  filter_coefs = signal.firls(filter_order, bands, desired, fs=sr)

  # Apply high-pass filter
  filtered_audio = signal.filtfilt(filter_coefs, [1], y)
  return filtered_audio

When I run this on guitar samples recorded in a studio, and therefore free of noise (like this), I get very good results from both functions. The onset times are correct and the frequencies are almost always correct (with occasional octave errors).

However, a big problem arises when I try to record my own guitar sounds with my cheap microphone. I get audio files with noise, such as this. The onset_detect algorithm gets confused and thinks that noise contains onset times. Therefore, I get very bad results. I get many onset times even if my audio file consists of one note.

Here are two waveforms. The first is of a guitar sample of a B3 note recorded in a studio, whereas the second is my recording of an E2 note.

The result of the first is correctly B3 (the one onset time was detected). The result of the second is an array of 7 elements, which means that 7 onset times were detected, instead of 1! One of those elements is the correct onset time; the others are just random peaks in the noise part.

Another example is this audio file containing the notes B3, C4, D4, E4:

As you can see, the noise is clear and my high-pass filter has not helped (this is the waveform after applying the filter).

I assume this is a matter of noise, as the difference between those files lies there. If yes, what could I do to reduce it? I have tried using a high-pass filter but there is no change.

Solution

I have three observations to share.

First, after a bit of playing around, I've concluded that the onset detection algorithm appears to have been designed to automatically rescale its own operation to account for the local background noise level at any given instant. This is presumably so that it can detect onsets in pianissimo sections as readily as in fortissimo sections. The unfortunate result is that the algorithm tends to trigger on the background noise coming from your cheap microphone: the onset detection algorithm honestly thinks it's simply listening to pianissimo music.

A second observation is that the first ~2200 samples of your recorded example (roughly the first 0.1 seconds) are a bit wonky, in the sense that the noise truly is nearly zero during that short initial interval. Try zooming way into the waveform at the starting point and you'll see what I mean. Unfortunately, the start of the guitar playing (around sample 3000) follows so quickly after the noise onset that the algorithm is unable to resolve the two independently; instead it merges them into a single onset event that begins about 0.1 seconds too early. I therefore cut out the first 2240 samples in order to "normalize" the file. I don't think this is cheating: it's an edge effect that would likely disappear if you had simply recorded a second or so of initial silence before plucking the first string, as one would normally do.
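To put numbers on that trimmed interval, here is a small sketch that converts between sample counts and seconds. It assumes a sampling rate of 22050 Hz, which is what librosa.load uses by default and which is consistent with the sample numbers and times printed by the script further down. The helper names are made up for illustration.

```python
SR = 22050  # assumed sampling rate in Hz (the default used by librosa.load)

def samples_to_seconds(n_samples, sr=SR):
    """Convert a sample count to a duration in seconds."""
    return n_samples / sr

def seconds_to_samples(seconds, sr=SR):
    """Convert a duration in seconds to the nearest whole sample count."""
    return int(round(seconds * sr))

print(samples_to_seconds(2240))  # length of the trimmed section, about 0.10 s
print(seconds_to_samples(0.1))   # 2205 samples in 0.1 s
print(samples_to_seconds(3000))  # approximate start of the guitar, about 0.14 s
```

So the quiet section and the first pluck sit only a few hundredths of a second apart, which is less than two 512-sample hops at this rate.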

My third observation is that frequency-based filtering only works if the noise and the music actually occupy somewhat different frequency bands. That may be true in this case, but I don't think you've demonstrated it yet. Therefore, instead of frequency-based filtering, I elected to try a different approach: thresholding. I used the final 3 seconds of your recording, where there is no guitar playing, to estimate the typical background noise level throughout the recording in units of RMS energy. I took the median RMS energy over those 3 seconds, along with a measure of its typical variation, and set a minimum energy threshold that lies safely above the median. Only onset events returned by the detector at times when the RMS energy is above the threshold are accepted as "valid".
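The thresholding idea can be tried in isolation on a toy envelope before involving any audio. The sketch below is only an illustration under stated assumptions: the envelope values, the candidate onset frames, and the helper names noise_threshold and filter_onsets are all invented, and the "noise" is a deterministic ripple rather than real microphone noise. It also looks up the RMS value at each onset frame index directly, which is a slightly different way of applying the same test than matching times.

```python
import numpy as np

def noise_threshold(rms, noise_mask, n_sigma=5.0):
    """Return median + n_sigma * sigma, estimated from noise-only frames.

    sigma is the distance between the 84.1th percentile and the median,
    which equals one standard deviation for Gaussian-distributed data.
    """
    noise = rms[noise_mask]
    median = np.percentile(noise, 50)
    sigma = np.percentile(noise, 84.1) - median
    return median + n_sigma * sigma

def filter_onsets(onset_frames, rms, threshold):
    """Keep only the onset frames whose RMS energy exceeds the threshold."""
    onset_frames = np.asarray(onset_frames)
    return onset_frames[rms[onset_frames] > threshold]

# Toy RMS envelope (made-up numbers): a small ripple standing in for noise,
# plus one loud "note" occupying frames 50..79.
rms = 0.01 + 0.002 * (np.arange(200) % 10) / 10
rms[50:80] += 0.3

# Treat the last 50 frames as noise-only, like the final 3 seconds above.
noise_mask = np.arange(len(rms)) >= 150
threshold = noise_threshold(rms, noise_mask)

candidate_onsets = [10, 50, 120, 170]  # hypothetical detector output
print(filter_onsets(candidate_onsets, rms, threshold))  # only frame 50 survives
```

Using the median and a percentile rather than the mean and standard deviation keeps the estimate from being pulled upward by the occasional loud click in the noise-only section.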

An example script is shown below:

import librosa
import numpy as np
import matplotlib.pyplot as plt

# I played around with this but ultimately kept the default value
hoplen=512

y, sr = librosa.load("./Vocaroo_s07Dx8dWGAR0.mp3")
# Note that the first ~2240 samples (0.1 seconds) are anomalously low noise,
# so cut out this section from processing
start = 2240
y = y[start:]
idx = np.arange(len(y))

# Calculate the onset frames in the usual way
onset_frames = librosa.onset.onset_detect(y=y, sr=sr, hop_length=hoplen)
onstm = librosa.frames_to_time(onset_frames, sr=sr, hop_length=hoplen)

# Calculate RMS energy per frame.  I shortened the frame length from the
# default value in order to avoid ending up with too much smoothing
# (librosa.feature.rmse is called librosa.feature.rms in newer librosa releases)
rmse = librosa.feature.rms(y=y, frame_length=512, hop_length=hoplen)[0]
envtm = librosa.frames_to_time(np.arange(len(rmse)), sr=sr, hop_length=hoplen)
# Use final 3 seconds of recording in order to estimate median noise level
# and typical variation
noiseidx = envtm > envtm[-1] - 3.0  # boolean mask over frames
noisemedian = np.percentile(rmse[noiseidx], 50)
sigma = np.percentile(rmse[noiseidx], 84.1) - noisemedian
# Set the minimum RMS energy threshold that is needed in order to declare
# an "onset" event to be equal to 5 sigma above the median
threshold = noisemedian + 5*sigma
threshidx = rmse > threshold  # boolean mask over frames
# Choose the corrected onset times as only those which meet the RMS energy
# minimum threshold requirement
correctedonstm = onstm[[tm in envtm[threshidx] for tm in onstm]]

# Print both in units of actual time (seconds) and sample ID number
print(correctedonstm+start/sr)
print(correctedonstm*sr+start)

fg = plt.figure(figsize=[12, 8])

# Plot the waveform together with onset times superimposed in red
ax1 = fg.add_subplot(2,1,1)
ax1.plot(idx+start, y)
for ii in correctedonstm*sr+start:
    ax1.axvline(ii, color='r')
ax1.set_ylabel('Amplitude', fontsize=16)

# Plot the RMSE together with onset times superimposed in red
ax2 = fg.add_subplot(2,1,2, sharex=ax1)
ax2.plot(envtm*sr+start, rmse)
for ii in correctedonstm*sr+start:
    ax2.axvline(ii, color='r')
# Plot threshold value superimposed as a black dotted line
ax2.axhline(threshold, linestyle=':', color='k')
ax2.set_ylabel("RMSE", fontsize=16)
ax2.set_xlabel("Sample Number", fontsize=16)

fg.show()

Printed output looks like:

In [1]: %run rosatest
[ 0.17124717  1.88952381  3.74712018  5.62793651]
[   3776.   41664.   82624.  124096.]

and the plot that it produces is shown below: