pythonnumpyscipycorrelationcross-correlation

Finding where an excerpt starts in an audio file: cross-correlation coefficients between two arrays in Python


I have been pulling my hair on a problem, and all of the answers I found on StackOverflow so far did not help - so I am asking for your help.

The overall problem

I would like to create a function that finds the exact timestamp where an audio excerpt starts in a larger audio file. For test purposes, I used a 5 minutes audio file and a 43 second excerpt of it. Below, I aligned the two audio files in Audacity: the excerpt starts exactly at 00:01:55.554920.

Exact alignment in Audacity between an audio file and a 43-second excerpt.

I would also like the function to return a value if, and only if it has a confidence value over a certain threshold, that would actually be a parameter of the function. The way I intend to do it is by checking if the correlation coefficient between the two aligned signals is over the given threshold.

In other words, here is a simplified version of the code:

find_excerpt_starting_sample(original_audio, excerpt, threshold):

    # Find the cross-correlation coefficients for each lag
    xcorr = cross_correlation(original_audio, excerpt)

    # Return the lag of the max correlation if it is over threshold
    if np.max(xcorr) > threshold:
        return np.argmax(xcorr)
    else:
        raise Exception("No correlation over threshold found.")

I have been having a lot of troubles finding the right cross_correlation function, because none of my attempts returned an array that would be between 0 and 1.

The problem, simplified

As my attempts on the audio files have been inconclusive, I have tried to perform the same thing on two numerical arrays:

y1 = [2, 22, 14, 8, 0, 4, 8, 16, 26, 6, 12, 14, 16, 2, 6]
y2 = [4, 8, 16, 26, 6, 12]

Here, y2 contains a subset of y1 (starting at index 5). In order to make sure that the function works independently of the amplitude scale, I halved all of the values of y2:

y1 = [2, 22, 14, 8, 0, 4, 8, 16, 26, 6, 12, 14, 16, 2, 6]
y2 = [2, 4, 8, 13, 3, 6]

I would like to create a cross-correlation function that returns an array where the value at lag 5 is 1.

My attempts so far

np.corrcoef

If we just do a simple correlation and slide the excerpt along the original audio, it works:

import numpy as np
import matplotlib as plt

corr = np.zeros(len(y1) - len(y2))
for i in range(len(y1) - len(y2)):
    corr[i] = np.corrcoef(y1[i:i+len(y2)], y2)[0][1]

print(corr)
plt.plot(corr)
plt.show()

The output is:

[ 0.18961375 -0.71250433 -0.56075283 -0.08468414  0.21913077  1. -0.04179451 -0.46803451 -0.24815461]

The cross-correlation, using np.corrcoeff

The problem is that this technique is really, really not efficient for longer files.

scipy.signal.correlate

Now, instead of reinventing the wheel, I started using one of the main solutions found on Stack Overflow, i.e. the correlate function of scipy.signal. It returns values where the proper lag is found. However, as is, because it performs a convolution, there is no way to quantify the correlation.

from scipy import signal

xcorr = signal.correlate(y1, y2, mode="full")
lags = signal.correlation_lags(len(y1), len(y2), mode="full")

print(xcorr)
print(lags)
plt.plot(lags, xcorr)
plt.show()

The output is:

[ 12 138 176 392 390 332 224 232 356 402 596 486 478 414 422 252 186  88 28  12]
[-5 -4 -3 -2 -1  0  1  2  3  4  5  6  7  8  9 10 11 12 13 14]

The cross-correlation, using scipy.signal.correlate

I saw a few solutions, but they do not work as I intend.

First, a solution here suggests to normalize the coefficient using this function:

corr = signal.correlate(y1 / np.std(y1), y2 / np.std(y2), 'full') / min(len(y1), len(y2))
lags = signal.correlation_lags(len(y1), len(y2), mode="full")
print(c)
plt.plot(lags, c) 
plt.show() 

The output is:

[0.0736392  0.84685082 1.08004163 2.40554727 2.39327407 2.03735126 1.37459844 1.42369124 2.18462966 2.46691327 3.65741371 2.98238769 2.93329489 2.54055247 2.58964528 1.54642325 1.14140763 0.54002082 0.17182481 0.0736392 ]

The normalized cross-correlation, using scipy.signal.correlate

As you can see, the maximum value is not 1, but 3.65741371.

I then tried another solution found here:

y1n = y1 / np.std(y1)
y2n = y2 / np.std(y2)
xcorr = signal.correlate(y1n, y2n, mode="full")
lags = signal.correlation_lags(len(y1), len(y2), mode="full")
print(xcorr)
plt.plot(lags, xcorr)
plt.show()

The output is:

[ 0.44183521  5.08110495  6.48024979 14.43328362 14.35964442 12.22410756  8.24759064  8.54214745 13.10777798 14.80147963 21.94448224 17.89432612 17.59976931 15.24331484 15.53787165  9.27853947  6.8484458   3.24012489  1.03094883  0.44183521]

A second attempt at the normalized cross-correlation, using scipy.signal.correlate

Once again, the max value for the cross-correlation is not 1 but 21.94448224

A cry for help

There is a lot I do not know about correlation - I dug into it, but before going deeper I'm asking, if one of you happened to be able to point me in the right direction, and what I did wrong so far.

Thanks a lot!


Solution

  • I adapted the code from @Onyambu (thanks again for your help!) in order to replace the convolutions by signal.correlate(). Here is what I got:

    import numpy as np
    
    def cross_correlation(y1, y2):
        y2_normalized = (y2 - y2.mean()) / y2.std() / np.sqrt(y2.size)
        y1_m = signal.correlate(y1, np.ones(y2.size), 'valid') ** 2 / y2_normalized.size
        y1_m2 = signal.correlate(y1 ** 2, np.ones(y2.size), "valid")
        cross_correlation = signal.correlate(y1, y2_normalized, "valid") / np.sqrt(y1_m2 - y1_m)
    

    The code is much faster when cross-correlating audio files; the execution of the code took less than 2 seconds for the audio clips I mentioned at the beginning of my problem (that are 5 minutes and 43 seconds, respectively, sampled at 44100 Hz).

    Here is the output: The output of the cross-correlation of two audio files

    The peak has a value of 1.0, and is precise to the millisecond.

    Note: before performing the cross-correlation, I got the envelope of both of the audio files using y_env = np.abs(scipy.signal.hilbert(y)) and a low-pass filter at 50 Hz using b, a = scipy.signal.butter(2, filter_over, "low", fs=44100) and y_filt = lfilter(b, a, y_env). You can also perform a downsampling at any point in case you have very long data.

    EDIT: For posterity, I have been working on a ready-to-use, downloadable function to find the delays and plot them. Here it is: https://github.com/RomainPastureau/find_delay