I have been pulling my hair out over a problem, and none of the answers I have found on Stack Overflow so far have helped, so I am asking for your help.
I would like to create a function that finds the exact timestamp at which an audio excerpt starts in a larger audio file. For test purposes, I used a 5-minute audio file and a 43-second excerpt of it. Below, I aligned the two audio files in Audacity: the excerpt starts exactly at 00:01:55.554920.
I would also like the function to return a value if, and only if, its confidence is over a certain threshold, which would be a parameter of the function. The way I intend to do this is by checking whether the correlation coefficient between the two aligned signals exceeds the given threshold.
In other words, here is a simplified version of the code:
def find_excerpt_starting_sample(original_audio, excerpt, threshold):
    # Find the cross-correlation coefficient for each lag
    xcorr = cross_correlation(original_audio, excerpt)
    # Return the lag of the maximum correlation if it is over the threshold
    if np.max(xcorr) > threshold:
        return np.argmax(xcorr)
    else:
        raise Exception("No correlation over threshold found.")
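As a small aside, converting the starting sample returned by such a function into a timestamp like the one above only requires the file's sample rate. A minimal sketch, assuming 44100 Hz (the helper name is mine, not part of the question's code):

```python
import datetime

def sample_to_timestamp(sample_index, sample_rate=44100):
    """Convert a sample index into an h:mm:ss.ffffff timestamp string."""
    return str(datetime.timedelta(seconds=sample_index / sample_rate))

print(sample_to_timestamp(44100))  # 1 second into the file: 0:00:01
```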
I have been having a lot of trouble finding the right cross_correlation function, because none of my attempts returned an array of coefficients bounded between -1 and 1.
As my attempts on the audio files have been inconclusive, I have tried to perform the same thing on two numerical arrays:
y1 = [2, 22, 14, 8, 0, 4, 8, 16, 26, 6, 12, 14, 16, 2, 6]
y2 = [4, 8, 16, 26, 6, 12]
Here, y2 contains a subset of y1 (starting at index 5). In order to make sure that the function works independently of the amplitude scale, I halved all of the values of y2:
y1 = np.array([2, 22, 14, 8, 0, 4, 8, 16, 26, 6, 12, 14, 16, 2, 6])
y2 = np.array([2, 4, 8, 13, 3, 6])
I would like to create a cross-correlation function that returns an array where the value at lag 5 is 1.
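As a sanity check, the Pearson coefficient of the slice of y1 aligned with y2 is indeed 1 despite the halved amplitude, since Pearson correlation is invariant to linear scaling:

```python
import numpy as np

y1 = [2, 22, 14, 8, 0, 4, 8, 16, 26, 6, 12, 14, 16, 2, 6]
y2 = [2, 4, 8, 13, 3, 6]

# y1[5:11] is [4, 8, 16, 26, 6, 12], i.e. exactly 2 * y2
r = np.corrcoef(y1[5:5 + len(y2)], y2)[0][1]
print(r)  # 1.0 (up to floating-point error)
```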
If we just do a simple correlation and slide the excerpt along the original audio, it works:
import numpy as np
import matplotlib.pyplot as plt

corr = np.zeros(len(y1) - len(y2))
for i in range(len(y1) - len(y2)):
    corr[i] = np.corrcoef(y1[i:i+len(y2)], y2)[0][1]
print(corr)
plt.plot(corr)
plt.show()
The output is:
[ 0.18961375 -0.71250433 -0.56075283 -0.08468414 0.21913077 1. -0.04179451 -0.46803451 -0.24815461]
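One detail worth noting: range(len(y1) - len(y2)) stops one window short, so the last possible lag (here, lag 9) is never tested. Extending the range by one includes it:

```python
import numpy as np

y1 = [2, 22, 14, 8, 0, 4, 8, 16, 26, 6, 12, 14, 16, 2, 6]
y2 = [2, 4, 8, 13, 3, 6]

n_lags = len(y1) - len(y2) + 1  # +1 so the final window is included
corr = np.zeros(n_lags)
for i in range(n_lags):
    corr[i] = np.corrcoef(y1[i:i + len(y2)], y2)[0][1]

print(len(corr))        # 10 lags instead of 9
print(np.argmax(corr))  # still 5
```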
The problem is that this technique is very inefficient for longer arrays.
Now, instead of reinventing the wheel, I started using one of the main solutions found on Stack Overflow, i.e. the correlate function of scipy.signal. It does find the proper lag. However, as is, it returns raw sums of products rather than normalized coefficients, so there is no way to quantify the correlation.
from scipy import signal
xcorr = signal.correlate(y1, y2, mode="full")
lags = signal.correlation_lags(len(y1), len(y2), mode="full")
print(xcorr)
print(lags)
plt.plot(lags, xcorr)
plt.show()
The output is:
[ 12 138 176 392 390 332 224 232 356 402 596 486 478 414 422 252 186 88 28 12]
[-5 -4 -3 -2 -1 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14]
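Even with these raw, unnormalized values, the lag of the best match can be read off by indexing lags with the argmax of the correlation (which gives 5 for the toy arrays); what is missing is only a bounded score to compare against a threshold:

```python
import numpy as np
from scipy import signal

y1 = [2, 22, 14, 8, 0, 4, 8, 16, 26, 6, 12, 14, 16, 2, 6]
y2 = [2, 4, 8, 13, 3, 6]

xcorr = signal.correlate(y1, y2, mode="full")
lags = signal.correlation_lags(len(y1), len(y2), mode="full")

# The peak of the raw cross-correlation still sits at the correct lag
best_lag = lags[np.argmax(xcorr)]
print(best_lag)  # 5
```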
I saw a few solutions, but they do not work as I intend.
First, a solution here suggests normalizing the coefficients using this function:
corr = signal.correlate(y1 / np.std(y1), y2 / np.std(y2), 'full') / min(len(y1), len(y2))
lags = signal.correlation_lags(len(y1), len(y2), mode="full")
print(corr)
plt.plot(lags, corr)
plt.show()
The output is:
[0.0736392 0.84685082 1.08004163 2.40554727 2.39327407 2.03735126 1.37459844 1.42369124 2.18462966 2.46691327 3.65741371 2.98238769 2.93329489 2.54055247 2.58964528 1.54642325 1.14140763 0.54002082 0.17182481 0.0736392 ]
As you can see, the maximum value is not 1, but 3.65741371.
I then tried another solution found here:
y1n = y1 / np.std(y1)
y2n = y2 / np.std(y2)
xcorr = signal.correlate(y1n, y2n, mode="full")
lags = signal.correlation_lags(len(y1), len(y2), mode="full")
print(xcorr)
plt.plot(lags, xcorr)
plt.show()
The output is:
[ 0.44183521 5.08110495 6.48024979 14.43328362 14.35964442 12.22410756 8.24759064 8.54214745 13.10777798 14.80147963 21.94448224 17.89432612 17.59976931 15.24331484 15.53787165 9.27853947 6.8484458 3.24012489 1.03094883 0.44183521]
Once again, the maximum value of the cross-correlation is not 1, but 21.94448224.
There is a lot I do not know about correlation. I have dug into it, but before going deeper, I am asking whether any of you could point me in the right direction and tell me what I have done wrong so far.
Thanks a lot!
I adapted the code from @Onyambu (thanks again for your help!) to replace the convolutions with signal.correlate(). Here is what I got:
import numpy as np
from scipy import signal

def cross_correlation(y1, y2):
    # Normalize the excerpt: zero mean, unit Euclidean norm
    y2_normalized = (y2 - y2.mean()) / y2.std() / np.sqrt(y2.size)
    # Windowed (sum)^2 / n and windowed sum of squares of y1,
    # used to normalize each window of y1
    y1_m = signal.correlate(y1, np.ones(y2.size), "valid") ** 2 / y2_normalized.size
    y1_m2 = signal.correlate(y1 ** 2, np.ones(y2.size), "valid")
    # Normalized cross-correlation: 1.0 at a perfect (possibly scaled) match
    return signal.correlate(y1, y2_normalized, "valid") / np.sqrt(y1_m2 - y1_m)
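Applied to the toy arrays from the beginning (as NumPy arrays, since the method uses .mean(), .std() and .size), the peak is exactly 1.0 at lag 5. Here is a self-contained version of the same normalized cross-correlation for reference (the function name is mine):

```python
import numpy as np
from scipy import signal

def normalized_cross_correlation(y1, y2):
    # Zero-mean, unit-norm template, so a perfect (scaled) match scores 1.0
    y2_norm = (y2 - y2.mean()) / y2.std() / np.sqrt(y2.size)
    # Per-window (sum)^2 / n and sum of squares of y1, for the denominator
    win_sum_sq = signal.correlate(y1, np.ones(y2.size), "valid") ** 2 / y2.size
    win_sq_sum = signal.correlate(y1 ** 2, np.ones(y2.size), "valid")
    # Equivalent to the Pearson coefficient of each window of y1 against y2
    return signal.correlate(y1, y2_norm, "valid") / np.sqrt(win_sq_sum - win_sum_sq)

y1 = np.array([2, 22, 14, 8, 0, 4, 8, 16, 26, 6, 12, 14, 16, 2, 6], dtype=float)
y2 = np.array([2, 4, 8, 13, 3, 6], dtype=float)

ncc = normalized_cross_correlation(y1, y2)
print(np.argmax(ncc))  # 5
print(np.max(ncc))     # 1.0 (up to floating-point error)
```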
The code is much faster when cross-correlating audio files; execution took less than 2 seconds for the audio clips mentioned at the beginning of my problem (5 minutes and 43 seconds long, respectively, sampled at 44100 Hz).
The peak has a value of 1.0 and is precise to the millisecond.
Note: before performing the cross-correlation, I extracted the envelope of both audio files using y_env = np.abs(scipy.signal.hilbert(y)), then applied a 50 Hz low-pass filter with b, a = scipy.signal.butter(2, filter_over, "low", fs=44100) and y_filt = scipy.signal.lfilter(b, a, y_env). You can also downsample at any point in case you have very long data.
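The preprocessing described above can be sketched as follows on a synthetic signal (the 50 Hz cutoff and 2nd-order Butterworth filter are the values mentioned above; adapt them to your data):

```python
import numpy as np
from scipy import signal

fs = 44100                       # sample rate (Hz)
t = np.arange(fs) / fs           # 1 second of samples
y = np.sin(2 * np.pi * 440 * t)  # synthetic 440 Hz tone standing in for audio

# Envelope via the magnitude of the analytic signal (Hilbert transform)
y_env = np.abs(signal.hilbert(y))

# 2nd-order Butterworth low-pass at 50 Hz to smooth the envelope
b, a = signal.butter(2, 50, "low", fs=fs)
y_filt = signal.lfilter(b, a, y_env)

print(y_filt.shape)  # same length as the input
```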
EDIT: For posterity, I have been working on a ready-to-use, downloadable function to find the delays and plot them. Here it is: https://github.com/RomainPastureau/find_delay