Big picture: I am trying to identify proxy fraud (a stand-in impersonating the real candidate) in video interviews.
I have video clips of interviews, with two or more interviews per person. As a first step, I am extracting the audio from each interview and matching the clips to determine whether the audio comes from the same person.
I used the Python library librosa to parse the audio files and generate MFCC and chroma_cqt features, and then built a cross-similarity matrix from them. I want to convert this similarity matrix into a score between 0 and 100, where 100 is a perfect match and 0 is totally different. I can then pick a threshold and label the audio files.
Code:
import librosa

hop_length = 1024

# librosa.load resamples to 22050 Hz by default, so sr1 == sr2 here
y_ref, sr1 = librosa.load(r"audio1.wav")
y_comp, sr2 = librosa.load(r"audio2.wav")

chroma_ref = librosa.feature.chroma_cqt(y=y_ref, sr=sr1, hop_length=hop_length)
chroma_comp = librosa.feature.chroma_cqt(y=y_comp, sr=sr2, hop_length=hop_length)

# librosa >= 0.10 requires keyword arguments here (positional y/sr raise a TypeError)
# Note: these MFCCs are not used further in this snippet
mfcc1 = librosa.feature.mfcc(y=y_ref, sr=sr1, n_mfcc=13)
mfcc2 = librosa.feature.mfcc(y=y_comp, sr=sr2, n_mfcc=13)

# Use time-delay embedding to get a cleaner recurrence matrix
x_ref = librosa.feature.stack_memory(chroma_ref, n_steps=10, delay=3)
x_comp = librosa.feature.stack_memory(chroma_comp, n_steps=10, delay=3)

sim = librosa.segment.cross_similarity(x_comp, x_ref, metric='cosine')
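If you just want a single number out of the matrix you already have: with cross_similarity's default mode='connectivity', sim is a binary matrix flagging matching frame pairs, so one crude summary is the percentage of flagged pairs. This is an ad hoc sketch, not an established metric:

import numpy as np

# sim from above is a boolean connectivity matrix (default mode='connectivity').
# Crude summary: percentage of (comp, ref) frame pairs flagged as similar.
# The scaling is ad hoc; any "same person" threshold must be tuned on labelled pairs.
score = 100.0 * np.asarray(sim, dtype=float).mean()
print(f"similarity score: {score:.1f} / 100")

Note, however, that frame-level similarity of chroma/MFCC features mostly tracks the content of the audio, which makes this score a weak signal for speaker identity.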
The task of identifying who is talking is called Speaker Identification. Checking whether two audio clips come from the same speaker is called Speaker Verification. If there are multiple speakers in a dialog, it may also be relevant to do Speaker Diarization, i.e., finding out who talks when; that would let you focus on the interview subject rather than the interviewer.
Speaker recognition tasks like these are best solved with a deep neural network, as it is quite a difficult task to separate who is speaking from what is being said. Such models generally output a speaker embedding: a fixed-length vector representation in which clips from the same person lie close together. One can then apply a simple similarity metric to these embeddings, such as cosine similarity.
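For instance, given two such embedding vectors (however they are produced), a small helper can map cosine similarity onto the 0-100 scale from the question. The linear rescaling from [-1, 1] to [0, 100] is a convention assumed here, and the accept/reject threshold still has to be tuned on labelled pairs:

import numpy as np

def embedding_score(emb1: np.ndarray, emb2: np.ndarray) -> float:
    """Map cosine similarity of two speaker embeddings to a 0-100 score."""
    cos = np.dot(emb1, emb2) / (np.linalg.norm(emb1) * np.linalg.norm(emb2))
    # Linearly rescale cosine similarity from [-1, 1] to [0, 100].
    # The scaling is cosmetic; the decision threshold is what matters.
    return 50.0 * (cos + 1.0)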
Pretrained models are available for this, for example in pyannote-audio and in SpeechBrain.
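A minimal sketch using SpeechBrain's pretrained ECAPA-TDNN speaker-verification model (the import path moved from speechbrain.pretrained to speechbrain.inference in newer releases, so check against the version you install):

# pip install speechbrain
from speechbrain.pretrained import SpeakerRecognition  # speechbrain.inference in newer releases

# Download a pretrained ECAPA-TDNN verification model from HuggingFace Hub
verifier = SpeakerRecognition.from_hparams(
    source="speechbrain/spkrec-ecapa-voxceleb",
    savedir="pretrained_ecapa",
)

# verify_files embeds both clips and returns a cosine-similarity score
# plus a same/different decision made against the model's default threshold
score, same_speaker = verifier.verify_files("audio1.wav", "audio2.wav")
print(f"cosine score: {float(score):.3f}, same speaker: {bool(same_speaker)}")

Running this on every pair of clips per candidate gives you scores you can threshold directly, instead of hand-crafting one from chroma/MFCC similarity matrices.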