I have a small number of similar types of sounds (which I'll refer to as DB_sounds) to which I need to match recordings (Rec_sounds). Each Rec_sound is short and unique and needs to be matched to its corresponding DB_sound. How do I go about matching them?
To illustrate my problem, consider the following:
Bob, with a deep voice, in room A (with some background noise), says Ma
Alice, with a high voice, in room B, says Eh
A baby is learning to speak; his first word is Eh
Ma and Eh are two different types of DB_sounds, so I have to return two different results. I have several DB_sound samples of different people saying Ma and Eh to compare the Rec_sounds to.
The sounds that I am dealing with are voice recordings of single syllables like la, ba, ne, eh, ma etc.
How should I tackle this?
I don't think audio fingerprinting will work (see the spectrograms below), and existing voice recognition software like this Google API integration in Python doesn't work, since I am not trying to recognize human language, just sounds.
I don't mind building something from the ground up; just point me in a direction you think will work, and please add plenty of justification for why you think so.
[Figure: Spectrograms of 8 samples of a baby saying Eh]
[Figure: Time-domain graphs of 8 samples of a baby saying Eh]
If you just want to recognize sounds, I would start with a simple procedure:
Helpful Python libraries: scipy for reading wav files, essentia for audio feature extraction, and scikit-learn for classification and other machine learning.
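A minimal sketch of that pipeline, assuming you have several labelled DB_sound wav files per syllable. It uses scipy to read the audio and scikit-learn for classification; for brevity the feature extraction here is a simple stand-in (mean log energy per frequency band computed with scipy), not essentia's MFCC-style features, and the file names are hypothetical:

```python
import numpy as np
from scipy.io import wavfile
from scipy import signal
from sklearn.neighbors import KNeighborsClassifier

def features(path, n_bands=20):
    """Read a wav file and return a fixed-length feature vector:
    mean log energy in n_bands frequency bands of the spectrogram."""
    rate, samples = wavfile.read(path)
    if samples.ndim > 1:                 # mix stereo down to mono
        samples = samples.mean(axis=1)
    f, t, spec = signal.spectrogram(samples, fs=rate)
    bands = np.array_split(spec, n_bands, axis=0)   # group frequency bins into bands
    return np.array([np.log(b.mean() + 1e-10) for b in bands])

# Hypothetical training set: several DB_sound samples per syllable, labelled by syllable.
train_files = [("ma_bob_01.wav", "ma"), ("ma_carol_01.wav", "ma"),
               ("eh_alice_01.wav", "eh"), ("eh_baby_01.wav", "eh")]

X = np.array([features(path) for path, _ in train_files])
y = [label for _, label in train_files]

clf = KNeighborsClassifier(n_neighbors=3).fit(X, y)

# Classify a new Rec_sound against the DB_sounds.
print(clf.predict([features("rec_unknown.wav")]))
```

Averaging each band over time gives a fixed-length vector regardless of recording duration, which is what keeps the classifier simple; swapping in MFCCs (e.g. from essentia) for the band energies should only improve separation between syllables.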