python · audio · signal-processing · voice · phoneme

categorizing short audio samples


I have a small number of similar types of sounds (I shall refer to these as DB_sounds) against which I need to match recordings (Rec_sounds). Each Rec_sound is short and unique and needs to be matched to its corresponding DB_sound. How do I go about matching them?

To illustrate my problem, consider the following:
Bob, with a deep voice, in room A (with some background noise) says Ma
Alice, with a high voice, in room B says Eh
A baby is learning to speak. His first word is Eh

Ma and Eh are 2 different types of DB_sounds, so I have to return 2 different results. I have several DB_sound samples of different people saying Ma and Eh to compare the Rec_sounds against.

The sounds that I am dealing with are voice recordings of single syllables like la, ba, ne, eh, ma etc.

How should I tackle this?
I don't think audio fingerprinting will work (see the spectrograms below), and existing voice-recognition software, like this Google API integration in Python, won't work either, since I am not trying to recognize human language, just sounds.

I don't mind building something from the ground up; just point me in a direction you think will work, and please add plenty of justification for why you think so.

Spectrograms of 8 samples of a baby saying EH

Time-domain graphs of 8 samples of a baby saying EH


Solution

  • If you just want to recognize sounds, I would start with a simple procedure:

    1. Crop silence from each sound sample (simple energy threshold).
    2. Compute Audio Features for each sample of your database (e.g. MFCCs).
    3. Perform a cross-validated classification procedure to map the audio features to the sound category you want to recognize.

    Helpful Python libs: scipy for reading WAV files, essentia for audio feature extraction, and scikit-learn for classification and other machine learning.
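
    To make the three steps concrete, here is a minimal self-contained sketch using only numpy. It is an illustration, not a production pipeline: the energy-threshold cropper and the leave-one-out nearest-centroid classifier follow the steps above, but `band_features` is a deliberately simplified stand-in for real MFCCs (it just averages log FFT magnitudes in a few bands); in practice you would swap it for the MFCC function of a library like essentia. All function names and parameter values here are my own choices.

    ```python
    import numpy as np

    def crop_silence(signal, frame_len=512, threshold=0.01):
        """Step 1: drop leading/trailing frames whose RMS energy is below threshold."""
        n_frames = len(signal) // frame_len
        frames = signal[:n_frames * frame_len].reshape(n_frames, frame_len)
        rms = np.sqrt((frames ** 2).mean(axis=1))
        voiced = np.where(rms > threshold)[0]
        if voiced.size == 0:          # no frame above threshold: return unchanged
            return signal
        start = voiced[0] * frame_len
        end = (voiced[-1] + 1) * frame_len
        return signal[start:end]

    def band_features(signal, n_bands=13):
        """Step 2 (toy stand-in for MFCCs): mean log FFT magnitude in n_bands bands."""
        mag = np.abs(np.fft.rfft(signal))
        bands = np.array_split(mag, n_bands)
        return np.log(np.array([b.mean() for b in bands]) + 1e-10)

    def loo_nearest_centroid(X, y):
        """Step 3: leave-one-out cross-validation with a nearest-centroid
        classifier; returns the fraction of held-out samples classified correctly."""
        X, y = np.asarray(X), np.asarray(y)
        correct = 0
        for i in range(len(X)):
            mask = np.arange(len(X)) != i           # hold out sample i
            labels = np.unique(y[mask])
            centroids = [X[mask & (y == c)].mean(axis=0) for c in labels]
            dists = [np.linalg.norm(X[i] - c) for c in centroids]
            pred = labels[int(np.argmin(dists))]
            correct += (pred == y[i])
        return correct / len(X)
    ```

    A usage sketch with synthetic data (two "syllables" faked as tones at different pitches, padded with near-silence): crop each sample, extract features, then cross-validate:

    ```python
    rng = np.random.default_rng(0)
    sr = 8000

    def make(freq):
        t = np.arange(sr) / sr
        tone = 0.5 * np.sin(2 * np.pi * freq * t)
        sig = np.concatenate([np.zeros(2000), tone, np.zeros(2000)])
        return sig + 0.005 * rng.standard_normal(len(sig))

    X = [band_features(crop_silence(make(f))) for f in [200] * 4 + [800] * 4]
    y = ["ma"] * 4 + ["eh"] * 4
    print(loo_nearest_centroid(X, y))
    ```

    The same scaffold carries over directly to scikit-learn: replace `loo_nearest_centroid` with `cross_val_score` and any classifier, and replace `band_features` with proper MFCCs.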