python signal-processing voice-recognition phoneme

Convert sound to a list of phonemes in Python


How do I convert any sound signal to a list of phonemes?

That is, I am looking for the actual methodology and/or code to go from a digital signal to the list of phonemes that the sound recording is made up of.
For example:

lPhonemes = audio_to_phonemes(aSignal)

where, for example:

from scipy.io.wavfile import read
iSampleRate, aSignal = read(sRecordingDir)

# aSignal is a numpy array holding the recorded word 'hear'
# expected result: lPhonemes == ['HH', 'IY1', 'R']

I need the function audio_to_phonemes.

Not all sounds are words in a language, so I cannot simply use something built on the Google API, for example.

Edit
I don't want audio to words; I want audio to phonemes. Most libraries don't seem to output that. Any library you recommend needs to be able to output the ordered list of phonemes that the sound is made up of, and it needs to be usable from Python.

I would also love to know how the process of going from sound to phonemes works, if not for implementation purposes, then for interest's sake.


Solution

  • Accurate phoneme recognition is hard to achieve because phonemes themselves are loosely defined. Even on good audio, the best systems today have a phoneme error rate of about 18% (see the LSTM-RNN results on TIMIT published by Alex Graves).

    In CMUSphinx, phoneme recognition in Python is done like this:

    from os import path
    
    from pocketsphinx.pocketsphinx import *
    from sphinxbase.sphinxbase import *
    
    MODELDIR = "../../../model"
    DATADIR = "../../../test/data"
    
    # Create a decoder with certain model
    config = Decoder.default_config()
    config.set_string('-hmm', path.join(MODELDIR, 'en-us/en-us'))
    config.set_string('-allphone', path.join(MODELDIR, 'en-us/en-us-phone.lm.dmp'))
    config.set_float('-lw', 2.0)
    config.set_float('-beam', 1e-10)
    config.set_float('-pbeam', 1e-10)
    
    # Decode streaming data.
    decoder = Decoder(config)
    
    decoder.start_utt()
    # Feed the raw audio to the decoder in 1024-byte chunks
    with open(path.join(DATADIR, 'goforward.raw'), 'rb') as stream:
        while True:
            buf = stream.read(1024)
            if not buf:
                break
            decoder.process_raw(buf, False, False)
    decoder.end_utt()
    
    # In allphone mode, each decoded segment's "word" is a phoneme label
    print ('Best phonemes: ', [seg.word for seg in decoder.seg()])
    

    You need to check out the latest pocketsphinx from GitHub in order to run this example. The result should look like this (shown as a tuple because it was printed under Python 2):

      ('Best phonemes: ', ['SIL', 'G', 'OW', 'F', 'AO', 'R', 'W', 'ER', 'D', 'T', 'AE', 'N', 'NG', 'IY', 'IH', 'ZH', 'ER', 'Z', 'S', 'V', 'SIL'])
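
To put a number like the phoneme error rate mentioned above on a decoded list, one common approach is to take the edit distance between the recognized phonemes and a reference transcription, divided by the reference length. The following is a minimal sketch of that idea and not part of pocketsphinx; the function names `edit_distance` and `phoneme_error_rate` are made up for illustration, and dropping the 'SIL' (silence) label before scoring is an assumption about how you want to count errors.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences of phoneme labels."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # phoneme missing from hyp
                            curr[j - 1] + 1,      # extra phoneme in hyp
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def phoneme_error_rate(ref, hyp, ignore=("SIL",)):
    """Edit distance divided by reference length, skipping ignored labels."""
    ref = [p for p in ref if p not in ignore]
    hyp = [p for p in hyp if p not in ignore]
    if not ref:
        raise ValueError("reference must contain at least one phoneme")
    return edit_distance(ref, hyp) / len(ref)


if __name__ == "__main__":
    # Hypothetical reference and decoder output for the word 'hear'
    reference = ['HH', 'IY', 'R']
    decoded = ['SIL', 'HH', 'IH', 'R', 'SIL']
    print('PER:', phoneme_error_rate(reference, decoded))
```

Note that the decoder output above carries no stress digits ('IY' rather than 'IY1'), so a reference taken from a pronunciation dictionary would probably need its stress markers stripped before comparing.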
    

    See also the wiki page
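
The example above reads a raw file, while the question starts from a numpy array returned by scipy.io.wavfile.read. A rough sketch of bridging the two is to turn the array into 16-bit PCM byte chunks and pass each chunk to decoder.process_raw. The helper name `pcm16_chunks` is hypothetical, and the sketch assumes a mono signal that is either int16 already or floating point in the range [-1.0, 1.0]; other sample formats would need their own scaling.

```python
import numpy as np


def pcm16_chunks(signal, chunk_samples=512):
    """Yield little-endian 16-bit PCM byte chunks from a mono signal array."""
    signal = np.asarray(signal)
    if signal.ndim != 1:
        raise ValueError("expected a mono (1-D) signal")
    if np.issubdtype(signal.dtype, np.floating):
        # Assumption: float samples lie in [-1.0, 1.0]; clip, then scale
        signal = np.clip(signal, -1.0, 1.0) * 32767.0
    elif signal.dtype != np.int16:
        raise TypeError("only int16 or floating-point samples are handled here")
    pcm = signal.astype("<i2")
    for start in range(0, len(pcm), chunk_samples):
        yield pcm[start:start + chunk_samples].tobytes()
```

With a decoder configured as in the answer, the streaming loop would then look roughly like `for chunk in pcm16_chunks(aSignal): decoder.process_raw(chunk, False, False)`. The sample rate is not converted here, so the recording has to match what the acoustic model expects (16 kHz for the default en-us model) or be resampled first.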