
Split speech audio file on words in python

I feel like this is a fairly common problem, but I haven't yet found a suitable answer. I have many audio files of human speech that I would like to split on words. This can be done heuristically by looking at pauses in the waveform, but can anyone point me to a function or library in Python that does this automatically?


  • An easier way to do this is with the pydub module. Its recently added silence utilities do the heavy lifting, such as applying a silence threshold and a minimum silence length, and they simplify the code considerably compared with the other methods mentioned.

    Here is a demo implementation, with inspiration from here.


    I had an audio file, "a-z.wav", containing the spoken English letters from A to Z. I created a sub-directory named splitAudio in the current working directory. On running the demo code, the audio was split into 26 separate files, each storing one syllable.

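    Observations: Some of the syllables were cut off, which probably calls for adjusting the two split_on_silence parameters: min_silence_len (the minimum pause length, half a second in the demo) and silence_thresh (the loudness below which audio counts as silent, -16 dBFS in the demo).

    One may want to tune these to one's own requirements.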

    Demo Code:

    from pydub import AudioSegment
    from pydub.silence import split_on_silence

    sound_file = AudioSegment.from_wav("a-z.wav")
    audio_chunks = split_on_silence(
        sound_file,
        # must be silent for at least half a second
        min_silence_len=500,
        # consider it silent if quieter than -16 dBFS
        silence_thresh=-16,
    )

    # write each chunk to splitAudio/ (the directory must already exist)
    for i, chunk in enumerate(audio_chunks):
        out_file = ".//splitAudio//chunk{0}.wav".format(i)
        print("exporting", out_file)
        chunk.export(out_file, format="wav")


    exporting .//splitAudio//chunk0.wav
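
    On the cut-off syllables noted in the observations: one possible way to tune the parameters is to set the threshold relative to the clip's own average loudness (pydub exposes this as the dBFS attribute of an AudioSegment) and to keep a little padding around each chunk with the keep_silence argument of split_on_silence. The sketch below is only an illustration. The helper names relative_silence_thresh and split_words and the values 300 ms, 16 dB and 200 ms are assumptions, not recommendations from the original answer, so expect to adjust them per recording.

```python
def relative_silence_thresh(loudness_dbfs, margin_db=16.0):
    """Silence threshold placed margin_db below the clip's average loudness.

    Hypothetical helper; margin_db=16.0 is an assumed starting point.
    """
    return loudness_dbfs - margin_db


def split_words(wav_path, min_silence_len=300, margin_db=16.0, keep_silence=200):
    """Split a WAV file on pauses and return the list of pydub chunks.

    Hypothetical wrapper; all default values are assumptions to be tuned.
    """
    # pydub is imported here so the helper above stays usable without it
    from pydub import AudioSegment
    from pydub.silence import split_on_silence

    sound = AudioSegment.from_wav(wav_path)
    return split_on_silence(
        sound,
        # pauses shorter than this (in ms) do not split the audio
        min_silence_len=min_silence_len,
        # threshold follows the recording level instead of a fixed -16 dBFS
        silence_thresh=relative_silence_thresh(sound.dBFS, margin_db),
        # ms of surrounding silence kept on each chunk, to avoid clipped edges
        keep_silence=keep_silence,
    )
```

    For example, split_words("a-z.wav") would return chunks that can be exported exactly as in the demo loop. A lower min_silence_len splits on shorter pauses, and a larger keep_silence leaves more room around each word.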

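    For anyone curious about the pause heuristic mentioned in the question, here is a rough standard-library sketch of the idea: measure loudness in short windows and close a segment once enough consecutive windows fall below the threshold. It is not pydub's actual implementation. The function names, the 10 ms window and the assumption of 16-bit mono samples held in a plain list are all illustrative choices.

```python
import math


def rms_dbfs(samples, full_scale=32768):
    """Loudness of a window of 16-bit samples in dBFS (0 dBFS = full scale)."""
    if not samples:
        return float("-inf")
    mean_square = sum(s * s for s in samples) / len(samples)
    if mean_square == 0:
        return float("-inf")
    return 20 * math.log10(math.sqrt(mean_square) / full_scale)


def split_on_pauses(samples, frame_rate, min_silence_ms=500,
                    silence_thresh_db=-16.0, window_ms=10):
    """Return (start, end) sample indices of the non-silent stretches."""
    window = max(1, frame_rate * window_ms // 1000)
    min_silent_windows = max(1, min_silence_ms // window_ms)

    segments = []
    start = None        # start index of the segment being built
    last_loud_end = 0   # end index of the last loud window seen
    silent_run = 0      # consecutive silent windows so far

    for pos in range(0, len(samples), window):
        chunk = samples[pos:pos + window]
        if rms_dbfs(chunk) > silence_thresh_db:
            if start is None:
                start = pos
            last_loud_end = pos + len(chunk)
            silent_run = 0
        elif start is not None:
            silent_run += 1
            if silent_run >= min_silent_windows:
                # the pause is long enough: close the current segment
                segments.append((start, last_loud_end))
                start = None
    if start is not None:
        segments.append((start, last_loud_end))
    return segments


# Synthetic check at a 1 kHz sample rate: 200 ms tone, 600 ms pause, 200 ms tone
tone = [16000, -16000] * 100
pause = [0] * 600
signal = tone + pause + tone
print(split_on_pauses(signal, frame_rate=1000))  # [(0, 200), (800, 1000)]
```

    Real recordings would be read with the wave module and unpacked into integers first, and background noise usually means the threshold has to sit well below -16 dBFS.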