python multiprocessing pool gtts

How to run a multiprocessing pool using gtts in Python?


I'm trying to do bulk text-to-audio conversion without using cloud services like AWS Polly.

gtts gives good-quality text to speech but needs an internet connection to get results. Converting the strings one at a time, as in the code below, is very slow.

post_id and summary are two columns of a DataFrame holding the IDs and summaries of news articles, respectively. I'm running this from a Python file in Visual Studio Code on Windows 10.

from gtts import gTTS

for row_id, row_summary in zip(df.post_id, df.summary):
    tts = gTTS(row_summary, lang='en', tld='ca')
    # Raw string so the backslashes are not treated as escape sequences
    tts.save(r'.\summary_audio\gtts_summary_' + str(row_id) + '.mp3')

This works but takes half an hour for 100 summaries, which is slow.

I've tried using pooling like so:

from gtts import gTTS
from multiprocessing import Pool, get_context

def generate_audio(row_id, row_summary):
    tts = gTTS(row_summary, lang='en', tld='ca')
    # Raw string so the backslashes are not treated as escape sequences
    file_name = r'.\summary_audio\gtts_summary_' + str(row_id) + '.mp3'
    tts.save(file_name)
    return None

pool_input = list(zip(df.post_id, df.summary))
with get_context("spawn").Pool() as p:
    p.starmap(generate_audio, pool_input)

But I end up getting this error:

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "C:\Users\Admin\AppData\Local\Programs\Python\Python38\lib\multiprocessing\spawn.py", line 116, in spawn_main
    exitcode = _main(fd, parent_sentinel)
  File "C:\Users\Admin\AppData\Local\Programs\Python\Python38\lib\multiprocessing\spawn.py", line 125, in _main
    prepare(preparation_data)
  File "C:\Users\Admin\AppData\Local\Programs\Python\Python38\lib\multiprocessing\spawn.py", line 236, in prepare
    _fixup_main_from_path(data['init_main_from_path'])
  File "C:\Users\Admin\AppData\Local\Programs\Python\Python38\lib\multiprocessing\spawn.py", line 287, in _fixup_main_from_path
    main_content = runpy.run_path(main_path, 
  File "C:\Users\Admin\AppData\Local\Programs\Python\Python38\lib\runpy.py", line 265, in run_path
    return _run_module_code(code, init_globals, run_name,
  File "C:\Users\Admin\AppData\Local\Programs\Python\Python38\lib\runpy.py", line 97, in _run_module_code
    _run_code(code, mod_globals, init_globals,
  File "C:\Users\Admin\AppData\Local\Programs\Python\Python38\lib\runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "c:\Users\Admin\Desktop\Projects\NLP_Cron\cron_audio.py", line 100, in <module>
    [os.remove(f) for f in os.listdir() if f.endswith(".mp3")]
  File "c:\Users\Admin\Desktop\Projects\NLP_Cron\cron_audio.py", line 100, in <listcomp>
    [os.remove(f) for f in os.listdir() if f.endswith(".mp3")]
PermissionError: [WinError 32] The process cannot access the file because it is being used by another process: 'gtts_summary_4a063dea-9469-4633-87bc-b4a2f2e3f1ba.mp3'

Does this mean it won't run in a pool unless the developer of the gtts library enables it? Or am I simply doing something wrong here?

Edit: Also before doing this, I'm creating a folder called summary_audio to save the files to if it doesn't exist.


Solution

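  • I'm answering my own question because I realized I was running the pool inside a function. The pool statement has to run in the main section, under if __name__ == '__main__':, and only then does it work as intended. On Windows, multiprocessing starts each worker with spawn, which re-imports the main script, so any module-level code outside that guard runs again in every worker. That is what the traceback shows: the os.remove cleanup on line 100 of cron_audio.py was executed again while a worker was starting up. Here is a full example.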

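    import os
    from multiprocessing import Pool

    import pandas as pd
    from gtts import gTTS

    # Both columns need the same length, otherwise zip() silently drops rows
    df_col1 = [1, 3, 4, 5, 6, 7, 8]
    df_col2 = ["Hi", "Bye", "Why?", "Cry", "Die", "Pie", "Shy"]

    df = pd.DataFrame(zip(df_col1, df_col2), columns=['post_id', 'summary'])

    def generate_audio(row_id, row_summary):
        tts = gTTS(row_summary, lang='en', tld='ca')
        # Raw string so the backslashes are not treated as escape sequences
        file_name = r'.\summary_audio\gtts_summary_' + str(row_id) + '.mp3'
        tts.save(file_name)

    if __name__ == '__main__':
        # Code with side effects stays under this guard: each worker
        # re-imports this file on Windows, and guarded code does not run there.
        os.makedirs('summary_audio', exist_ok=True)
        pool_input = list(zip(df.post_id, df.summary))
        with Pool(3) as p:
            # starmap blocks until every task is done, so no join() is needed
            p.starmap(generate_audio, pool_input)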
    

    Also, it doesn't really make sense to use gtts with a large pool in the first place: every call is a network request, so firing many of them in parallel will get you rate limited.
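
Given the rate limiting, a gentler variant might be worth trying: a couple of threads instead of processes (the work is network-bound, not CPU-bound) plus a retry with backoff when a request fails. This is only a minimal sketch, not something I have benchmarked against gtts. The names gtts_synthesize, with_retries and convert_all are made up for illustration, and the worker count, retry count and delays are assumptions that would need tuning.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def gtts_synthesize(text, path):
    """Save `text` as an MP3 at `path` using gTTS (needs network access)."""
    from gtts import gTTS  # imported lazily so the helpers below work without gtts
    gTTS(text, lang='en', tld='ca').save(str(path))


def with_retries(func, retries=3, base_delay=1.0):
    """Wrap `func` so a failed call is retried with exponential backoff."""
    def wrapper(*args):
        for attempt in range(retries):
            try:
                return func(*args)
            except Exception:
                if attempt == retries - 1:
                    raise  # out of attempts: let the caller see the error
                time.sleep(base_delay * 2 ** attempt)
    return wrapper


def convert_all(rows, synthesize=gtts_synthesize, out_dir='summary_audio',
                max_workers=2, retries=3, base_delay=1.0):
    """Convert (row_id, text) pairs to audio files using a few threads.

    Returns a dict mapping row_id to the exception for every row that failed.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    job = with_retries(synthesize, retries, base_delay)
    failures = {}
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        futures = {
            executor.submit(job, text, out / f'gtts_summary_{row_id}.mp3'): row_id
            for row_id, text in rows
        }
        for future, row_id in futures.items():
            error = future.exception()  # blocks until this job finishes
            if error is not None:
                failures[row_id] = error
    return failures
```

With the DataFrame from the question this would be called as failures = convert_all(zip(df.post_id, df.summary)). Because threads live inside the same process, nothing gets re-imported, so the PermissionError from the question should not come up here, although keeping the if __name__ == '__main__': guard is still a good habit.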