I'm trying to do bulk text-to-audio conversion without using cloud services like AWS Polly.
gtts gives good-quality text to speech but requires an internet connection to get results. Running TTS on individual strings, for example with the code below, is very slow.
post_id and summary are two columns of a dataframe that hold the ids and summaries of news articles, respectively. I'm running this as a Python file in Visual Studio Code on Windows 10.
from gtts import gTTS
for row_id, row_summary in zip(df.post_id, df.summary):
    tts = gTTS(row_summary, lang='en', tld='ca')
    tts.save('.\summary_audio\gtts_summary_'+str(row_id)+'.mp3')
This works but takes half an hour for 100 summaries, which is slow.
I've tried using pooling like so:
from gtts import gTTS
from multiprocessing import Pool, get_context
def generate_audio(row_id, row_summary):
    tts = gTTS(row_summary, lang='en', tld='ca')
    file_name = '.\summary_audio\gtts_summary_'+str(row_id)+'.mp3'
    tts.save(file_name)
    return None
pool_input = list(zip(df.post_id, df.summary))
with get_context("spawn").Pool() as p:
    p.starmap(generate_audio, pool_input)
But I end up getting this error:
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "C:\Users\Admin\AppData\Local\Programs\Python\Python38\lib\multiprocessing\spawn.py", line 116, in spawn_main
    exitcode = _main(fd, parent_sentinel)
  File "C:\Users\Admin\AppData\Local\Programs\Python\Python38\lib\multiprocessing\spawn.py", line 125, in _main
    prepare(preparation_data)
  File "C:\Users\Admin\AppData\Local\Programs\Python\Python38\lib\multiprocessing\spawn.py", line 236, in prepare
    _fixup_main_from_path(data['init_main_from_path'])
  File "C:\Users\Admin\AppData\Local\Programs\Python\Python38\lib\multiprocessing\spawn.py", line 287, in _fixup_main_from_path
    main_content = runpy.run_path(main_path,
  File "C:\Users\Admin\AppData\Local\Programs\Python\Python38\lib\runpy.py", line 265, in run_path
    return _run_module_code(code, init_globals, run_name,
  File "C:\Users\Admin\AppData\Local\Programs\Python\Python38\lib\runpy.py", line 97, in _run_module_code
    _run_code(code, mod_globals, init_globals,
  File "C:\Users\Admin\AppData\Local\Programs\Python\Python38\lib\runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "c:\Users\Admin\Desktop\Projects\NLP_Cron\cron_audio.py", line 100, in <module>
    [os.remove(f) for f in os.listdir() if f.endswith(".mp3")]
  File "c:\Users\Admin\Desktop\Projects\NLP_Cron\cron_audio.py", line 100, in <listcomp>
    [os.remove(f) for f in os.listdir() if f.endswith(".mp3")]
PermissionError: [WinError 32] The process cannot access the file because it is being used by another process: 'gtts_summary_4a063dea-9469-4633-87bc-b4a2f2e3f1ba.mp3'
Does this mean it won't run in a pool unless the developer of the gtts library enables it? Or am I simply doing something wrong here?
Edit: Before doing this, I also create a folder called summary_audio, if it doesn't already exist, to save the files to.
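I'm answering my own question because I realized I was running the pool inside a function. The pool statement has to run in the main section, under the if __name__ == '__main__' guard; only then does it work as intended. On Windows, multiprocessing starts workers with spawn, which re-imports the script in every worker, so any top-level code outside that guard (such as the os.remove cleanup on line 100 of cron_audio.py in the traceback above) runs again in each worker. Here is a full example.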
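import os
from multiprocessing import Pool
import pandas as pd
from gtts import gTTS
df_col1 = [1,3,4,5,6,7]
df_col2 = ["Hi", "Bye", "Why?", "Cry", "Die", "Pie", "Shy"]
# zip stops at the shorter list, so "Shy" is dropped.
df = pd.DataFrame(zip(df_col1, df_col2), columns = ['post_id', 'summary'])
def generate_audio(row_id, row_summary):
    # Each call sends one request to the TTS service and writes one mp3 file.
    tts = gTTS(row_summary, lang='en', tld='ca')
    file_name = os.path.join('summary_audio', 'gtts_summary_'+str(row_id)+'.mp3')
    tts.save(file_name)
if __name__ == '__main__':
    # Create the output folder once, in the parent process only.
    os.makedirs('summary_audio', exist_ok=True)
    pool_input = list(zip(df.post_id, df.summary))
    with Pool(3) as p:
        # starmap blocks until every task has finished, so no join() is needed.
        p.starmap(generate_audio, pool_input)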
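The traceback in the question points at a module-level cleanup line in cron_audio.py, not at gtts itself. One way to keep that cleanup from running in the workers is to wrap it in a function and call it only under the guard. This is a minimal sketch, not the original script: remove_old_mp3s is a hypothetical helper name, and the temporary folder and file names exist only for the demo.

```python
import os
import tempfile


def remove_old_mp3s(folder):
    """Delete every .mp3 file in `folder` and return the names removed.

    Hypothetical helper: it wraps the cleanup line from the traceback in a
    function so it only runs when called explicitly.
    """
    removed = []
    for name in os.listdir(folder):
        if name.endswith(".mp3"):
            os.remove(os.path.join(folder, name))
            removed.append(name)
    return sorted(removed)


if __name__ == "__main__":
    # Runs once, in the parent process only. Workers started with "spawn"
    # import this file under a different module name, so they skip this block.
    with tempfile.TemporaryDirectory() as demo_dir:
        for name in ("a.mp3", "b.mp3", "notes.txt"):
            open(os.path.join(demo_dir, name), "w").close()
        print(remove_old_mp3s(demo_dir))  # ['a.mp3', 'b.mp3']
        print(os.listdir(demo_dir))       # ['notes.txt']
```

In the real script the call would sit next to the Pool block under the same guard, so the parent cleans up once before any worker starts writing files.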
Also, it doesn't make much sense to use gtts with pooling in the first place, because you will get rate limited.
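If some parallelism is still wanted, one option is a small thread pool with a low worker count plus retries with backoff, since each request mostly waits on the network. The sketch below rests on assumptions: synthesize stands in for whatever function performs the request (for example a wrapper around gTTS(...).save(...)), and with_retry, run_throttled, max_workers, retries, and base_delay are illustrative names and values, not limits documented by gtts.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def with_retry(func, retries=3, base_delay=1.0, sleep=time.sleep):
    """Wrap `func` so failed calls are retried with exponential backoff."""
    def wrapper(*args):
        for attempt in range(retries + 1):
            try:
                return func(*args)
            except Exception:
                # Give up after the last allowed attempt.
                if attempt == retries:
                    raise
                sleep(base_delay * 2 ** attempt)
    return wrapper


def run_throttled(synthesize, rows, max_workers=2):
    """Call synthesize(row_id, text) for each row using a few threads.

    Results come back in the same order as `rows`.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(synthesize, row_id, text) for row_id, text in rows]
        return [f.result() for f in futures]


if __name__ == "__main__":
    calls = {"n": 0}

    def fake_synthesize(row_id, text):
        # Stand-in for a real request; fails once to exercise the retry path.
        calls["n"] += 1
        if calls["n"] == 1:
            raise RuntimeError("simulated rate limit")
        return f"gtts_summary_{row_id}.mp3"

    safe = with_retry(fake_synthesize, base_delay=0.01)
    print(run_throttled(safe, [(1, "Hi"), (3, "Bye")], max_workers=1))
    # ['gtts_summary_1.mp3', 'gtts_summary_3.mp3']
```

A real version would probably catch only the library's own error type instead of Exception, and whether any worker count avoids rate limiting would need to be checked by experiment.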