Tags: dataset, speech-recognition, mozilla-deepspeech

Forced alignment using Aeneas with multiple aeneas text files


We have started a project to create a Turkish speech recognition dataset to use with DeepSpeech.

We have finished preprocessing the e-book, but we could not complete the forced alignment with Aeneas.

According to the aeneas tutorials, forced alignment needs a text file and the matching audio recording. While preprocessing the e-book, we created 430 text files, edited and cleaned into the aeneas input format (split into paragraphs and sentences with the nltk library).

However, after processing our task object and writing its output (a JSON file) for each text file, we could not merge the output files: for every aeneas text file, the alignment starts from the beginning of the audio file.

It seems we would need to split the audio file into 430 parts, but that is not an easy process.

I tried to merge the JSON files with:

import json
import glob

result = []
for f in glob.glob("*.json"):
  with open(f, "rb") as infile:
    result.append(json.load(infile))
with open("merged_file.json", "w") as outfile:
  json.dump(result, outfile)

But it didn't work, because during forced alignment aeneas starts from the beginning of the audio file for each aeneas text file.

Is it possible to create a task object that includes all 430 aeneas text files and appends their results to a single output file (JSON), with timings (in seconds) relative to the one audio file?
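To illustrate why plain concatenation of the JSON files is not enough: a merged sync map is only meaningful if every fragment's begin and end times are relative to the full recording. The sketch below assumes a situation we do not actually have, namely that the audio has been split into parts with known start offsets and that each part was aligned on its own. The function name `merge_sync_maps`, the offsets, and the sample fragments are hypothetical, and the `fragments` / `begin` / `end` layout is assumed to match the JSON that aeneas writes.

```python
import json
from decimal import Decimal


def merge_sync_maps(sync_maps, offsets):
    """Merge aeneas-style sync maps, shifting each one by its audio offset in seconds."""
    merged = []
    for sync_map, offset in zip(sync_maps, offsets):
        shift = Decimal(str(offset))
        for fragment in sync_map["fragments"]:
            shifted = dict(fragment)  # shallow copy, the input stays untouched
            shifted["begin"] = "{:.3f}".format(Decimal(fragment["begin"]) + shift)
            shifted["end"] = "{:.3f}".format(Decimal(fragment["end"]) + shift)
            # Renumber ids so they stay unique across parts.
            shifted["id"] = "f{:06d}".format(len(merged) + 1)
            merged.append(shifted)
    return {"fragments": merged}


# Hypothetical data: two parts, the second one starting 12.5 s into the recording.
part_1 = {"fragments": [{"id": "f000001", "begin": "0.000", "end": "4.200",
                         "lines": ["Birinci cümle."]}]}
part_2 = {"fragments": [{"id": "f000001", "begin": "0.000", "end": "3.100",
                         "lines": ["İkinci cümle."]}]}

merged = merge_sync_maps([part_1, part_2], offsets=[0, 12.5])
print(json.dumps(merged, ensure_ascii=False, indent=1))
```

Without such offsets (our case, where every text file was aligned against the whole audio file), there is nothing correct to shift by, so merging the JSON files cannot repair the timings.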

Our task object:

from aeneas.task import Task

# create Task object
config_string = "task_language=tur|is_text_type=plain|os_task_file_format=json"
task = Task(config_string=config_string)
# one audio file for the whole book, one text file per aeneas input part
task.audio_file_path_absolute = "/content/gdrive/My Drive/TASR/kitaplar/nutuk/Nutuk_sesli.mp3"
task.text_file_path_absolute = "/content/gdrive/My Drive/TASR/kitaplar/nutuk/nutuk_aeneas_data_1.txt"
task.sync_map_file_path_absolute = "/content/gdrive/My Drive/TASR/kitaplar/nutuk/syncmap.json"
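For completeness, here is a minimal sketch of how such a task is executed, following the basic usage shown in the aeneas documentation (`ExecuteTask(task).execute()` followed by `task.output_sync_map_file()`). The helper names `make_config` and `run_alignment` are ours and purely illustrative; aeneas is imported inside the function so that the snippet still loads where the library is not installed.

```python
def make_config(language="tur", text_type="plain", output_format="json"):
    """Build an aeneas task configuration string from its three basic settings."""
    return "task_language={}|is_text_type={}|os_task_file_format={}".format(
        language, text_type, output_format
    )


def run_alignment(audio_path, text_path, sync_map_path, config_string):
    """Align one text file against one audio file and write the sync map."""
    # Imported lazily: requires the aeneas package and its dependencies.
    from aeneas.executetask import ExecuteTask
    from aeneas.task import Task

    task = Task(config_string=config_string)
    task.audio_file_path_absolute = audio_path
    task.text_file_path_absolute = text_path
    task.sync_map_file_path_absolute = sync_map_path

    ExecuteTask(task).execute()   # compute the alignment
    task.output_sync_map_file()   # write the result to sync_map_path
    return task


config_string = make_config()
print(config_string)
```

Each call aligns its text against the entire audio file passed in, which is why 430 separate calls with the same audio all start at time zero.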

By the way, we are working on Google Colab with Python 3.


Solution

  • I worked out a solution to my own question.

    Instead of combining the JSON files, I combined the aeneas text files with this code:

    # book_name, chapter and count are defined elsewhere (not shown here).
    base = "/content/gdrive/My Drive/TASR/kitaplar/{0}/{1}/{2}_aeneas_data_".format(book_name, chapter, book_name)
    with open(base + "all.txt", "wb") as outfile:
        # range(1, count - 1) covers files 1 .. count - 2, as in the original loop.
        for i in range(1, count - 1):
            file_name = base + "{0}.txt".format(i)
            # print(file_name)
            with open(file_name, "rb") as infile:
                outfile.write(infile.read())
    

    After combining the aeneas text files, I can run the alignment once against the single audio file and get one JSON file that contains all paragraphs.
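A slightly more defensive variant of the same idea, as a sketch only: it collects the numbered part files with `glob`, sorts them by their numeric suffix (so part 10 comes after part 2, not after part 1), and makes sure the last line of one part does not run into the first line of the next. The function name `combine_text_files` and the demo file names are hypothetical; the `_<number>.txt` naming is assumed from the paths above.

```python
import glob
import os
import re
import tempfile


def combine_text_files(pattern, out_path):
    """Concatenate files matching pattern in numeric order of their _<n>.txt suffix."""
    numbered = []
    for path in glob.glob(pattern):
        match = re.search(r"_(\d+)\.txt$", path)
        if match:  # skips e.g. a previously written ..._all.txt
            numbered.append((int(match.group(1)), path))
    numbered.sort()

    with open(out_path, "wb") as outfile:
        for _, path in numbered:
            with open(path, "rb") as infile:
                data = infile.read()
            outfile.write(data)
            # Keep parts on separate lines if a file lacks a trailing newline.
            if data and not data.endswith(b"\n"):
                outfile.write(b"\n")
    return [path for _, path in numbered]


# Demo on throwaway files with names mirroring the ones used above.
with tempfile.TemporaryDirectory() as tmp:
    for i, text in [(1, b"bir\n"), (2, b"iki"), (10, b"on\n")]:
        with open(os.path.join(tmp, "nutuk_aeneas_data_{}.txt".format(i)), "wb") as f:
            f.write(text)
    out_path = os.path.join(tmp, "nutuk_aeneas_data_all.txt")
    used = combine_text_files(os.path.join(tmp, "nutuk_aeneas_data_*.txt"), out_path)
    with open(out_path, "rb") as f:
        combined = f.read()

print(combined.decode("utf-8"))
```

The combined file can then be passed as the single text file of one aeneas task, together with the full audio file.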