python

How do I take raw text and format it as an .srt subtitle file?


I intend to build an audio recognizer and turn its output into subtitles, but I need to format that output as a .srt subtitle file.

For example, I have this code that tries to recognize .mp4 audio previously converted to .wav format:

import speech_recognition as sr

r = sr.Recognizer()
intro = sr.AudioFile('intro.wav')

with intro as source:
    r.adjust_for_ambient_noise(source)
    audio = r.record(source)

result = r.recognize_google(audio)
print(result)

Result returned from audio recognition:

this is chapter 1 of the flask Mega tutorial welcome before we begin I want to spend a couple of minutes showing you how you can work with the core repository on GitHub while you do this tutorial...

What I still don't know is how to get the pause points and use them to turn this text into a subtitle file like this:

1
00:00:01,510 --> 00:00:05,860
This is Chapter 1 of the flask make a tutorial welcome.

2
00:00:05,860 --> 00:00:11,270
Before we begin I want to spend a couple of minutes showing you how you can work.

Solution

  • EDIT: Amended response in light of new context from the question.

    Hi Oliveria, with the Python library you're using, what you're asking is kind of involved. As I've been asked to not provide a full implementation, I won't, but here's where you need to look.

    The library defines a layer of abstraction over the actual Google API for speech recognition. Although the API serves a JSON file upon request that does contain the timestamp data for each word, the recognize_google method, by default, discards this information and only returns the transcript.

    The full signature of the method:
    def recognize_google(self, audio_data, key=None, language="en-US", pfilter=0, show_all=False, with_confidence=False):

    Happily, you can set the "show_all" flag to retrieve the full JSON as a dict:

    result = r.recognize_google(audio_data = audio, show_all = True)

    That being said, note that you now face a new challenge: you have to extract the transcript from the dict, and somehow compose timecodes from the individual words.
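A minimal sketch of the first half of that challenge, pulling the transcript out of the dict. The layout of the sample response (an `alternative` list whose entries carry `transcript` and, optionally, `confidence`) is an assumption about what the library returns; print your own `result` to confirm it before relying on this. `best_transcript` is a hypothetical helper name.

```python
from typing import Any

def best_transcript(response: Any) -> str:
    """Return the highest-confidence transcript, or "" if there is none."""
    # Assumption: a clip with no recognized speech yields an empty,
    # non-dict result rather than a dict.
    if not isinstance(response, dict):
        return ""
    alternatives = response.get("alternative", [])
    if not alternatives:
        return ""
    # Entries without a confidence score are treated as confidence 0.
    best = max(alternatives, key=lambda alt: alt.get("confidence", 0.0))
    return best.get("transcript", "")

# Hypothetical sample, shaped like the assumed response.
sample = {
    "alternative": [
        {"transcript": "this is chapter 1 of the flask Mega tutorial", "confidence": 0.93},
        {"transcript": "this is chapter one of the flask mega tutorial"},
    ],
    "final": True,
}
print(best_transcript(sample))
```

With your own audio you would call `best_transcript(result)` on the value returned by `recognize_google(..., show_all=True)`.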

    It appears to me that this is not the right tool for the job...

    But if you wish to pursue it anyway, some of the code written below (in the previous iteration of this answer) can be repurposed to accomplish part of that challenge.
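For the second half, composing timecodes from individual words, here is one possible sketch. It assumes you have already obtained per-word timings as `(word, start_ms, end_ms)` tuples from whatever service you end up using. That input shape, the helper names, and the 500 ms pause threshold are all illustrative assumptions, not something `speech_recognition` hands you.

```python
from typing import List, Tuple

Word = Tuple[str, int, int]   # (word, start in ms, end in ms): assumed input shape
Cue = Tuple[int, int, str]    # (start in ms, end in ms, text)

def group_words_into_cues(words: List[Word], max_pause_ms: int = 500,
                          max_words: int = 12) -> List[Cue]:
    """Start a new cue whenever the gap before a word exceeds max_pause_ms
    or the current cue already holds max_words words."""
    cues: List[Cue] = []
    current: List[Word] = []
    for word in words:
        if current:
            pause = word[1] - current[-1][2]
            if pause > max_pause_ms or len(current) >= max_words:
                cues.append((current[0][1], current[-1][2],
                             " ".join(w[0] for w in current)))
                current = []
        current.append(word)
    if current:  # flush the last cue
        cues.append((current[0][1], current[-1][2],
                     " ".join(w[0] for w in current)))
    return cues

def millis_to_timecode(millis: int) -> str:
    """Format milliseconds as an SRT timecode, hh:mm:ss,mmm."""
    seconds, ms = divmod(millis, 1000)
    minutes, seconds = divmod(seconds, 60)
    hours, minutes = divmod(minutes, 60)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d},{ms:03d}"

def cues_to_srt(cues: List[Cue]) -> str:
    """Render cues as SRT text, numbering entries from 1."""
    return "".join(
        f"{i}\n{millis_to_timecode(start)} --> {millis_to_timecode(end)}\n{text}\n\n"
        for i, (start, end, text) in enumerate(cues, start=1)
    )

# Made-up timings, for illustration only.
words = [("this", 1510, 1700), ("is", 1720, 1800), ("chapter", 1820, 2300),
         ("one", 2320, 2600), ("before", 3400, 3800), ("we", 3820, 3900),
         ("begin", 3920, 4400)]
cues = group_words_into_cues(words)
srt_text = cues_to_srt(cues)
print(srt_text)
```

Splitting on pauses tends to follow sentence boundaries better than splitting on a fixed word count; the `max_words` cap is only there so a long stretch of uninterrupted speech does not end up as one enormous subtitle.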

    Original answer.

    If you could provide a sample of the source text you are trying to format, it would be easier to provide a proper solution.

    However, here's an explanation of an approach you could take.

    The problem

    Piece by piece, the program you're asking for should format the raw text into numbered entries like these:

    1
    00:02:16,612 --> 00:02:19,376
    some text
    
    2
    00:02:19,482 --> 00:02:21,609
    some other text
    
    etc....

    and write them to a .srt file.

    The tools

    Luckily for you, all of those can be accomplished in plain Python without much hassle. To keep it simple, I'll offer a procedural, function-based approach.

    What we need is a function that takes the text (str) for each section of the subtitles, along with its time frame, and appends the formatted subs to a file. We can achieve this with Python's built-in context managers, some for loops, and string formatting.

    As it stands, we need some way of knowing which "raw text" entry corresponds to which section of the srt file. Some encapsulation would probably prove useful, but for a simple program a single shared variable will do, as the shared state is minimal.

    As no further info was provided, I'll assume you can create an ordered list of tuples that contain: the "raw text" you wish to format, the initial timecode, and the duration in milliseconds for each section.
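For illustration, such a list could look like the sketch below. The two entries reuse the sample subtitles shown earlier in this answer, with durations derived from their start and end timecodes. `TIMECODE_PATTERN` is a hypothetical helper for sanity-checking the timecode strings before formatting anything.

```python
import re
from typing import List, Tuple

# (raw text, initial timecode "hh:mm:ss,mmm", duration in milliseconds)
RAW_TEXT_LIST: List[Tuple[str, str, int]] = [
    ("some text", "00:02:16,612", 2764),        # ends at 00:02:19,376
    ("some other text", "00:02:19,482", 2127),  # ends at 00:02:21,609
]

# Hypothetical check: two digits each for hours, minutes, seconds, then ",mmm".
TIMECODE_PATTERN = re.compile(r"^\d{2}:\d{2}:\d{2},\d{3}$")

for text, initial_tc, duration in RAW_TEXT_LIST:
    if not TIMECODE_PATTERN.match(initial_tc):
        raise ValueError(f"badly formed timecode: {initial_tc!r}")
    if duration <= 0:
        raise ValueError(f"duration must be positive: {duration}")
```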

    Piecing together a solution

    import os
    from typing import List, Tuple

    def main():
        # Each entry: (raw text, initial timecode "hh:mm:ss,mmm", duration in milliseconds).
        # Assign each section of your text here; the entry below is only a placeholder.
        RAW_TEXT_LIST : List[Tuple[str,str,int]] = [
            ("some text", "00:02:16,612", 2764),
        ]
        NAME_OF_FILE : str = "THE_NAME_OF_THE_SUB_FILE_TO_WRITE" #change this
    
        currentSection : int = 0 # shared counter for the section numbers
    
        def convertMillisToTc(millis: int) -> str:
            #utility function to convert milliseconds to timecode hh:mm:ss,mmm
            seconds, milliseconds = divmod(millis, 1000)
            minutes, seconds = divmod(seconds, 60)
            hours, minutes = divmod(minutes, 60)
            return f"{hours:02d}:{minutes:02d}:{seconds:02d},{milliseconds:03d}"
    
        def convertTcToMillis(timeCode: str) -> int:
            #utility function to convert timecode hh:mm:ss,mmm to milliseconds
            hours, minutes, rest = timeCode.split(":")
            seconds, milliseconds = rest.split(",")
            return 3600000 * int(hours) + 60000 * int(minutes) + 1000 * int(seconds) + int(milliseconds)
    
        def makeSubRipStr(rawText : str, initialTimeCode: str, durationInMilliseconds : int ) -> str:
            nonlocal currentSection # the counter lives in main(), not in this function
            currentSection += 1 # section numbers start at 1
    
            finalTimeCode : str = convertMillisToTc(convertTcToMillis(initialTimeCode) + durationInMilliseconds)
            formattedText : str = f'{currentSection}\n{initialTimeCode} --> {finalTimeCode}\n{rawText}\n\n'
            return formattedText
    
    
        #make sure the output directory exists
        os.makedirs("./subfiles", exist_ok=True)
    
        #open the file in write mode (creating or truncating it) and add each formatted entry
        with open(file=f"./subfiles/{NAME_OF_FILE}.srt",mode="w",encoding="utf-8") as subFile:
            for sourceTuple in RAW_TEXT_LIST:
                text, initialTC, duration = sourceTuple
                subFile.write(makeSubRipStr(text,initialTC,duration))
    
    
    if __name__ == '__main__':
        main()