How to merge multiple subtitle files in Python?

I have a batch of .srt files that I want to merge into one.

sub1.srt

1
00:00:21,601 --> 00:00:24,130
- What happened? - It's a mess, I heard.

2
00:00:24,131 --> 00:00:25,900
- What's that? - Dead bodies?

3
00:00:25,901 --> 00:00:28,839
- What's going on? - I wish I knew.

sub2.srt

1
00:00:28,840 --> 00:00:31,310
No one knows. They won't say.

2
00:00:31,311 --> 00:00:35,276
- My gosh. - How can so many die?

3
00:00:45,191 --> 00:00:46,556
When you starve,

Expected result after merging:

1
00:00:21,601 --> 00:00:24,130
- What happened? - It's a mess, I heard.

2
00:00:24,131 --> 00:00:25,900
- What's that? - Dead bodies?

3
00:00:25,901 --> 00:00:28,839
- What's going on? - I wish I knew.

4
00:00:28,840 --> 00:00:31,310
No one knows. They won't say.

5
00:00:31,311 --> 00:00:35,276
- My gosh. - How can so many die?

6
00:00:45,191 --> 00:00:46,556
When you starve,

I found this script, and it works, but the problem is in the numbering: the subtitle numbers are not in order.

filenames = ['sub1.srt', 'sub2.srt']
with open('output_file.srt', 'w') as outfile:
    for fname in filenames:
        with open(fname) as infile:
            for line in infile:
                outfile.write(line)

The subtitle numbers in the merged file restart at 1 for each input file instead of running consecutively.

How can I fix this?

Solution

You could try the following:

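import re

# A line that contains only digits (optionally padded with spaces or tabs)
# is treated as a subtitle number.
re_sub_no = re.compile(r"^[ \t]*\d+[ \t]*$", re.MULTILINE)

sub_no = 0  # running subtitle number across all files

def repl(match):
    # Replace each matched number with the next number in the sequence.
    global sub_no
    sub_no += 1
    return str(sub_no)

filenames = ["sub1.srt", "sub2.srt"]
with open("sub_merged.srt", "w") as fout:
    for name in filenames:
        with open(name, "r") as fin:
            # The trailing blank line separates the last subtitle of one file
            # from the first subtitle of the next.
            fout.write(re_sub_no.sub(repl, fin.read()) + "\n\n")

The regex pattern re_sub_no matches the subtitle numbers, and re.sub calls repl for each match, so the numbering stays consecutive across files. The sub_no variable is global because repl has no other way to keep the current number between calls. The [ \t]* parts of the pattern are only a precaution against spaces or tabs before or after the number; you may not need them (try it without). They deliberately exclude newlines: \s* would also match line breaks, so with re.MULTILINE a match could swallow the blank line in front of a number.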
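
If you would rather avoid the global, one option is to keep the counter in an iterator that the replacement function closes over. The following is a minimal sketch under that assumption; merge_texts is an illustrative name, and it works on strings so it can be tried without files. The pattern only allows spaces and tabs around the number, so a match cannot take a neighbouring blank line with it:

```python
import itertools
import re

# A line consisting only of digits is treated as a subtitle number.
RE_SUB_NO = re.compile(r"^[ \t]*\d+[ \t]*$", re.MULTILINE)

def merge_texts(texts):
    """Concatenate SRT contents and renumber the subtitles from 1."""
    counter = itertools.count(1)  # holds the numbering state instead of a global

    def repl(match):
        return str(next(counter))

    # Strip surrounding blank lines so files are joined by exactly one empty line.
    parts = [RE_SUB_NO.sub(repl, text.strip()) for text in texts]
    return "\n\n".join(parts) + "\n"

sub1 = (
    "1\n00:00:21,601 --> 00:00:24,130\n- What happened?\n\n"
    "2\n00:00:24,131 --> 00:00:25,900\n- What's that?\n"
)
sub2 = "1\n00:00:28,840 --> 00:00:31,310\nNo one knows.\n"
merged = merge_texts([sub1, sub2])
print(merged)
```

To use it with files, read each file into a string first and write the returned text to the output file.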

The code above doesn't address the encoding of the files. If you run the following on the files that you've linked to (chardet is a third-party package)

import chardet

for name in ["sub1.srt", "sub2.srt"]:
    with open(name, "rb") as file:
        print(chardet.detect(file.read()))

the result will, most likely, be

{'encoding': 'UTF-8-SIG', 'confidence': 1.0, 'language': ''}
{'encoding': 'UTF-8-SIG', 'confidence': 1.0, 'language': ''}

So, the following modification should hopefully work:

...
with open("sub_merged.srt", "w", encoding="utf-8-sig") as fout:
    for name in filenames:
        with open(name, "r", encoding="utf-8-sig") as fin:
            fout.write(re_sub_no.sub(repl, fin.read()) + "\n\n")
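
To see why the encoding matters for the renumbering, here is a small sketch that uses in-memory bytes instead of the real files. Decoding with plain "utf-8" keeps the byte order mark as the character U+FEFF in front of the first number, so the pattern does not match that line, while "utf-8-sig" removes it. The sample bytes are made up for illustration:

```python
import codecs
import re

re_sub_no = re.compile(r"^\d+[ \t]*$", re.MULTILINE)

# A tiny SRT file saved as UTF-8 with a byte order mark (BOM).
raw = codecs.BOM_UTF8 + b"1\n00:00:21,601 --> 00:00:24,130\n- What happened?\n"

as_utf8 = raw.decode("utf-8")          # BOM survives as "\ufeff"
as_utf8_sig = raw.decode("utf-8-sig")  # BOM is stripped

matches_utf8 = re_sub_no.findall(as_utf8)
matches_utf8_sig = re_sub_no.findall(as_utf8_sig)
print(matches_utf8)      # []
print(matches_utf8_sig)  # ['1']
```

With the wrong decoding, the first subtitle of each file would keep its old number and the counter would be off by one from there on.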

For future use, you could combine that into:

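import re

import chardet  # third-party: pip install chardet

filenames = ["sub1.srt", "sub2.srt"]

# Detect the encoding of each input file from its raw bytes.
encodings = []
for name in filenames:
    with open(name, "rb") as file:
        detected = chardet.detect(file.read())["encoding"]
    # chardet reports None when it cannot decide; falling back to UTF-8 is an assumption.
    encodings.append(detected or "utf-8")

re_sub_no = re.compile(r"^\d+[ \t]*$", re.MULTILINE)

sub_no = 0  # running subtitle number across all files

def repl(match):
    global sub_no
    sub_no += 1
    return str(sub_no)

# The merged file is written in the encoding detected for the first input file.
with open("sub_merged.srt", "w", encoding=encodings[0]) as fout:
    for name, encoding in zip(filenames, encodings):
        with open(name, "r", encoding=encoding) as fin:
            fout.write(re_sub_no.sub(repl, fin.read()) + "\n\n")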
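
One limitation of the regex approach is that a subtitle whose text line happens to consist only of digits (for example "1984") would be renumbered as well. A possible alternative, sketched here under the assumption that subtitles are separated by blank lines and that the first line of each block is its number, is to split the text into blocks and rewrite only that first line. renumber_blocks is an illustrative name, and the sample strings are made up:

```python
import re

def renumber_blocks(texts):
    """Merge SRT contents, replacing only the first line of each block."""
    blocks = []
    for text in texts:
        # Normalise Windows line endings, then split on blank lines.
        normalised = text.replace("\r\n", "\n").strip()
        blocks.extend(b for b in re.split(r"\n\s*\n", normalised) if b.strip())

    out = []
    for number, block in enumerate(blocks, start=1):
        lines = block.split("\n")
        lines[0] = str(number)  # assumed to be the old subtitle number
        out.append("\n".join(lines))
    return "\n\n".join(out) + "\n"

sub1 = "1\n00:00:25,901 --> 00:00:28,839\n1984\n"
sub2 = "1\n00:00:28,840 --> 00:00:31,310\nNo one knows. They won't say.\n"
merged = renumber_blocks([sub1, sub2])
print(merged)
```

Here the text line "1984" is left alone because only the first line of each block is touched. The trade-off is that a file with a missing blank line between two subtitles would be treated as one block.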