I have a batch of .srt subtitle files that I want to merge into one.
sub1.srt

1
00:00:21,601 --> 00:00:24,130
- What happened? - It's a mess, I heard.

2
00:00:24,131 --> 00:00:25,900
- What's that? - Dead bodies?

3
00:00:25,901 --> 00:00:28,839
- What's going on? - I wish I knew.

sub2.srt

1
00:00:28,840 --> 00:00:31,310
No one knows. They won't say.

2
00:00:31,311 --> 00:00:35,276
- My gosh. - How can so many die?

3
00:00:45,191 --> 00:00:46,556
When you starve,

Expected result after merging:

1
00:00:21,601 --> 00:00:24,130
- What happened? - It's a mess, I heard.

2
00:00:24,131 --> 00:00:25,900
- What's that? - Dead bodies?

3
00:00:25,901 --> 00:00:28,839
- What's going on? - I wish I knew.

4
00:00:28,840 --> 00:00:31,310
No one knows. They won't say.

5
00:00:31,311 --> 00:00:35,276
- My gosh. - How can so many die?

6
00:00:45,191 --> 00:00:46,556
When you starve,
I found this script, and it does merge the files. The problem is the numbering: the subtitle numbers in the output are not consecutive.
filenames = ['sub1.srt', 'sub2.srt']
with open('output_file.srt', 'w') as outfile:
    for fname in filenames:
        with open(fname) as infile:
            for line in infile:
                outfile.write(line)
The numbers in the output restart with every file:
1
2
3
1
2
3
How can I fix this?
You could try the following:
import re

# Matches a line that contains only a number (the subtitle index).
re_sub_no = re.compile(r"^\s*\d+\s*$", re.MULTILINE)

def repl(match):
    # Replace each matched index with the next number of the running count.
    global sub_no
    sub_no += 1
    return str(sub_no)

sub_no = 0
filenames = ["sub1.srt", "sub2.srt"]
with open("sub_merged.srt", "w") as fout:
    for name in filenames:
        with open(name, "r") as fin:
            fout.write(re_sub_no.sub(repl, fin.read()) + "\n\n")
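The regex pattern re_sub_no searches for the subtitle numbers, and re.sub calls the repl function for each match, which keeps the numbering consistent across files. The sub_no variable is declared global because the function has no other way to keep the current count between calls. The \s* parts of the pattern are only a precaution in case there is whitespace before or after the number; maybe you don't need them (try it without). Be aware that \s also matches newlines, so the leading \s* can swallow the blank line in front of a number; if the separators between subtitles disappear in the output, drop it or use [ \t]* instead.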
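If you would rather avoid the global variable, here is a minimal sketch of the same renumbering with a counter from itertools. The names renumber and RE_SUB_NO and the two sample strings are made up for illustration, and the pattern uses [ \t]* instead of \s* on the assumption that blank separator lines should stay untouched.

```python
import itertools
import re

# Same idea as re_sub_no, but [ \t]* cannot match newlines,
# so the blank lines between subtitles are left alone.
RE_SUB_NO = re.compile(r"^[ \t]*\d+[ \t]*$", re.MULTILINE)

def renumber(texts):
    """Renumber the index lines across several SRT texts, starting at 1."""
    counter = itertools.count(1)  # shared by all texts, replaces the global
    return [RE_SUB_NO.sub(lambda m: str(next(counter)), t) for t in texts]

# Hypothetical sample data standing in for the contents of two files.
part1 = (
    "1\n00:00:21,601 --> 00:00:24,130\nWhat happened?\n\n"
    "2\n00:00:24,131 --> 00:00:25,900\nWhat's that?\n"
)
part2 = "1\n00:00:28,840 --> 00:00:31,310\nNo one knows.\n"

merged = "\n".join(renumber([part1, part2]))
print(merged)
```

The counter lives inside renumber, so calling the function again starts from 1 without having to reset anything.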
The code above doesn't address the encoding of the files. When you run the following for the files that you've linked to
import chardet  # third-party package: pip install chardet

for name in ["sub1.srt", "sub2.srt"]:
    # Read the raw bytes so chardet can guess the encoding.
    with open(name, "br") as file:
        print(chardet.detect(file.read()))
the result will, most likely, be
{'encoding': 'UTF-8-SIG', 'confidence': 1.0, 'language': ''}
{'encoding': 'UTF-8-SIG', 'confidence': 1.0, 'language': ''}
So, the following modification should hopefully work:
...
with open("sub_merged.srt", "w", encoding="utf-8-sig") as fout:
    for name in filenames:
        with open(name, "r", encoding="utf-8-sig") as fin:
            fout.write(re_sub_no.sub(repl, fin.read()) + "\n\n")
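To illustrate why the encoding can matter for the renumbering, here is a small sketch with made-up bytes (raw is not taken from your files). A UTF-8 byte order mark that is not stripped stays in front of the first number, so a pattern anchored with ^ does not find that number.

```python
import re

re_sub_no = re.compile(r"^\d+\s*$", re.MULTILINE)

# Bytes of a tiny SRT file that starts with a UTF-8 byte order mark (BOM).
raw = b"\xef\xbb\xbf1\n00:00:28,840 --> 00:00:31,310\nNo one knows.\n"

plain = raw.decode("utf-8")      # BOM survives as the character U+FEFF
clean = raw.decode("utf-8-sig")  # BOM is stripped while decoding

print(len(re_sub_no.findall(plain)))  # 0, the BOM hides the first number
print(len(re_sub_no.findall(clean)))  # 1
```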
For future use, you could combine that into:
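import re

import chardet  # third-party package: pip install chardet

filenames = ["sub1.srt", "sub2.srt"]

# Detect the encoding of every input file first.
encodings = []
for name in filenames:
    with open(name, "br") as file:
        encodings.append(chardet.detect(file.read())["encoding"])

# Matches a line that contains only a number (the subtitle index).
re_sub_no = re.compile(r"^\d+\s*$", re.MULTILINE)

def repl(match):
    # Replace each matched index with the next number of the running count.
    global sub_no
    sub_no += 1
    return str(sub_no)

sub_no = 0
# The merged file is written in the encoding of the first input file.
with open("sub_merged.srt", "w", encoding=encodings[0]) as fout:
    for name, encoding in zip(filenames, encodings):
        with open(name, "r", encoding=encoding) as fin:
            fout.write(re_sub_no.sub(repl, fin.read()) + "\n\n")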
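One caveat of the regex approach is that a subtitle whose text is only a number (for example "1984") would be renumbered too. As an alternative sketch, not the method above, you could split each file into blocks at the blank lines and replace only the first line of each block. The function name merge_srt_texts and the sample strings are hypothetical, and the sketch assumes well-formed input where every block starts with its index line and the subtitle text contains no blank lines.

```python
import re

def merge_srt_texts(texts):
    """Merge SRT texts, renumbering only the first line of each block."""
    blocks = []
    for text in texts:
        # Blocks are separated by one or more blank lines.
        for block in re.split(r"\n\s*\n", text.strip()):
            lines = block.splitlines()
            if len(lines) < 2:
                continue  # skip fragments without index and timestamp lines
            blocks.append(lines[1:])  # drop the old index, keep the rest
    return "\n\n".join(
        "\n".join([str(number), *lines])
        for number, lines in enumerate(blocks, start=1)
    ) + "\n"

# Hypothetical sample data; note the subtitle text "1984" in the first block.
a = (
    "1\n00:00:01,000 --> 00:00:02,000\n1984\n\n"
    "2\n00:00:03,000 --> 00:00:04,000\nHello\n"
)
b = "1\n00:00:05,000 --> 00:00:06,000\nBye\n"

print(merge_srt_texts([a, b]))
```

To use it on real files you would read each file (with the detected encoding) into a string, pass the list of strings to the function, and write the result to the output file.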