I have subtitles in both SRT and VTT format, and I need to match and remove the format-specific syntax so that only clean lines of text remain.
I have come up with this regex:
/\n?\d*?\n?^.* --> [012345]{2}:.*$/m
Sample content (a mix of both SRT and VTT):
1
00:00:04,019 --> 00:00:07,299
line1
line2
2
00:00:07,414 --> 00:00:09,155
line1
00:00:09,276 --> 00:00:11,429
line1
00:00:11,549 --> 00:00:14,874
line1
line2
This matches both the subtitle number and the timing as expected when simulated at https://regex101.com/r/zRsRMR/2/
But when used in the code itself (even when directly using the code snippet generated by https://regex101.com), it only matches the timing, not the subtitle number.
See output:
array (5)
0 => array (1)
0 => "00:00:04,019 --> 00:00:07,299
" (30)
1 => array (1)
0 => "
00:00:07,414 --> 00:00:09,155
" (31)
2 => array (1)
0 => "
00:00:09,276 --> 00:00:11,429
" (31)
3 => array (1)
0 => "
00:00:11,549 --> 00:00:14,874
" (31)
4 => array (1)
0 => "
00:00:11,549 --> 00:00:14,874
" (31)
Can be tested on: http://sandbox.onlinephpfunctions.com/code/dec294251b879144f40a6d1bdd516d2050321242
The goal is to match the subtitle number as well; for example, the first expected match should be:
1
00:00:04,019 --> 00:00:07,299
You could make this part of your expression, `\n?\d*?\n?`, an optional group that matches 1+ digits followed by a newline. The character class `[012345]` might also be written as `[0-5]`.
You could update your expression to:
^(?:\d+\n)?.*\h+-->\h+[0-5]{2}:.*$
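- `^` Start of the string (or of each line when the `m` flag is used, as in your original pattern)
- `(?:\d+\n)?` Optional 1+ digits and a newline
- `.*\h+-->\h+` Match 0+ times any char except a newline, 1+ horizontal whitespace chars, `-->` and 1+ horizontal whitespace chars
- `[0-5]{2}:` Match 2 times 0-5, followed by `:`
- `.*` Match 0+ times any char except a newline
- `$` End of the string (or of each line when the `m` flag is used)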
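As a rough illustration, the updated expression could be applied in PHP along the following lines. This is only a sketch: the sample string, the variable names, the `/…/m` delimiters and flag, and the extra trailing `\n?` in the removal pattern are assumptions added for the example, and the input is assumed to use LF line endings.

```php
<?php
// Shortened sample input with LF line endings (an assumption for this sketch).
$subtitles = "1\n00:00:04,019 --> 00:00:07,299\nline1\nline2\n"
           . "2\n00:00:07,414 --> 00:00:09,155\nline1\n"
           . "00:00:09,276 --> 00:00:11,429\nline1\n";

// The updated expression, wrapped in delimiters with the m flag
// so that ^ and $ anchor at line boundaries.
$pattern = '/^(?:\d+\n)?.*\h+-->\h+[0-5]{2}:.*$/m';

// Collect every cue header: optional subtitle number plus timing line.
preg_match_all($pattern, $subtitles, $matches);
// $matches[0][0] is "1\n00:00:04,019 --> 00:00:07,299"

// To strip the headers, also consume the newline that follows each one
// (the trailing \n? is an addition made for this example).
$removePattern = '/^(?:\d+\n)?.*\h+-->\h+[0-5]{2}:.*$\n?/m';
$clean = preg_replace($removePattern, '', $subtitles);

echo $clean; // prints the four remaining text lines
```

If the input happens to use CRLF line endings, the `\n` inside the optional group would not match directly after the digits. Replacing it with `\R`, which matches any newline sequence in PCRE, is one possible adjustment.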