Tags: php, regex, regex-group, srt, vtt

RegEx matching for SRT and VTT syntax from subtitles


I have subtitles in both SRT and VTT format, and I need to match and remove the format-specific syntax so that only the clean lines of text remain.

I have come up with this regex: /\n?\d*?\n?^.* --> [012345]{2}:.*$/m

Sample content (a mix of SRT and VTT):

1
00:00:04,019 --> 00:00:07,299
line1
line2

2
00:00:07,414 --> 00:00:09,155
line1

00:00:09,276 --> 00:00:11,429
line1

00:00:11,549 --> 00:00:14,874
line1
line2

On regex101 this matches both the subtitle number and the timing line, as expected: https://regex101.com/r/zRsRMR/2/

But when I use it in actual code (even with the code snippet generated by https://regex101.com), it matches only the timing line, not the subtitle number.

See output:

array (5)
0 => array (1)
0 => "00:00:04,019 --> 00:00:07,299
" (30)
1 => array (1)
0 => "
00:00:07,414 --> 00:00:09,155
" (31)
2 => array (1)
0 => "
00:00:09,276 --> 00:00:11,429
" (31)
3 => array (1)
0 => "
00:00:11,549 --> 00:00:14,874
" (31)
4 => array (1)
0 => "
00:00:11,549 --> 00:00:14,874
" (31)

It can be tested at: http://sandbox.onlinephpfunctions.com/code/dec294251b879144f40a6d1bdd516d2050321242

The goal is to match the subtitle number as well. For example, the first expected match should be:

1
00:00:04,019 --> 00:00:07,299

Solution

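  • You could turn the \n?\d*?\n? part of your expression into an optional non-capturing group, (?:\d+\n)?, which matches one or more digits followed by a newline. The character class [012345] can also be written as [0-5].

    You could update your expression to the following (used with the m flag, as in your original pattern, so that ^ and $ match at line boundaries):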

    ^(?:\d+\n)?.*\h+-->\h+[0-5]{2}:.*$
    

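As a minimal sketch of how the pattern above might be used in PHP, the snippet below runs it over a shortened version of the question's sample and then strips the matched headers to leave only the text lines. The variable names are illustrative. The line-ending normalization step is an assumption: one possible reason the number is matched on regex101 but not in code is that the input uses CRLF line endings, which the question does not confirm.

```php
<?php
// Shortened sample: two numbered cues and one unnumbered cue.
$subtitles = "1\n00:00:04,019 --> 00:00:07,299\nline1\nline2\n\n"
           . "2\n00:00:07,414 --> 00:00:09,155\nline1\n\n"
           . "00:00:09,276 --> 00:00:11,429\nline1\n";

// Assumption: the real input may use CRLF or CR line endings.
// A "\r" sitting before "\n" would stop `\d+\n` from matching the cue number,
// so normalize everything to "\n" first.
$subtitles = preg_replace('/\r\n?/', "\n", $subtitles);

// Pattern from the answer, with the `m` flag so ^ and $ match per line.
$pattern = '/^(?:\d+\n)?.*\h+-->\h+[0-5]{2}:.*$/m';

// Collect every cue header (optional number plus timing line).
preg_match_all($pattern, $subtitles, $matches);

// Remove the headers, then keep only the non-empty text lines.
$stripped = preg_replace($pattern, '', $subtitles);
$lines = preg_split('/\n+/', $stripped, -1, PREG_SPLIT_NO_EMPTY);

print_r($matches[0]);
print_r($lines);
```

With this input, the first entry of $matches[0] should contain the cue number and the timing line together, and $lines should hold only the subtitle text.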