I am looking to loop through existing .vtt files and read the cue data into a database.
The format of the .vtt files is:
WEBVTT FILE
line1
00:00:00.000 --> 00:00:10.000
‘Stuff’
line2
00:00:10.000 --> 00:00:20.000
Other stuff
Example with 2 lines
line3
00:00:20.00 --> 00:00:30.000
Example with only 2 digits in milliseconds
line4
00:00:30.000 --> 00:00:40.000
Different stuff
00:00:40.000 --> 00:00:50.000
Example without a head line
Originally I was trying to use ^ and $ to be quite strict about line boundaries, with something like: /^(\w*)$^(\d{2}):(\d{2}):(\d{2})\.(\d{2,3}) --> (\d{2}):(\d{2}):(\d{2})\.(\d{2,3})$^(.+)$/ims
but I struggled to get this working in the regex checker and resorted to using \s to deal with line starts/ends.
Currently I am using the following regex: /(.*)\s(\d{2}):(\d{2}):(\d{2})\.(\d{2,3}) --> (\d{2}):(\d{2}):(\d{2})\.(\d{2,3})\s(.+)/im
This partially works in online regex checkers such as https://regex101.com/r/mmpObk/3 (this example does not pick up multi-line subtitles, but it does get the first line, which at this point is good enough for my purpose, as all subtitles are currently one-liners). However, if I put this into PHP:
preg_match_all("/(.*)\s(\d{2}):(\d{2}):(\d{2})\.(\d{2,3}) --> (\d{2}):(\d{2}):(\d{2})\.(\d{2,3})\s(.+)/mi", $fileData, $matches)
and dump the results, I get an array of empty arrays.
What might be different between the online regex checker and PHP?
Thanks in advance for any suggestions.
EDIT--- Below is a dump of $fileData and a dump of $matches:
string(341) "WEBVTT FILE
line1
00:00:00.000 --> 00:00:10.000
‘Stuff’
line2
00:00:10.000 --> 00:00:20.000
Other stuff
Example with 2 lines
line3
00:00:20.00 --> 00:00:30.000
Example with only 2 digits in milliseconds
line4
00:00:30.000 --> 00:00:40.000
Different stuff
00:00:40.000 --> 00:00:50.000
Example without a head line"
array(11) {
[0]=>
array(0) {}
[1]=>
array(0) {}
[2]=>
array(0) {}
[3]=>
array(0) {}
[4]=>
array(0) {}
[5]=>
array(0) {}
[6]=>
array(0) {}
[7]=>
array(0) {}
[8]=>
array(0) {}
[9]=>
array(0) {}
[10]=>
array(0) {}
}
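The problem with your regular expression is its line-ending handling.
You have this at the end: \s(.+)/mi
\s matches exactly one whitespace character, but a line ending can be one character (\n) or two (\r\n). With \r\n, the \s consumes only the \r, which leaves (.+) facing a \n that . does not match.
To fix it, you can use \R(.+)/mi, because \R matches any line ending and treats \r\n as a single unit.
It works on the website because the website normalizes your newlines into Linux-style newlines.
That is, Windows-style newlines are \r\n (2 characters) and Linux-style newlines are \n (1 character).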
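The difference between \s and \R can be reproduced in isolation. The following is a minimal sketch, assuming the file uses Windows-style line endings; the sample string $crlf and the helper variable $time are made up for illustration and are not part of the original code.

```php
<?php
// Hypothetical one-cue sample with Windows-style (\r\n) line endings.
$crlf = "line1\r\n00:00:00.000 --> 00:00:10.000\r\nStuff\r\n";

// Timestamp sub-pattern from the question, factored out for readability.
$time = '(\d{2}):(\d{2}):(\d{2})\.(\d{2,3})';

// Original pattern: the final \s consumes only the \r, so (.+) faces \n and fails.
$withS = preg_match_all("/(.*)\s$time --> $time\s(.+)/mi", $crlf, $m1);

// Same pattern with \R, which matches \r\n as a single unit.
$withR = preg_match_all("/(.*)\R$time --> $time\R(.+)/mi", $crlf, $m2);

var_dump($withS); // number of matches found with \s
var_dump($withR); // number of matches found with \R
```

On a typical PHP build, the \s version should find no match and the \R version should find one. Note that . still matches a lone \r, so captured groups can carry a trailing \r; running trim() on them before storing is a reasonable precaution.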
Alternatively, you can try this regular expression:
/(?:line(\d+)\R)?(\d{2}(?::\d{2}){2}\.\d{2,3})\s*-->\s*(\d{2}(?::\d{2}){2}\.\d{2,3})\R((?:[^\r\n]|\r?\n[^\r\n])*)(?:\r?\n\r?\n|$)/i
It looks horrible, but it works.
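Note: I'm swapping between \R and \r\n because \R has no special meaning inside a character class []. Depending on the PCRE version, it is either treated as a literal R there or rejected as an error.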
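The data is captured like this:
1. The number from the "lineN" header (empty if the cue has no header line)
2. The start time
3. The end time
4. The cue text (may span multiple lines)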
You can try it on https://regex101.com/r/Yk8iD1/1
You can use the handy code generator tool to generate the following PHP:
$re = '/(?:line(\d+)\R)?(\d{2}(?::\d{2}){2}\.\d{2,3})\s*-->\s*(\d{2}(?::\d{2}){2}\.\d{2,3})\R((?:[^\r\n]|\r?\n[^\r\n])*)(?:\r?\n\r?\n|$)/i';
$str = 'WEBVTT FILE
line1
00:00:00.000 --> 00:00:10.000
‘Stuff’
line2
00:00:10.000 --> 00:00:20.000
Other stuff
Example with 2 lines
line3
00:00:20.00 --> 00:00:30.000
Example with only 2 digits in milliseconds
line4
00:00:30.000 --> 00:00:40.000
Different stuff
00:00:40.000 --> 00:00:50.000
Example without a head line';
preg_match_all($re, $str, $matches, PREG_SET_ORDER, 0);
// Print the entire match result
var_dump($matches);
You can test it on http://sandbox.onlinephpfunctions.com/code/7f5362f56e912f3504ed075e7013071059cdee7b
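Since the goal is to load the cues into a database, the matches can be reshaped into one row per cue first. This is only a sketch: the function name cuesFromVtt and the array keys are hypothetical, the sample input is made up, and the actual database insert is left out because the schema is not given.

```php
<?php
// Hypothetical helper (not part of the original answer): VTT text in, rows out.
function cuesFromVtt(string $vtt): array
{
    // Same pattern as above: optional "lineN" header, two timestamps, cue text.
    $re = '/(?:line(\d+)\R)?(\d{2}(?::\d{2}){2}\.\d{2,3})\s*-->\s*(\d{2}(?::\d{2}){2}\.\d{2,3})\R((?:[^\r\n]|\r?\n[^\r\n])*)(?:\r?\n\r?\n|$)/i';
    preg_match_all($re, $vtt, $matches, PREG_SET_ORDER);

    $rows = [];
    foreach ($matches as $m) {
        $rows[] = [
            // Group 1 is an empty string when the cue has no "lineN" header.
            'line'  => $m[1] === '' ? null : (int) $m[1],
            'start' => $m[2],
            'end'   => $m[3],
            'text'  => $m[4],
        ];
    }
    return $rows;
}

// Made-up sample: blank lines between cues, \r\n line endings.
$rows = cuesFromVtt(
    "WEBVTT FILE\r\n\r\n" .
    "line1\r\n00:00:00.000 --> 00:00:10.000\r\nStuff\r\n\r\n" .
    "00:00:10.000 --> 00:00:20.000\r\nOther stuff\r\nSecond line"
);
var_dump($rows);
```

Each row can then be passed to whatever insert statement the database layer uses, ideally a prepared statement so the cue text is not interpolated into SQL.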