Need to modify all URLs in an m3u8 file with PHP

I have an m3u8 file which goes something like this:

#EXTM3U
#EXT-X-VERSION:3
#EXT-X-MEDIA-SEQUENCE:15084
#EXT-X-TARGETDURATION:4
#EXTINF:4.004,
radio002live-15084.ts
#EXTINF:4.004,
radio002live-15085.ts
(and so on)

What I ideally want to happen is to have all of those file names prefixed with a URL, but only if they don't start with HTTP(S) already. Then URL encode those, add another thing in front of them, and then return that so ideally the file looks like this:

#EXTM3U
#EXT-X-VERSION:3
#EXT-X-MEDIA-SEQUENCE:15084
#EXT-X-TARGETDURATION:4
#EXTINF:4.004,
proxy.php?http%3A%2F%2Fsomeurl.tld%2Fpath%2Fradio002live-15084.ts
#EXTINF:4.004,
proxy.php?http%3A%2F%2Fsomeurl.tld%2Fpath%2Fradio002live-15085.ts
(and so on)

So far, I've tried turning the thing into an array (one line = one key) but I realized this might not cover every m3u8 I need to parse (what if one has the URL on the same line?) and I can't seem to get past detecting what doesn't start with what regardless.

Ideally, this should work in PHP 5-ish.

Solution

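$contents = file_get_contents('/tmp/m3u8');
// Callback: receives one matched line in $matches[0] and returns its replacement.
function httpize($matches)
{
    // Already an http(s) URL (plain or URL-encoded)? Leave the line unchanged.
    if(preg_match('@(?:^|[?=])https?[:%]@', $matches[0])) return $matches[0];
    // Otherwise URL-encode the name and put the proxy prefix in front of it.
    return 'proxy.php?http%3A%2F%2Fsomeurl.tld%2Fpath%2F'.urlencode($matches[0]);
}
// Run httpize() on every line that does not start with #.
echo preg_replace_callback('@^[^#].*$@m', 'httpize', $contents);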

The main entry point is preg_replace_callback(), called with parameters:

'@^[^#].*$@m', its regular expression
- '@…@m' because we chose @ as the delimiter (the usual delimiter is /, but since we will be handling http:// URLs full of /s, another delimiter saves us from escaping every one of them).
  The m at the end means multiline: ^ and $ then apply to each line instead of the whole string (this matters for the ^ discussed next)
- So the actual regular expression is ^[^#].*$
  ^ and $ respectively mean "anchored to the start of the string to analyze" and "anchored to the end of the string to analyze", but because we used the m modifier, "string to analyze" becomes "each line of the contents to analyze"
  Between those two anchors, we first want a [^#] (= "any character other than #"; […] means "any of the characters in …", and the ^ at its start negates it), followed by .*, where . means "any character (except a newline)" and * means "repeated as many times as possible".
  So ^[^#].*$ means: "a line that starts with something other than a #, followed by every character up to the end of the line".
  This will be our filename detector.
'httpize': the function to call when it finds a matching line
$contents: the string to analyze (that we obtained from our file_get_contents())
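
To see which lines the pattern selects, here is a small sketch that runs it through preg_match_all() instead of preg_replace_callback(). The sample playlist and the $playlist and $found variable names are made up for the illustration.

```php
<?php
// Made-up sample playlist; the last entry is already an absolute URL.
$playlist = "#EXTM3U\n"
          . "#EXTINF:4.004,\n"
          . "radio002live-15084.ts\n"
          . "#EXTINF:4.004,\n"
          . "http://someurl.tld/path/radio002live-15085.ts\n";

// Same pattern as in the answer: with the m modifier, ^ and $ anchor at each line.
preg_match_all('@^[^#].*$@m', $playlist, $found);

// $found[0] holds the full text of every match, one entry per matching line.
print_r($found[0]);
```

Only the two non-# lines end up in $found[0]; deciding what to do with the one that is already a URL is left to the callback.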

So what happens when preg_replace_callback() finds a match?
It calls httpize, passing it a single parameter: an array containing the found string (array(0 => 'radio002live-15084.ts')). So to get the string found, we access $matches[0].
Now we're in charge of returning the replacement for what we received as a parameter (remember, we're the callback for preg_replace_callback(): it's waiting for our return value).
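
The callback mechanism on its own can be sketched with a throwaway callback. show_match() is a hypothetical name used only here; it just wraps each match in brackets so the substitution is easy to spot.

```php
<?php
// Hypothetical callback, for illustration only.
function show_match($matches)
{
    // $matches[0] is the whole matched line; the pattern has no capture groups.
    return '[' . $matches[0] . ']';
}

$sample = "#EXTINF:4.004,\nradio002live-15084.ts\n";

// Whatever the callback returns replaces the matched text in the result.
$result = preg_replace_callback('@^[^#].*$@m', 'show_match', $sample);
echo $result;
```

The # line comes back untouched, and the file name line comes back as [radio002live-15084.ts].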

We start with an if(preg_match('@(?:^|[?=])https?[:%]@', $matches[0])) return $matches[0];.
preg_match() will try to find in $matches[0] a string corresponding to the regular expression (?:^|[?=])https?[:%] (noticed that the @…@ has no m modifier? Of course: we're now in a function that received a single-line string, so there is no need for m).

?:^| and ?=] are of course portraits of Riquet with the Tuft… Well, no, not at all, so let's restart.
(…|…) means "either … or …", but, as we will not reuse the contents of the parentheses afterwards, we can tell the regex engine not to remember those contents (use the parentheses only as a group, not as a capturing group) thanks to the ?: right after the (
- The first … is ^, which means "start of contents"
- the second part (after the |) is [?=], meaning "a ? or an ="
just after the closing ) we get http (I'll leave the s for later)
so the block (?:^|[?=])http means "the string http just after: either the start of the contents, or a ? or an ="
after that we get our s?. A ? after a character means "optionally". Thus https? means "http or https"
finally we have [:%], which means "one character, either : or %"

Thus this preg_match() will return 1 (which the if treats as true) if it finds, in the line it received from preg_replace_callback(), an http or https, either at the start of the line or preceded by a ? or an =, and followed by either a : or a % (the start of a URL-encoded :).
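
As a quick check of that rule, this sketch runs the inner pattern against a few entries. The $lines list is invented for the illustration (loosely based on the demo input further down), and $verdicts is just a name for the results.

```php
<?php
$pattern = '@(?:^|[?=])https?[:%]@';

// Invented sample entries.
$lines = array(
    'radio002live-15084.ts',
    'http://radio002live-15085.ts',
    'https%3A%2F%2Fsomeurl.tld%2Fa.ts',
    'thisoneisalreadyon?url=http%3A%2F%2Fsomeurl.tld%2Fa.ts',
    'httpradio002live-15085.ts',
);

$verdicts = array();
foreach ($lines as $line) {
    // preg_match() returns 1 when the pattern is found and 0 when it is not.
    $verdicts[$line] = preg_match($pattern, $line);
}
print_r($verdicts);
```

The second, third and fourth entries get 1. The first has no http at all, and the last has http followed by r rather than : or %, so both get 0.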

If it finds it, it means the file has already been wrapped into a URL. So return $matches[0]; without a change.

On the other hand, if preg_match() returns 0 (so the if does not reach its return), our httpize() function will transform the received string by urlencode()ing it and prepending a fixed URL prefix.
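
One case this pattern does not handle: [^#] also matches a line break, so on an empty line the match starts at the line break itself and runs through the following line, even one that starts with #. With \r\n line endings, the \r would also be swept into the match and URL-encoded. The sketch below is one possible variant, not part of the solution above: rewrite_playlist() and its $prefix parameter are hypothetical names, and the closure requires PHP 5.3 or later.

```php
<?php
// Hypothetical helper. $prefix is whatever should precede the encoded name.
function rewrite_playlist($contents, $prefix)
{
    return preg_replace_callback(
        // First character: not a #, \r or \n; then the rest of the line,
        // stopping before any \r or \n.
        '@^[^#\r\n][^\r\n]*@m',
        function ($matches) use ($prefix) {
            // Leave entries that already carry an http(s) URL untouched.
            if (preg_match('@(?:^|[?=])https?[:%]@', $matches[0])) {
                return $matches[0];
            }
            return $prefix . urlencode($matches[0]);
        },
        $contents
    );
}

// Made-up input with \r\n line endings and an empty line.
$in = "#EXTM3U\r\n\r\n#EXTINF:4.004,\r\nradio002live-15084.ts\r\n";
$out = rewrite_playlist($in, 'proxy.php?http%3A%2F%2Fsomeurl.tld%2Fpath%2F');
echo $out;
```

Passing the prefix as a parameter is only a convenience for reuse; the decision logic inside the closure is the same as in httpize().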

Demo

with input:

#EXTM3U
#EXT-X-VERSION:3
#EXT-X-MEDIA-SEQUENCE:15084
#EXT-X-TARGETDURATION:4
#EXTINF:4.004,
radio002live-15084.ts
#EXTINF:4.004,
radio002live-15085.ts
#EXTINF:4.004,
thisoneisalreadyon?url=http%3A%2F%2Fsomeurl.tld%2Fradio002live-15085.ts
http://radio002live-15085.ts
httpradio002live-15085.ts
with spaces.ts

will return (with PHP 5.6.25):

#EXTM3U
#EXT-X-VERSION:3
#EXT-X-MEDIA-SEQUENCE:15084
#EXT-X-TARGETDURATION:4
#EXTINF:4.004,
proxy.php?http%3A%2F%2Fsomeurl.tld%2Fpath%2Fradio002live-15084.ts
#EXTINF:4.004,
proxy.php?http%3A%2F%2Fsomeurl.tld%2Fpath%2Fradio002live-15085.ts
#EXTINF:4.004,
thisoneisalreadyon?url=http%3A%2F%2Fsomeurl.tld%2Fradio002live-15085.ts
http://radio002live-15085.ts
proxy.php?http%3A%2F%2Fsomeurl.tld%2Fpath%2Fhttpradio002live-15085.ts
proxy.php?http%3A%2F%2Fsomeurl.tld%2Fpath%2Fwith+spaces.ts
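
The last line shows that urlencode() turns a space into +. If whatever reads the proxied URL expects %20 instead (an assumption about your setup, since the question does not say how proxy.php decodes its query string), rawurlencode() is the alternative. A minimal comparison:

```php
<?php
$name = 'with spaces.ts';

// urlencode() uses the form-style encoding: a space becomes +.
$plus = urlencode($name);

// rawurlencode() uses percent-encoding throughout: a space becomes %20.
$percent = rawurlencode($name);

echo $plus, "\n", $percent, "\n";
```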