[SOLVED] Trouble parsing a large M3U file

An M3U file is a playlist file. It contains a list of entries describing media files: their name, id, categories, etc. Each entry spans two lines: the first line holds the metadata and the second line holds the file or streaming URL.

Example:

#EXTINF:-1 tvg-id="ChannelName" tvg-name="|FR| Channel" tvg-logo="http://logo" timeshift="1" group-title="|FR| FrenchChannel",|FR| Channel Fullname
URL

My file contains around 90,000 entries and 160,000 lines, and weighs around 20 MB.

I want to parse this file and get every entry. I tried using this regex:

'(.+?),(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)(.+)\s*(.+)\s*'

It captures the metadata, the full name, and the URL in separate matching groups. It works fine on subsets of both 30,000 and 50,000 lines. On the full set, however, the matching takes far too long, to the point that I had to terminate it before it finished.

I cannot get this parsing to work. Is this a design pattern issue, or is the regex just too slow? I'm quite confused.

Solution

One option might be to match the repeated key-value pairs explicitly instead of using the non-greedy .+?, which prevents unnecessary backtracking, and to omit the positive lookahead (?=...) altogether:

^(#\S+(?:\s+[^\s="]+="[^"]+")+),(.*)\s*(.*)

Explanation

^ Start of string (or of each line when the multiline flag is enabled)
( First capturing group
- #\S+ Match # followed by 1+ times a non-whitespace char
- (?:\s+[^\s="]+="[^"]+")+ Repeat 1+ times a key-value pair preceded by 1+ times a whitespace char
) Close group 1
,(.*) Match a comma and capture 0+ times any char except a newline in group 2
\s* Match 0+ times a whitespace char
(.*) Capture in group 3 matching any char except a newline 0+ times
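
As a rough illustration, here is how the pattern above could be applied in Python. This is only a sketch under a few assumptions: the names ENTRY_RE, ATTR_RE and parse_entries are made up for this example, the sample URLs are placeholders, and the multiline flag is enabled so that ^ matches at the start of every line rather than only at the start of the whole text.

```python
import re

# Pattern from the answer. re.MULTILINE lets ^ match at the start of each line.
ENTRY_RE = re.compile(
    r'^(#\S+(?:\s+[^\s="]+="[^"]+")+),(.*)\s*(.*)',
    re.MULTILINE,
)

# Hypothetical helper pattern: pulls key="value" pairs out of the metadata group.
ATTR_RE = re.compile(r'([^\s="]+)="([^"]+)"')


def parse_entries(text):
    """Yield one dict per playlist entry found in text."""
    for match in ENTRY_RE.finditer(text):
        metadata, name, url = match.groups()
        yield {
            "attrs": dict(ATTR_RE.findall(metadata)),
            "name": name.strip(),
            "url": url.strip(),
        }


# Made-up sample data; a real run would read the playlist file instead.
sample = (
    '#EXTINF:-1 tvg-id="ChannelName" tvg-name="|FR| Channel" '
    'tvg-logo="http://logo" timeshift="1" '
    'group-title="|FR| FrenchChannel",|FR| Channel Fullname\n'
    'http://example.invalid/stream1\n'
    '#EXTINF:-1 tvg-id="Other" group-title="News",Other Channel\n'
    'http://example.invalid/stream2\n'
)

entries = list(parse_entries(sample))
for entry in entries:
    print(entry["name"], entry["url"], entry["attrs"].get("group-title"))
```

Using finditer means the matches are produced one at a time instead of being collected up front. Note that [^"]+ requires at least one character between the quotes, so an entry with an empty attribute value such as tvg-id="" would not match the pattern as written; changing it to [^"]* in both patterns should cover that case if it occurs in your file.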