regex, regex-lookarounds, regex-group, pcre, regex-negation

Can I get help generating the correct regex to match a list of test strings?


At work I came upon a problem that required parsing filenames that did not follow a consistent pattern. I spent a significant amount of time trying to implement a pure regex solution, but settled for some string splits in code, with regex applied to the split parts. It worked out well, but I want to revisit this and learn whether there is a valid way to do it all in regex.

I cannot work out how to stop [\w\.\-]+ from matching when it is followed by v?\d+\.\d+. I've looked into lookahead and lookbehind, but can't really understand how to apply either to this situation.

My "closest" attempts are below; the regex101 links contain more detail about the matches and errors.

https://regex101.com/r/nz4Bxl/1

(?P<name>([\w\-]+)+\.?)(?P<version>(\.?\d+\.?)+\.?)(\.[a-z]{3,})/gm

With this pattern, the majority of strings are matched (although most versions are missing the leading digit and period), except for:

appname.v1.2.3.exe (no name matched, version missing 1.)
appname.v1.02.03.exe (no name matched, version missing 1.)

https://regex101.com/r/FyG7Mm/1/

(?P<name>([\w\-]+)]?)(?P<version>((\d+\.){1,}\d+\.?))\.[a-z]{3,}/gm

With this pattern, the majority of the input strings are matched, except for:

app-name.1.2.3.exe (no matches)
appname.v1.2.3.exe (version matched, but no name match)
appname.v1.02.03.exe (version matched, but no name match)
appname71name.1.2.3.exe (no matches)

Neither of the above patterns matches everything. The input text and desired results are below:

input --> desired result

app-name.1.2.3.exe --> name: app-name version: 1.2.3
appname.v1.2.3.exe --> name: appname version: 1.2.3
appname.v1.02.03.exe --> name: appname version: 1.02.03
appname71name.1.2.3.exe --> name: appname71name version: 1.2.3
appname71.1.2.3.exe --> name: appname71 version: 1.2.3
appname-1.2.3.exe --> name: appname version: 1.2.3
appname-v1.2.3.exe --> name: appname version: 1.2.3
appname-v1.02.03.exe --> name: appname version: 1.02.03
appname71name-1.2.3.exe --> name: appname71name version: 1.2.3
appname71-1.2.3.exe --> name: appname71 version: 1.2.3
appname_1.2.3.exe --> name: appname version: 1.2.3
appname_v1.2.3.exe --> name: appname version: 1.2.3
appname_v1.02.03.exe --> name: appname version: 1.02.03
appname71name_1.2.3.exe --> name: appname71name version: 1.2.3
appname71_1.2.3.exe --> name: appname71 version: 1.2.3

Solution

  • You could start the match with one or more word characters excluding the underscore, and then optionally repeat a - or _ followed by more of those characters, as long as the - or _ is not directly followed by an optional v, 1+ digits and a .

    ^(?P<name>[^\W_]+(?:[-_](?!v?\d+\.)[^\W_]+)*)[-_.]?v?(?P<version>\d+(?:\.\d+)*)\.[a-z]{3,}$
    

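    The pattern in parts matches:

      • ^ Start of string
      • (?P<name> Named group name
      • [^\W_]+ Match 1+ word characters except _
      • (?: Non-capture group, repeated as a whole
      • [-_](?!v?\d+\.) Match - or _ if not directly followed by an optional v, 1+ digits and a .
      • [^\W_]+ Match 1+ word characters except _
      • )* Close the non-capture group and repeat it 0+ times
      • ) Close group name
      • [-_.]?v? Match an optional -, _ or . separator and an optional v
      • (?P<version>\d+(?:\.\d+)*) Named group version, match 1+ digits and optionally repeat a . and 1+ digits
      • \.[a-z]{3,} Match a . and 3 or more characters a-z
      • $ End of string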

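As a quick check, the pattern can be run against the inputs from the question. This is only a minimal sketch, assuming Python, whose re module accepts the (?P<name>...) named-group syntax used in the pattern. The helper parse_filename and the CASES table are names introduced here for illustration and are not part of the original answer.

```python
import re

# Pattern from the answer, unchanged. The negative lookahead (?!v?\d+\.)
# stops the name from consuming a "-" or "_" that introduces the version.
PATTERN = re.compile(
    r"^(?P<name>[^\W_]+(?:[-_](?!v?\d+\.)[^\W_]+)*)"
    r"[-_.]?v?(?P<version>\d+(?:\.\d+)*)\.[a-z]{3,}$"
)

# Inputs and desired (name, version) results copied from the question.
CASES = {
    "app-name.1.2.3.exe": ("app-name", "1.2.3"),
    "appname.v1.2.3.exe": ("appname", "1.2.3"),
    "appname.v1.02.03.exe": ("appname", "1.02.03"),
    "appname71name.1.2.3.exe": ("appname71name", "1.2.3"),
    "appname71.1.2.3.exe": ("appname71", "1.2.3"),
    "appname-1.2.3.exe": ("appname", "1.2.3"),
    "appname-v1.2.3.exe": ("appname", "1.2.3"),
    "appname-v1.02.03.exe": ("appname", "1.02.03"),
    "appname71name-1.2.3.exe": ("appname71name", "1.2.3"),
    "appname71-1.2.3.exe": ("appname71", "1.2.3"),
    "appname_1.2.3.exe": ("appname", "1.2.3"),
    "appname_v1.2.3.exe": ("appname", "1.2.3"),
    "appname_v1.02.03.exe": ("appname", "1.02.03"),
    "appname71name_1.2.3.exe": ("appname71name", "1.2.3"),
    "appname71_1.2.3.exe": ("appname71", "1.2.3"),
}


def parse_filename(filename):
    """Return (name, version), or None if the filename does not match."""
    match = PATTERN.match(filename)
    if match is None:
        return None
    return match.group("name"), match.group("version")


if __name__ == "__main__":
    # Print each input next to what the pattern extracted from it.
    for filename, expected in CASES.items():
        result = parse_filename(filename)
        status = "ok" if result == expected else "MISMATCH"
        print(f"{filename:26} -> {result} [{status}]")
```

Each of the fifteen inputs should report ok, and a filename without a version part, such as readme.txt, makes parse_filename return None.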