For some data processing I need to split a string into multiple items. An example of an input string is:
'one, two & three and four-five 123-456'
Now I need to separate this string into items, where the possible delimiters are: , (comma), & (ampersand), a space, the word "and", and - (hyphen). But, and this is the point where I'm stuck, it should not split on a - when it is between two numbers.
I am using PHP and preg_split to do the actual splitting, but I need a regex pattern that matches the delimiters while excluding a - that sits between two numbers (single digits, or longer numbers such as 123-456). Spaces around each item are stripped with trim() in PHP.
I am using the following regex pattern:
/(and|,|\s|&)|\D(-)\D/
The output (after using preg_split, etc.) is:
[0] => one
[1] => two
[2] => three
[3] => fou
[4] => ive
[5] => 123-456
The splitting itself works, but for the - delimiter the match also consumes the last letter of the preceding word and the first letter of the following word. The item 123-456 is correct, since the pattern should not match (and preg_split should not split) on a - that is immediately surrounded by digits.
Expected output is:
[0] => one
[1] => two
[2] => three
[3] => four
[4] => five
[5] => 123-456
Any help is appreciated. If any information is lacking, let me know and I'll update my question.
What you want to use is lookahead and lookbehind (more generally known as lookaround):
/and|,|\s|&|(?<!\d)-(?!\d)/
What this will do is exactly what the name implies: look around to check whether the specified pattern is matched, without including it in the match. In this case, it'll only match a - that has no digit (the \d) immediately before or after it, and the match will be only the - itself.
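To make the difference in what gets matched concrete, here is a small sketch comparing a consuming pattern with the lookaround version using preg_match_all. The sample string and the variable names are chosen only for illustration.

```php
<?php
$text = 'four-five 123-456';

// Consuming pattern: the neighbouring non-digits become part of the match.
preg_match_all('/\D-\D/', $text, $consuming);

// Lookaround pattern: the neighbours are only inspected, never consumed.
preg_match_all('/(?<!\d)-(?!\d)/', $text, $lookaround);

print_r($consuming[0]);   // one match: "r-f"
print_r($lookaround[0]);  // one match: "-"
```

In both cases the - inside 123-456 is left alone; only the amount of surrounding text swallowed by the match differs.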
Here, (?<!\d) is a negative lookbehind: it looks backwards and succeeds only if the immediately preceding character does not match the pattern. If the preceding character does match, the attempt fails at that position and the regex engine moves on. Likewise, (?!\d) is a negative lookahead: it does precisely the same thing, but in the opposite direction. Because the - is sandwiched between them, the effect is "match a - only if there is no digit directly before it and no digit directly after it". Note that a - with a digit on just one side, as in a-1, is therefore not treated as a delimiter either.
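Putting it together, a minimal sketch of the split with the lookaround pattern, using the example string from the question. The PREG_SPLIT_NO_EMPTY flag is an assumption about how empty items between adjacent delimiters should be handled.

```php
<?php
$input   = 'one, two & three and four-five 123-456';
$pattern = '/and|,|\s|&|(?<!\d)-(?!\d)/';

// PREG_SPLIT_NO_EMPTY discards the empty strings produced by runs of delimiters such as ", ".
$items = preg_split($pattern, $input, -1, PREG_SPLIT_NO_EMPTY);

// Gives: one, two, three, four, five, 123-456
print_r($items);
```

Since \s is itself one of the delimiters, the items come back without surrounding whitespace, so a separate trim() pass should not be needed for this input.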