When using alternation in regex, the order of the alternatives matters: the engine is eager and takes the first alternative that matches, even if a longer one listed later would also match.
So given a list such as co, co., co-op, association, assoc, we should put the more specific alternatives first to get the most precise match. The list should therefore be reordered to association, assoc, co-op, co., co.
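As a minimal sketch of the ordering point above, the snippet below compares the list in its original order with a reordered one. It assumes Python's re module (the question does not name a language), and the build_alternation helper is hypothetical. Sorting longest-first is a simple heuristic that guarantees no alternative is shadowed by one of its own prefixes; it is not the only valid ordering.

```python
import re

# Alternatives in the order given in the text (short forms first).
terms = ["co", "co.", "co-op", "association", "assoc"]

def build_alternation(items):
    """Hypothetical helper: join items longest-first, escaping metacharacters."""
    by_length = sorted(items, key=len, reverse=True)
    return "(?:" + "|".join(re.escape(t) for t in by_length) + ")"

# Same alternatives, but kept in the original order.
naive = "(?:" + "|".join(re.escape(t) for t in terms) + ")"
ordered = build_alternation(terms)

# The eager engine stops at the first alternative that matches.
print(re.match(naive, "co-op").group(0))    # co
print(re.match(ordered, "co-op").group(0))  # co-op
```

Note that re.escape also takes care of the literal dot in co., which would otherwise match any character.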
I have a basic regex pattern to split a string in two if it contains a hyphen or slash, so that I keep just the part before the hyphen or slash:
(.*(?<!\w)(CO-OP|CO|CO.)(?!\w).*)[-/](\s*\w+.*)
However, this regex splits incorrectly when given ABC CO-OP ELEMENTARY SCHOOL: the string becomes just ABC CO. If I remove CO from the alternatives, the string is returned in its original form, ABC CO-OP ELEMENTARY SCHOOL, which is correct. In addition, the string ARMSTRONG CO-OP ELEMENTARY SCHOOL / ECOLE PRIMAIRE ARMSTRONG COOPERATIVE should become ARMSTRONG CO-OP ELEMENTARY SCHOOL, without the part after the slash.
Why is CO matched in the alternation and used to split the string?
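The behaviour described in the question can be reproduced with a short script. This is a sketch that assumes Python's re module, since the original post does not say which regex engine was used; the variable names are illustrative only.

```python
import re

# The pattern from the question, unchanged (note the unescaped dot in CO.).
pattern = re.compile(r"(.*(?<!\w)(CO-OP|CO|CO.)(?!\w).*)[-/](\s*\w+.*)")

m = pattern.match("ABC CO-OP ELEMENTARY SCHOOL")
print(m.group(1))  # ABC CO
print(m.group(2))  # CO
print(m.group(3))  # OP ELEMENTARY SCHOOL

# Same pattern with the CO alternative removed: nothing matches at all,
# so the caller is left with the original string.
without_co = re.compile(r"(.*(?<!\w)(CO-OP|CO.)(?!\w).*)[-/](\s*\w+.*)")
print(without_co.match("ABC CO-OP ELEMENTARY SCHOOL"))  # None
```

Group 3 shows that the hyphen consumed by [-/] is the one inside CO-OP.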
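Your issue is that your regex requires there to be a - or a / in the string, so it is forcing ABC CO-OP ELEMENTARY SCHOOL to split on the - in CO-OP. The CO-OP alternative is tried first, but no - or / follows it, so the engine backtracks and tries CO, which is followed by the hyphen. If you make the .* at the end of the first group lazy (.*?), make the part from the - or / onward optional, and anchor the pattern with ^ and $, you will get the results you want: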
^(.*(?<!\w)(?:CO-OP|CO|CO\.)(?!\w).*?)(?:[-/](\s*\w+.*))?$
Note also that the . in CO. should be escaped (CO\.); unescaped, it matches any character rather than a literal period.
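The corrected pattern can be checked against both strings from the question. The sketch below assumes Python's re module; keep_first_part is a hypothetical helper name. Because the lazy .*? also captures the space before the slash, the helper strips trailing whitespace from group 1.

```python
import re

# The corrected pattern from the answer above.
FIXED = re.compile(
    r"^(.*(?<!\w)(?:CO-OP|CO|CO\.)(?!\w).*?)(?:[-/](\s*\w+.*))?$"
)

def keep_first_part(text):
    """Return the part before a trailing '-' or '/' section, if any."""
    m = FIXED.match(text)
    if m is None:
        # No CO / CO. / CO-OP token found: leave the string untouched.
        return text
    # Group 1 may end with the space that preceded the separator.
    return m.group(1).rstrip()

print(keep_first_part("ABC CO-OP ELEMENTARY SCHOOL"))
# ABC CO-OP ELEMENTARY SCHOOL
print(keep_first_part(
    "ARMSTRONG CO-OP ELEMENTARY SCHOOL / ECOLE PRIMAIRE ARMSTRONG COOPERATIVE"
))
# ARMSTRONG CO-OP ELEMENTARY SCHOOL
```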