python, regex, regex-alternation

Regex: Alternators order issue


When using alternation in a regex, the alternatives should be listed in a deliberate order, because the engine is eager: it takes the first alternative that matches, not the longest one.

So, given a list such as co, co., co-op, association, assoc, the longer alternatives should come first to get the most precise match. The list should therefore be changed to association, assoc, co-op, co., co.
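To illustrate the ordering effect, here is a minimal sketch using Python's re module. The names suffixes, naive and ordered are made up for this example, and sorting by length is just one simple way to put longer alternatives first.

```python
import re

# Hypothetical suffix list, taken from the example above.
suffixes = ["co", "co.", "co-op", "association", "assoc"]

# Alternation in the original order: "co" is tried before "co-op".
naive = re.compile("|".join(re.escape(s) for s in suffixes))

# Longest alternatives first, so "co-op" is tried before "co".
by_length = sorted(suffixes, key=len, reverse=True)
ordered = re.compile("|".join(re.escape(s) for s in by_length))

print(naive.match("co-op").group())    # co
print(ordered.match("co-op").group())  # co-op
```

re.escape is used so that the . in co. is treated as a literal dot rather than as "any character".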

I have a basic regex pattern that splits a string in two at a hyphen or slash, so that I keep only the part before the hyphen or slash:

(.*(?<!\w)(CO-OP|CO|CO.)(?!\w).*)[-/](\s*\w+.*)

However, this regex splits ABC CO-OP ELEMENTARY SCHOOL incorrectly: the string becomes just ABC CO. If I remove CO from the alternation, the string is returned in its original form, ABC CO-OP ELEMENTARY SCHOOL, which is correct. In addition, the string ARMSTRONG CO-OP ELEMENTARY SCHOOL / ECOLE PRIMAIRE ARMSTRONG COOPERATIVE should become ARMSTRONG CO-OP ELEMENTARY SCHOOL, without the text after the slash.

Why is CO matched in the alternation and used to split the string?


Solution

  • Your issue is that your regex requires there to be a - or a / in the string. CO-OP is tried first, but no - or / follows it, so the engine backtracks, matches CO instead, and splits ABC CO-OP ELEMENTARY SCHOOL on the - in CO-OP. If you:

    1. make the second part of the regex optional;
    2. change the .* at the end of the first group to be lazy (.*?); and
    3. add start and end-of-string anchors

    you will get the results you want:

    ^(.*(?<!\w)(?:CO-OP|CO|CO\.)(?!\w).*?)(?:[-/](\s*\w+.*))?$
    

    Note also that the . in CO. should be escaped (CO\.), since an unescaped . matches any character.
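As a quick check, the sketch below runs both patterns through Python's re module. The helper head and the .strip() call are additions for this example, not part of the answer; strip() only removes the space left in front of the slash.

```python
import re

# Pattern from the question: the "[-/]" part is mandatory.
original = re.compile(r"(.*(?<!\w)(CO-OP|CO|CO.)(?!\w).*)[-/](\s*\w+.*)")

# Pattern from the answer: anchored, lazy first group, optional tail.
fixed = re.compile(r"^(.*(?<!\w)(?:CO-OP|CO|CO\.)(?!\w).*?)(?:[-/](\s*\w+.*))?$")


def head(pattern, text):
    """Hypothetical helper: return group 1 if the pattern matches, else the text unchanged."""
    m = pattern.match(text)
    return m.group(1).strip() if m else text


short = "ABC CO-OP ELEMENTARY SCHOOL"
long_ = "ARMSTRONG CO-OP ELEMENTARY SCHOOL / ECOLE PRIMAIRE ARMSTRONG COOPERATIVE"

print(head(original, short))  # ABC CO
print(head(fixed, short))     # ABC CO-OP ELEMENTARY SCHOOL
print(head(fixed, long_))     # ARMSTRONG CO-OP ELEMENTARY SCHOOL
```

With the original pattern the engine has to find a - or /, so it backtracks from CO-OP to CO and splits inside CO-OP. With the optional tail, the CO-OP alternative already leads to a full match and CO is never needed.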