python regex python-re

How do I fix this regex so that it matches hyphenated words where the final segment ends in a consonant other than the letter m?


I want to match all cases where a hyphenated string (which could be made up of one or multiple hyphenated segments) ends in a consonant that is not the letter m.

In other words, it needs to match strings such as 'crack-l', 'crac-ken', 'cr-ca-cr-cr', etc., but not 'crack' (not hyphenated), 'br-oom' (ends in m), 'br-oo' (last segment ends in a vowel), or 'cr-ca-cr-ca' (last segment ends in a vowel).

My pattern is mostly successful, except when there is more than one hyphen. In that case it can return part of a string, such as 'cr-ca-cr' taken from 'cr-ca-cr-ca', even though that string should not match at all.

Here is the code I have tried with example data:

import re
dummy_data = """ 
broom  
br-oom
br-oo
crack
crack-l
crac-ken
crack-ed
cr-ca-cr-ca
cr-ca-cr-cr
cr-ca-cr-cr-cr
"""
pattern = r'\b(?:\w+-)+\w*[bcdfghjklnpqrstvwxyz](?<!m)\b'
final_consonant_hyphenated = [
    m.group(0)
    for m in re.finditer(pattern, dummy_data, flags=re.IGNORECASE)
]
print(final_consonant_hyphenated)

expected output:

['crack-l', 'crac-ken', 'crack-ed', 'cr-ca-cr-cr', 'cr-ca-cr-cr-cr']

current output:

 ['crack-l', 'crac-ken', 'crack-ed', **'cr-ca-cr'**, 'cr-ca-cr-cr', 'cr-ca-cr-cr-cr']

(The bold string is an incorrect match: it is part of the 'cr-ca-cr-ca' string, whose final segment ends in a vowel, not a consonant.)


Solution

  • The partial match happens because \b also matches between a letter and a hyphen, so the engine can backtrack and stop at 'cr-ca-cr'. You could add a negative lookahead, (?!-), so that a match cannot be followed by a hyphen. Another idea is to shorten [bcdfghjklnpqrstvwxyz](?<!m) to [a-z](?<![aeioum]).

    Update: As @Thefourthbird mentioned in the comments, putting the lookbehind after the word boundary \b also results in better performance (fewer steps).

    \b(?:\w+-)+\w*[a-z]\b(?<![aeioum])(?!-)
    

    See this demo at regex101. You could even use \b(?:\w+-)+\w+\b(?<![aeioum\d_])(?!-) (without the [a-z], using \w+ instead of \w*, and with the lookbehind also disallowing the digits and underscore that \w would otherwise accept). With a possessive quantifier (using the PyPI regex module) it can be reduced further: \b(?:\w+-)+\w++(?<![aeioum\d_])(?!-)
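
As a quick check, here is a minimal sketch that runs the patterns above against the sample data from the question. The helper name find_all and the variable names are only illustrative and do not come from the original post. The possessive version is guarded by a version check because, as far as I know, the standard re module only accepts possessive quantifiers from Python 3.11 onward; on older versions the PyPI regex module would be needed instead.

```python
import re
import sys

dummy_data = """
broom
br-oom
br-oo
crack
crack-l
crac-ken
crack-ed
cr-ca-cr-ca
cr-ca-cr-cr
cr-ca-cr-cr-cr
"""

# Lookbehind placed after \b, plus (?!-) so a match cannot stop before a later hyphen.
pattern_az = r'\b(?:\w+-)+\w*[a-z]\b(?<![aeioum])(?!-)'
# Variant without [a-z]: the lookbehind also rejects a final digit or underscore.
pattern_w = r'\b(?:\w+-)+\w+\b(?<![aeioum\d_])(?!-)'


def find_all(pattern, text):
    """Return every full match of pattern in text, ignoring case (illustrative helper)."""
    return [m.group(0) for m in re.finditer(pattern, text, flags=re.IGNORECASE)]


matches_az = find_all(pattern_az, dummy_data)
matches_w = find_all(pattern_w, dummy_data)
print(matches_az)
print(matches_w == matches_az)

# Possessive variant: only compiled where the re module is assumed to support \w++.
if sys.version_info >= (3, 11):
    pattern_possessive = r'\b(?:\w+-)+\w++(?<![aeioum\d_])(?!-)'
    print(find_all(pattern_possessive, dummy_data) == matches_az)
```

On the sample data, the first print should show ['crack-l', 'crac-ken', 'crack-ed', 'cr-ca-cr-cr', 'cr-ca-cr-cr-cr'], with no partial 'cr-ca-cr' match. Note that y is treated as a consonant here, as in the original character class.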