python, python-re, findall

How do I write a regular expression to find all words that contain 2 or more of the same consonant in a row?


I'm trying to write a regular expression to find all words that contain a sequence of 2 or more of the same consonant.

I have tried the following, but it does not return what I expect:

import re

xh_data = "mmh tshhu itshu mama krrrr"
onomat_consonant_words = re.findall(r'\b\w*([b-df-hj-np-tv-z])\1\w*\b', xh_data, flags=re.IGNORECASE)

print(onomat_consonant_words)

It should give the following output: ['mmh', 'tshhu', 'krrrr']. It currently just gives ['m', 'h', 'r'].

I am trying to use a backreference with \1, but I am not sure I am doing it correctly here.


Solution

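  • The main issue here is that re.findall returns the text of the capturing group, rather than the whole match, whenever the pattern contains a capturing group. The backreference \1 itself is used correctly.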

    One solution is to use finditer and extract the complete match:

    onomat_consonant_words = [
        m[0]
        for m in re.finditer(r'\b\w*([b-df-hj-np-tv-z])\1\w*', xh_data, flags=re.IGNORECASE)
    ]
    

    Note that you don't really need the final \b. The greedy \w* consumes every remaining word character, so the match already ends at a word boundary.
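To make the difference concrete, here is a small sketch that runs the same pattern through findall and finditer on the sample string from the question. The variable names (groups_only, whole_words, wrapped, whole_words_alt) are made up for illustration. The last part shows one possible alternative, assuming you would rather keep findall: wrap the whole pattern in an outer group, so the consonant group becomes group 2 and the backreference becomes \2.

```python
import re

xh_data = "mmh tshhu itshu mama krrrr"

# Same pattern as in the answer: one capturing group plus a backreference to it.
pattern = re.compile(r'\b\w*([b-df-hj-np-tv-z])\1\w*', flags=re.IGNORECASE)

# findall returns only the captured group, i.e. the doubled consonant.
groups_only = pattern.findall(xh_data)

# finditer yields match objects; group(0) is the complete matched word.
whole_words = [m.group(0) for m in pattern.finditer(xh_data)]

# Alternative (illustrative): an outer group captures the whole word, so
# findall returns (word, consonant) tuples and we keep the first element.
wrapped = re.compile(r'\b(\w*([b-df-hj-np-tv-z])\2\w*)', flags=re.IGNORECASE)
whole_words_alt = [word for word, _consonant in wrapped.findall(xh_data)]

# Adding the trailing \b back does not change the result on this input.
with_boundary = re.compile(r'\b\w*([b-df-hj-np-tv-z])\1\w*\b', flags=re.IGNORECASE)
same_with_boundary = [m.group(0) for m in with_boundary.finditer(xh_data)]

print(groups_only)   # ['m', 'h', 'r']
print(whole_words)   # ['mmh', 'tshhu', 'krrrr']
```

The finditer version keeps the pattern unchanged. The outer-group version is handy if you also want to know which consonant was repeated.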