python regex search regexp-substr

Finding duplicate words throughout a text using regex


I want to find all words that are repeated anywhere in a text. For example:

string21="we read all sort of books, we read sci-fi books, historical books, advanture books and etc."

The regex should output these words: we, read, books

How can I get this result?

I tried using this:

import re
pattern = r"\b(\w+)\s+\1\b"  # raw string, so \b and \1 reach the regex engine unchanged
match = re.findall(pattern, string21)

But it didn't work as I expected: it only finds duplicate words that are directly next to each other and doesn't search the whole text.


Solution

  • Your attempt doesn't account for other words between the repeated ones.

    To account for them, you can use the regex \b(\w+)\b(?=.*\b\1\b). It matches each word that is followed by the same word somewhere later in the input string.

    Notice that re.findall will return books three times, because it occurs four times in the input (the last occurrence is not returned, as it is not followed by the word books anywhere). To remove these duplicates, we can convert the result into a set and then back into a list.

    import re
    string21="we read all sort of books, we read sci-fi books, historical books, advanture books and etc."
    print(list(set(re.findall(r'\b(\w+)\b(?=.*\b\1\b)', string21))))
    # e.g. ['we', 'books', 'read'] (a set is unordered, so the order may vary)
    
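If the order of first appearance matters, or the text spans several lines, the set conversion above can be swapped for something like the following. This is only a sketch: dict.fromkeys and the re.DOTALL flag are my additions, not part of the original answer, and the names raw and repeated are made up for illustration.

```python
import re

string21 = "we read all sort of books, we read sci-fi books, historical books, advanture books and etc."

# re.DOTALL lets '.' match newlines too, so the lookahead can also
# find a repetition on a later line of a multi-line text.
pattern = re.compile(r'\b(\w+)\b(?=.*\b\1\b)', re.DOTALL)

# Raw result: every occurrence that has the same word somewhere after it.
raw = pattern.findall(string21)
print(raw)  # ['we', 'read', 'books', 'books', 'books']

# dict.fromkeys drops duplicates but keeps first-seen order (unlike set).
repeated = list(dict.fromkeys(raw))
print(repeated)  # ['we', 'read', 'books']
```

Note that the lookahead rescans the rest of the text for every word, so on very long inputs this can get slow.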

    Important: this regex only finds repetitions of words made up of \w characters (letters, digits, and underscore). If you need to include additional symbols (for example ', to handle words like "you're"), the regex has to be a bit more complex:

    (?<![\w'])([\w']+)(?![\w'])(?=.*(?<![\w'])\1(?![\w']))
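To see the difference between the two patterns, here is a small comparison. The sample sentence is made up for illustration (it is not from the question), and dict.fromkeys is used only to drop duplicates while keeping first-seen order.

```python
import re

# Hypothetical sample text, chosen because it contains an apostrophe word.
text = "you're right, you said you're late, right?"

simple = r"\b(\w+)\b(?=.*\b\1\b)"
with_apostrophe = r"(?<![\w'])([\w']+)(?![\w'])(?=.*(?<![\w'])\1(?![\w']))"

# The simple pattern splits "you're" into "you" and "re" at the apostrophe.
simple_result = list(dict.fromkeys(re.findall(simple, text)))
print(simple_result)  # ['you', 're', 'right']

# The extended pattern treats "you're" as one word, and the standalone
# "you" is no longer reported because it never repeats as a whole word.
apostrophe_result = list(dict.fromkeys(re.findall(with_apostrophe, text)))
print(apostrophe_result)  # ["you're", 'right']
```

The lookbehind and lookahead around the word play the role that \b plays in the simpler pattern: they make sure the match is not part of a longer run of word characters or apostrophes.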