grepwildcardbackreference

GNU grep, backreferences and wildcards


Using grep (GNU grep 3.3) to search for all words with three consecutive double-letters (resulting in "bookkeeper"):

grep -E "((.)\2){3}" /usr/share/dict/american-english

Changing this to search for words with three double-letters, each of them followed by the letter "i" (resulting in "Mississippi"):

grep -E "((.)\2i){3}" /usr/share/dict/american-english

Changing this to search for words with three double-letters, each of them followed by any single letter (with a couple of results):

grep -E "((.)\2.){3}" /usr/share/dict/american-english

Changing this to search for words with three double-letters separated by an optional, single letter (even more results):

grep -E "((.)\2.?){3}" /usr/share/dict/american-english

Now, finally, my original task: Search for all words containing three double-letters:

grep -E "((.)\2.*){3}" /usr/share/dict/american-english

But this results in an empty set. Why? How can .? match something .* does not?


Solution

  • The POSIX regex engine does not handle patterns with back-references well, matching back references is an NP complete problem might provide some hints on why it is that difficult.

    Since you are using a GNU grep, the problem is easily solved with the PCRE engine,

    grep -P '((.)\2.*){3}' file
    

    since the PCRE engine can handle back-references in a more efficient way than the POSIX regex engine.

    See the online demo.