I want to use a regular expression to remove duplicate character sequences (words) from a string. My question is similar to the one entitled regular expression for duplicate words, but I have some additional requirements.
I need to include additional characters. The accepted answer to the linked question only detects words consisting of alphanumeric characters, but I need to include symbol characters such as “@” in my definition of a word.
I need to match multiple repetitions of a pattern. If a word is repeated three times, the accepted answer to the linked question only removes one of the duplicates, but I need to remove both of them.
Here is the sample string I am using for testing:
hello me now @@@ @@@ @@@ then method me @@@
My desired result is:
hello me now @@@ then method me @@@
The keys to solving this are:
the white-space (\s) and non-white-space (\S) character classes.
Here is the regex you need: /(?<=(\S+)\s+)\1\s+/g
Here is a demonstration of it working.
Here is a screenshot of the demonstration.
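As a minimal sketch of how the regex can be applied, assuming a JavaScript engine with lookbehind support (ES2018 or later), the snippet below runs it against the sample string from the question. The variable names are only illustrative.

```javascript
// Sample string from the question.
const input = "hello me now @@@ @@@ @@@ then method me @@@";

// The lookbehind captures the previous word; \1 then matches a repeat of it,
// and the final \s+ consumes the white space after the repeat.
const duplicateWord = /(?<=(\S+)\s+)\1\s+/g;

// Replace every match with the empty string to strip the duplicates.
const result = input.replace(duplicateWord, "");

console.log(result); // hello me now @@@ then method me @@@
```

Because the pattern ends in \s+, a duplicate at the very end of the string, with no white space after it, is left in place; for example "me me" comes back unchanged.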
Now I will explain the process of creating this regular expression. First, let’s state the goal. The goal is to match any word which is the same as the previous word, so that we can strip it, that is, replace it with nothing. So let’s step through the process:
1. Match a word. The obvious choice is \w+, but that only matches alphanumeric characters. Instead, use \S+, which matches every character that is not considered white space. Note that it matches “@@@” as well as the ordinary words.
2. Only match a word that follows another word. Do this with a lookbehind, (?<= ... ), looking for a word \S+ followed by white space \s+. You can see in the screenshot that the very first word in the string is no longer matched. Perfect.
3. Only match a word that is the same as the previous word. Capture the previous word (by wrapping in parentheses the \S+ inside the lookbehind expression), then refer to that captured group in our match (replacing our original \S+ with \1).
4. Match the white space after the duplicate as well, by adding \s+ to the end of it. That brings us to the final result, which I illustrated at the beginning of this answer: /(?<=(\S+)\s+)\1\s+/g
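The build-up described above can be sketched in JavaScript by running each stage against the sample string. Only the final pattern is quoted from this answer; the three intermediate patterns are reconstructed from the description and should be read as assumptions, as should the variable names.

```javascript
// Sample string from the question.
const sample = "hello me now @@@ @@@ @@@ then method me @@@";

// Stage 1: every run of non-white-space characters, "@@@" included.
const step1 = sample.match(/\S+/g);

// Stage 2: only words preceded by another word and white space,
// so the very first word ("hello") is no longer matched.
const step2 = sample.match(/(?<=\S+\s+)\S+/g);

// Stage 3: capture the previous word inside the lookbehind and require
// the match to repeat it via the backreference \1.
const step3 = sample.match(/(?<=(\S+)\s+)\1/g);

// Stage 4: also consume the white space after the duplicate.
const step4 = sample.match(/(?<=(\S+)\s+)\1\s+/g);

console.log(step1.length); // 10
console.log(step2.length); // 9
console.log(step3);        // [ '@@@', '@@@' ]
console.log(step4);        // [ '@@@ ', '@@@ ' ]
```

The trailing \s+ in the last stage does two jobs: it removes the extra space along with the duplicate, and it stops \1 from matching a mere prefix of a longer word.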