Remove duplicated words from a space-delimited string


I have a string:

$s = 'Артгалерея Живопись Африка и от the Albert$Lizah, L-77, Christ UF1.1 (Christ).';

I want to get an array containing the words of the following string:

$s = 'Артгалерея Живопись Африка Albert Lizah Christ';

I used this regex:

   preg_match_all('#\pL{4,}+#iu', $s, $m);
   $m = preg_replace("/\b(\w+)\s+\\1\b/i", "$1", implode(' ',$m[0]));
   $m = explode(' ', $m);
   echo '<pre>'.print_r($m, 1).'</pre>';

And got:

$s = 'Артгалерея Живопись Африка Albert Lizah Christ Christ';

But I cannot get the result without the duplicated word.

Question: How can I change the PHP regular expression #\pL{4,}+#iu so that duplicated words are excluded from the result?


Solution

  • Use a negative lookahead assertion with a backreference:

        \b(\pL{4,}+)\b(?!.*\b\1\b)
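A minimal sketch of how that pattern could be plugged into the code from the question, assuming the same `#...#iu` delimiters and flags as the original attempt. The lookahead `(?!.*\b\1\b)` rejects a word whenever the same word appears again later in the string, so only the last occurrence of each word is kept:

```php
<?php
$s = 'Артгалерея Живопись Африка и от the Albert$Lizah, L-77, Christ UF1.1 (Christ).';

// Match words of 4+ letters that do NOT occur again later in the string.
// The "i" flag makes the backreference \1 case-insensitive as well,
// and "u" treats the pattern and subject as UTF-8.
preg_match_all('#\b(\pL{4,}+)\b(?!.*\b\1\b)#iu', $s, $m);

// $m[0] holds the full matches, one word per element.
print_r($m[0]);
```

Here the first "Christ" is skipped because "(Christ)" follows it, so the array ends with a single "Christ". Note that `.` does not match newlines by default, so for multi-line input the `s` flag would probably be needed too. For long strings, running `array_unique()` on the matches of the original pattern may be a simpler option, although it compares case-sensitively.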