google-bigqueryre2

BigQuery - remove duplicate substrings from string


I have a string which contains repeated substrings like

The big black dog big black dog is a friendly friendly dog who who lives nearby nearby

How can I remove these in BQ so that the result looks like this

The big black dog is a friendly dog who lives nearby

I tried with regex capturing groups like in this question but no luck so far, it just returns the original phrase

with text as (
  select "The big black dog big black dog is a friendly 
        friendly dog who who lives nearby nearby" as phrase
)

select REGEXP_REPLACE(phrase,r'([ \\w]+)\\1',r'$1') from text

Edit: assuming the repeated word or phrase follows right after the first instance of a word or phrase

i.e. dog dog is considered a repetition of dog, but dog good dog is not

similarly for phrases:

good dog good dog is repetition of good dog

but good dog catch ball good dog is not considered a repetition


Solution

  • In addition to @Mikhail's wonderful SQL-ish approach, you might consider below regexp JS UDF approach.

    regular expression explanation

    CREATE TEMP FUNCTION dedup_repetition(s STRING) RETURNS STRING LANGUAGE js AS
    """
      const re = /(.{2,}) (?=\\1)/g;
      return s.replace(re, '');
    """;
    
    WITH sample_data AS (
      SELECT 'The big black dog big black dog is a friendly friendly dog who who lives nearby nearby' phrase UNION ALL
      SELECT 'good dog good dog' UNION ALL
      SELECT 'good dog good dog good dog' UNION ALL
      SELECT 'good dog good dog good' UNION ALL
      SELECT 'dog dog dog' UNION ALL
      SELECT 'good dog catch ball good dog'
    )
    SELECT dedup_repetition(phrase) deduped_phrase FROM sample_data;
    

    enter image description here