I have a string which contains repeated substrings like
The big black dog big black dog is a friendly friendly dog who who lives nearby nearby
How can I remove these in BQ so that the result looks like this
The big black dog is a friendly dog who lives nearby
I tried with regex capturing groups like in this question but no luck so far, it just returns the original phrase
with text as (
select "The big black dog big black dog is a friendly
friendly dog who who lives nearby nearby" as phrase
)
select REGEXP_REPLACE(phrase,r'([ \\w]+)\\1',r'$1') from text
Edit: assuming the repeated word or phrase follows right after the first instance of a word or phrase
i.e. dog dog
is considered a repetition of dog
,
but dog good dog
is not
similarly for phrases:
good dog good dog
is repetition of good dog
but good dog catch ball good dog
is not considered a repetition
In addition to @Mikhail's wonderful SQL-ish approach, you might consider below regexp JS UDF approach.
regular expression explanation
CREATE TEMP FUNCTION dedup_repetition(s STRING) RETURNS STRING LANGUAGE js AS
"""
const re = /(.{2,}) (?=\\1)/g;
return s.replace(re, '');
""";
WITH sample_data AS (
SELECT 'The big black dog big black dog is a friendly friendly dog who who lives nearby nearby' phrase UNION ALL
SELECT 'good dog good dog' UNION ALL
SELECT 'good dog good dog good dog' UNION ALL
SELECT 'good dog good dog good' UNION ALL
SELECT 'dog dog dog' UNION ALL
SELECT 'good dog catch ball good dog'
)
SELECT dedup_repetition(phrase) deduped_phrase FROM sample_data;