sqlregexclickhouse

How to find repeating characters in a string?


I'm trying to find whether any character in a string repeats five or more times in clickhouse cloud. Examples:

'12344444156'
'abcrrrrrggds'

I know the regex that works in general:

.*(.)\1{4,}.*

But clickhouse is using the RE2 engine which does not support backreference. How else can I do this?

I tried:

WITH '12344444156' as str
SELECT str, extract(str, '.*(.)\\1{4,}.*');

Expected output: 12344444156

Got:

SQL Error [427] [07000]: Code: 427. DB::Exception: OptimizedRegularExpression: cannot compile re2: .*(.)\1{4,}.*, error: invalid escape sequence: \1. Look at https://github.com/google/re2/wiki/Syntax for reference. Please note that if you specify regex as an SQL string literal, the slashes have to be additionally escaped. For example, to match an opening brace, write '\(' -- the first slash is for SQL and the second one is for regex: While processing '12344444156' AS str, extract(str, '.*(.)\\1{4,}.*'). (CANNOT_COMPILE_REGEXP) (version 24.6.1.4410 (official build))
, server ClickHouseNode [uri=https://w2z74jyoma.ap-southeast-2.aws.clickhouse.cloud:8443/default, options={use_server_time_zone=false,use_time_zone=false}]@248459710

Solution

  • Try this way:

    WITH [
      ' 123444156   ff', 
      ' 1234444156  gg', 
      'aaa bb cc ddd',
      'aaa bb cc dddd',
      '111 111',
      '    111     111    '] AS strs
    SELECT
        arrayJoin(strs) AS str,
        splitByString('', str) AS char_arr,
        arrayMap((x, index) -> x == char_arr[index - 1] and not match(x, '\W'), char_arr, arrayEnumerate(char_arr)) AS repeat_arr,
        hasSubstr(repeat_arr, [1, 1, 1]) AS is_repeated;
    
    /*
       ┌─str─────────────────┬─char_arr──────────────────────────────────────────────────────────────────────┬─repeat_arr──────────────────────────────┬─is_repeated─┐
    1. │  123444156   ff     │ [' ','1','2','3','4','4','4','1','5','6',' ',' ',' ','f','f']                 │ [0,0,0,0,0,1,1,0,0,0,0,0,0,0,1]         │           0 │
    2. │  1234444156  gg     │ [' ','1','2','3','4','4','4','4','1','5','6',' ',' ','g','g']                 │ [0,0,0,0,0,1,1,1,0,0,0,0,0,0,1]         │           1 │
    3. │ aaa bb cc ddd       │ ['a','a','a',' ','b','b',' ','c','c',' ','d','d','d']                         │ [0,1,1,0,0,1,0,0,1,0,0,1,1]             │           0 │
    4. │ aaa bb cc dddd      │ ['a','a','a',' ','b','b',' ','c','c',' ','d','d','d','d']                     │ [0,1,1,0,0,1,0,0,1,0,0,1,1,1]           │           1 │
    5. │ 111 111             │ ['1','1','1',' ','1','1','1']                                                 │ [0,1,1,0,0,1,1]                         │           0 │
    6. │     111     111     │ [' ',' ',' ',' ','1','1','1',' ',' ',' ',' ',' ','1','1','1',' ',' ',' ',' '] │ [0,0,0,0,0,1,1,0,0,0,0,0,0,1,1,0,0,0,0] │           0 │
       └─────────────────────┴───────────────────────────────────────────────────────────────────────────────┴─────────────────────────────────────────┴─────────────┘
    */