I'm trying to find whether any character in a string repeats five or more times in clickhouse cloud. Examples:
'12344444156'
'abcrrrrrggds'
I know the regex that works in general:
.*(.)\1{4,}.*
But clickhouse is using the RE2 engine which does not support backreference. How else can I do this?
I tried:
WITH '12344444156' as str
SELECT str, extract(str, '.*(.)\\1{4,}.*');
Expected output:
12344444156
Got:
SQL Error [427] [07000]: Code: 427. DB::Exception: OptimizedRegularExpression: cannot compile re2: .*(.)\1{4,}.*, error: invalid escape sequence: \1. Look at https://github.com/google/re2/wiki/Syntax for reference. Please note that if you specify regex as an SQL string literal, the slashes have to be additionally escaped. For example, to match an opening brace, write '\(' -- the first slash is for SQL and the second one is for regex: While processing '12344444156' AS str, extract(str, '.*(.)\\1{4,}.*'). (CANNOT_COMPILE_REGEXP) (version 24.6.1.4410 (official build))
, server ClickHouseNode [uri=https://w2z74jyoma.ap-southeast-2.aws.clickhouse.cloud:8443/default, options={use_server_time_zone=false,use_time_zone=false}]@248459710
Try this way:
WITH [
' 123444156 ff',
' 1234444156 gg',
'aaa bb cc ddd',
'aaa bb cc dddd',
'111 111',
' 111 111 '] AS strs
SELECT
arrayJoin(strs) AS str,
splitByString('', str) AS char_arr,
arrayMap((x, index) -> x == char_arr[index - 1] and not match(x, '\W'), char_arr, arrayEnumerate(char_arr)) AS repeat_arr,
hasSubstr(repeat_arr, [1, 1, 1]) AS is_repeated;
/*
┌─str─────────────────┬─char_arr──────────────────────────────────────────────────────────────────────┬─repeat_arr──────────────────────────────┬─is_repeated─┐
1. │ 123444156 ff │ [' ','1','2','3','4','4','4','1','5','6',' ',' ',' ','f','f'] │ [0,0,0,0,0,1,1,0,0,0,0,0,0,0,1] │ 0 │
2. │ 1234444156 gg │ [' ','1','2','3','4','4','4','4','1','5','6',' ',' ','g','g'] │ [0,0,0,0,0,1,1,1,0,0,0,0,0,0,1] │ 1 │
3. │ aaa bb cc ddd │ ['a','a','a',' ','b','b',' ','c','c',' ','d','d','d'] │ [0,1,1,0,0,1,0,0,1,0,0,1,1] │ 0 │
4. │ aaa bb cc dddd │ ['a','a','a',' ','b','b',' ','c','c',' ','d','d','d','d'] │ [0,1,1,0,0,1,0,0,1,0,0,1,1,1] │ 1 │
5. │ 111 111 │ ['1','1','1',' ','1','1','1'] │ [0,1,1,0,0,1,1] │ 0 │
6. │ 111 111 │ [' ',' ',' ',' ','1','1','1',' ',' ',' ',' ',' ','1','1','1',' ',' ',' ',' '] │ [0,0,0,0,0,1,1,0,0,0,0,0,0,1,1,0,0,0,0] │ 0 │
└─────────────────────┴───────────────────────────────────────────────────────────────────────────────┴─────────────────────────────────────────┴─────────────┘
*/