sqlregexsnowflake-cloud-data-platform

Snowflake SQL Query to find records where character repeats at least 5 times in a row


I'm trying to query a table to find all instances where a character repeats at least 5 times in a row.

I've tried:

Select Column
From Table
where Column REGEXP '(.)\1{4,}'

but it returns nothing.

The table includes the following entries that SHOULD be returned:

1.111111111111E31
00000000000000000
xxxxxxxxxxxxxxxxx

Solution

  • Using backreferences

    Snowflake does not support backreferences in regular expression patterns (known as “squares” in formal language theory); however, backreferences are supported in the replacement string of the REGEXP_REPLACE function.

    Second when the backslash is used inside single-quoted string it has to be escaped:

    In single-quoted string constants, you must escape the backslash character in the backslash-sequence. For example, to specify \d, use \\d


    Possible workaround is to write a custom UDF:

    CREATE OR REPLACE FUNCTION regexp_test(r TEXT, t TEXT)
      RETURNS BOOLEAN
      LANGUAGE PYTHON
      STRICT
      RUNTIME_VERSION = '3.12'
      HANDLER = 'main'
    AS $$
    import re
    def main(r,t):
      return bool(re.search(r,t))
    $$;
    
    
    WITH cte(Col) AS (
        SELECT '1.111111111111E31'  UNION ALL
        SELECT '00000000000000000'  UNION ALL
        SELECT 'xxxxxxxxxxxxxxxxx'  UNION ALL
        SELECT '12345678'
    )
    SELECT *, regexp_test('(.)\\1{4,}', col)
    FROM cte;
    

    Output:

    +-------------------+--------------------------------+
    |        COL        | REGEXP_TEST('(.)\\1{4,}', COL) |
    +-------------------+--------------------------------+
    | 1.111111111111E31 | TRUE                           |
    | 00000000000000000 | TRUE                           |
    | xxxxxxxxxxxxxxxxx | TRUE                           |
    | 12345678          | FALSE                          |
    +-------------------+--------------------------------+