google-bigqueryregexp-replace

REGEXP replace with blank


I am exploring a database using BigQuery and following a relatively old document which is returning an error with the below query. Entries in the column in question has the form |column1| |:--- | |tag1,3125;tag2,024|

Where the numbers after the comma are irrelevant character offsets. The code is trying to 1) delimit rows by the semicolon and 2) remove the comma and character offset and just retain the tag name. The suggested expression is the below but it returns the error: "Syntax error: Unclosed string literal at [3:52]"

SELECT person, COUNT(*) as count
FROM (
select REGEXP_REPLACE(SPLIT(column1,';'), r',.*', ") person
from 'db_name')
group by person
ORDER BY 2 DESC

I have tried several other iterations but all return an error. Any suggestions on how to amend the query?


Solution

  • Fixing the error reported

    I am not sure that SPLIT with REGEX_REPLACE is the way to go. When I corrected the double quote to 2x double quotes, I got an error related to the SPLIT being an array, but you're then trying to work with that as a string.

    with db_name AS (
      SELECT 
          "|column1| |:--- | |tag1,3125;tag2,024|" AS column1
    )
    
    select REGEXP_REPLACE(SPLIT(column1,';'), r',.*',"") person
      from db_name
    

    Error: No matching signature for function REGEXP_REPLACE for argument types: ARRAY<STRING>, STRING, STRING. Supported signatures: REGEXP_REPLACE(STRING, STRING, STRING); REGEXP_REPLACE(BYTES, BYTES, BYTES) at [6:8]

    So even if you overcome the error described, you will still not be able to use REPLACE.

    Dealing with the business problem, not the technical problem

    It seems like you are looking to extract the values from the "tag" values up to (but not including) the comma, then you can extract them as follows.

    1. Clean the data we don't need from the start and end, i.e. anything before the first "tag" value, and the pipe delimiters.
    2. Split the data at the semicolon so we only work with the "tag..." strings. It is treated as an array, so we need to turn it back into a string.
    3. Find the comma in each row and strip that and everything after it.
    4. Group by the person values.

    The following code is broken up so I can explain it along the way. You can consolidate it as required, but it won't be so readable.

    --specify the supplied string example in a CTE
    WITH
      tmp AS (
        SELECT 
          "|column1| |:--- | |tag1,3125;tag2,024|" AS person
      ),
    

    Next we strip out the spurious data up to the first "tag" value and remove any remaining pipe characters (|).

    This changes |column1| |:--- | |tag1,3125;tag2,024| to tag1,3125;tag2,024| and then to tag1,3125;tag2,024, as follows:

      tmp2 AS (
        SELECT 
          REPLACE(SUBSTRING(person,STRPOS(person,"tag"),LENGTH(person)),"|","") AS person
        FROM tmp
      ),
    

    Now we split the value at the semicolon from rows to create an array of tag... values.

      tmp3 AS (
        SELECT 
          SPLIT(person,";") AS person
        FROM tmp2
      ),
    

    This gives us the array of

    tag1,3125
    tag2,024
    

    Next, UNNEST makes the array into rows of strings so we can manipulate them further.

    Find the index position of the comma and take the start of the string up to that index position, but we have to go back 1 character.

    e.g.

      tmp4 AS (
        SELECT 
          SUBSTRING(u,1,STRPOS(u,",")-1) AS person
        FROM tmp3,
          UNNEST(person) AS u
      )
    

    This gives us the string values of

    tag1
    tag2
    

    Finally you can group by the person values:

    SELECT person, COUNT(*) AS cnt
    FROM tmp4
    GROUP by 1
    ORDER by 2 DESC