I am exploring a database using BigQuery and following a relatively old document which is returning an error with the below query. Entries in the column in question has the form |column1| |:--- | |tag1,3125;tag2,024|
Where the numbers after the comma are irrelevant character offsets. The code is trying to 1) delimit rows by the semicolon and 2) remove the comma and character offset and just retain the tag name. The suggested expression is the below but it returns the error: "Syntax error: Unclosed string literal at [3:52]"
SELECT person, COUNT(*) as count
FROM (
select REGEXP_REPLACE(SPLIT(column1,';'), r',.*', ") person
from 'db_name')
group by person
ORDER BY 2 DESC
I have tried several other iterations but all return an error. Any suggestions on how to amend the query?
Fixing the error reported
I am not sure that SPLIT
with REGEX_REPLACE
is the way to go. When I corrected the double quote to 2x double quotes, I got an error related to the SPLIT being an array, but you're then trying to work with that as a string.
with db_name AS (
SELECT
"|column1| |:--- | |tag1,3125;tag2,024|" AS column1
)
select REGEXP_REPLACE(SPLIT(column1,';'), r',.*',"") person
from db_name
Error:
No matching signature for function REGEXP_REPLACE for argument types: ARRAY<STRING>, STRING, STRING. Supported signatures: REGEXP_REPLACE(STRING, STRING, STRING); REGEXP_REPLACE(BYTES, BYTES, BYTES) at [6:8]
So even if you overcome the error described, you will still not be able to use REPLACE
.
Dealing with the business problem, not the technical problem
It seems like you are looking to extract the values from the "tag" values up to (but not including) the comma, then you can extract them as follows.
The following code is broken up so I can explain it along the way. You can consolidate it as required, but it won't be so readable.
--specify the supplied string example in a CTE
WITH
tmp AS (
SELECT
"|column1| |:--- | |tag1,3125;tag2,024|" AS person
),
Next we strip out the spurious data up to the first "tag" value and remove any remaining pipe characters (|
).
This changes |column1| |:--- | |tag1,3125;tag2,024|
to tag1,3125;tag2,024|
and then to tag1,3125;tag2,024
, as follows:
tmp2 AS (
SELECT
REPLACE(SUBSTRING(person,STRPOS(person,"tag"),LENGTH(person)),"|","") AS person
FROM tmp
),
Now we split the value at the semicolon from rows to create an array of tag... values.
tmp3 AS (
SELECT
SPLIT(person,";") AS person
FROM tmp2
),
This gives us the array of
tag1,3125
tag2,024
Next, UNNEST
makes the array into rows of strings so we can manipulate them further.
Find the index position of the comma and take the start of the string up to that index position, but we have to go back 1 character.
e.g.
STRPOS("tag123,xyz", ",")
returns 7 as the comma is in index position 7.STRPOS("tag123,xyz", ",")-1
returns 6SUBSTRING("tag123,xyz",1,STRPOS("tag123,xyz", ",")
returns index positions 1-7, i.e. tag123,
SUBSTRING("tag123,xyz",1,STRPOS("tag123,xyz", ",")-1)
returns index positions 1-6, i.e. tag123
tmp4 AS (
SELECT
SUBSTRING(u,1,STRPOS(u,",")-1) AS person
FROM tmp3,
UNNEST(person) AS u
)
This gives us the string values of
tag1
tag2
Finally you can group by the person values:
SELECT person, COUNT(*) AS cnt
FROM tmp4
GROUP by 1
ORDER by 2 DESC