I have a large data set in Google BigQuery with millions of rows of dirty data (App tracking) that I am trying to clean up. One of my problems is that the same data got sent to different columns for different events triggered in the App. By this I mean that maybe the country was sent to custom dimension 1 for some events, but to custom dimension 147 for other events. I can't post actual data, but a SELECT * FROM table_with_dirty_data
would produce something like this:
date | session | eventAction | cd001 | cd002 | cd004 | cd005
-----|---------|-------------|-------|----------|----------|-------
1 | 1 | 'event_1' | '1' | 'Pizza' | null | '21'
1 | 1 | 'event_2' | '10' | '25' | 'Pizza' | '14.56'
1 | 1 | 'event_3' | '3.1' | null | '15' | 'France'
1 | 2 | 'event_1' | '6' | 'Burger' | null | '21'
1 | 2 | 'event_2' | '21' | '25' | 'Burger' | '12.6'
The ultimate goal here is to end up with a clean table that can be analyzed. Something like this:
date | session | eventAction | country | vendor | product | price
-----|---------|-------------|----------|----------|----------|-------
1 | 1 | 'event_1' | 'France' | '1' | 'Pizza' | '14.56'
1 | 1 | 'event_2' | 'France' | '1' | 'Pizza' | '14.56'
1 | 1 | 'event_3' | 'France' | '1' | 'Pizza' | '14.56'
1 | 2 | 'event_1' | 'Spain' | '25' | 'Burger' | '12.6'
1 | 2 | 'event_2' | 'Spain' | '25' | 'Burger' | '12.6'
I am aware that some events will need some degree of statistical imputation and data type casting, but for now I am only concerned with getting every variable into its own column. So I've created an auxiliary table (let's call it matrix
) that looks like this:
event_name | variable_1 | variable_2 | variable_3
-----------|------------|------------|----------
'event_1' | 'cd020' | 'cd035' | 'cd120'
'event_2' | 'cd020' | 'cd146' | 'cd056'
'event_3' | 'cd001' | 'cd020' | 'cd035'
and so on, where the value in each cell of column variable_#
is the column name in the table_with_dirty_data
where the information can be found. That is, variable_1
can be found in a column called cd020
for events with event_name
'event_1' and 'even_2', but in a column called cd001
for events with event_name
'event_3'. So basically what matrix
does is to map to which custom dimension each variable was sent for each event.
I have hundreds of different events in table_with_dirty_data
, and matrix
holds all 200 GA custom dimensions, so doing something like
SELECT
CASE
WHEN event_name = 'event_1' THEN cd020
WHEN event_name = 'event_2' THEN cd020
WHEN event_name = 'event_3' THEN cd001
END AS variable_1
, CASE
WHEN event_name = 'event_1' THEN cd035
WHEN event_name = 'event_2' THEN cd146
WHEN event_name = 'event_3' THEN cd020
END AS variable_2
, CASE
WHEN event_name = 'event_1' THEN cd120
WHEN event_name = 'event_2' THEN cd056
WHEN event_name = 'event_3' THEN cd035
END AS variable_3
FROM table_with_dirty_data
would take me a looooong time and is very error-prone. What I am trying to do is to use SELECT
statements that return the column name (cd###
) from the table_with_dirty_data
where the information is stored and the use a WHILE
to loop over all different events. So, for example, with event_name = 'event_1'
it would be something like this:
SELECT
CASE
WHEN event_name = 'event_1'
THEN (SELECT variable_1 FROM matrix WHERE event_name = 'event_1')
END AS variable_1
, CASE
WHEN event_name = 'event_1'
THEN (SELECT variable_2 FROM matrix WHERE event_name = 'event_1')
END AS variable_2
, CASE
WHEN event_name = 'event_1'
THEN (SELECT variable_3 FROM matrix WHERE event_name = 'event_1')
END AS variable_3
FROM table_with_dirty_data
WHERE event_name = 'event_1'
The goal here is that I can then loop over an array with all the event_name
s (something that can be easily achieved). Ultimately I need all events in one table, but I'm ok with having a table for curating each event, as long as it can be done programmatically (something I personally don't know if it's even possible in GBQ...would have to check on that).
The issue is that the SELECT
statements I am using get evaluated to a string, so the result of the queries inside the CASE
clause are string literals. For example, if I were to run it for event = 'event_1',
SELECT variable_1 FROM matrix WHERE event_name = 'event_1'
evaluates to
'cd020'
which then cause the outer query to become
SELECT
CASE
WHEN event_name = 'event_1' THEN 'cd020'
END AS variable_1
, CASE
WHEN event_name = 'event_1' THEN 'cd035'
END AS variable_2
, CASE
WHEN event_name = 'event_1' THEN 'cd120'
END AS variable_3
FROM table_with_dirty_data
WHERE event_name = 'event_1'
which produces a table like this
event_name | variable_1 | variable_2 | variable_3
-----------|------------|------------|----------
'event_1' | 'cd020' | 'cd035' | 'cd120'
'event_1' | 'cd020' | 'cd035' | 'cd120'
'event_1' | 'cd020' | 'cd035' | 'cd120'
instead of returning the value stored in columns cd020
, cd035
or cd120
for each time the event 'event_1' is found.
Does anyone know a way to unquote the result of those inner queries so that when the outer query executes, they are transformed into column names (so 'cd020' becomes cd020
)??
PS: I am also open to a totally different strategy if anyone knows one.
I solved the problem using the EXECUTE IMMEDIATE
clause as recommended by @shollyman. I did it for just one of the events but I believe this answers the original question (scaling it to other events is just a matter of writing a WHILE
loop). I'll go step by step since I'm not using an actual repex.
Step one I declared a variable called event
with the name of the event that the query needs to look for.
DECLARE event STRING DEFAULT 'event_1';
Then I declared a variable that holds the column name where the query needs to look for the variable.
DECLARE variable_name STRING DEFAULT (SELECT variable_1 FROM matrix WHERE event_name = event);
Then I wrote the query as I would normally do, but using the EXECUTE IMMEDIATE
clause. I used triple double-quotes so that I could break it into several lines for readability).
EXECUTE IMMEDIATE CONCAT("""
SELECT
CASE WHEN event_name = '""", event, "' THEN ", variable_name, """ END AS variable_1
FROM table_with_dirty_data
WHERE event_name = '""", event, """'
""");
If others are going to use this, do take note of the lonely single quotes that I used before or after some triple double quotes. I did that because, for example, the declared variable event
, even though it's a string, seems to get concatenated as event
(no single quotes around it) and that breaks the query execution.