I would like to query a table which has few columns (such as 4 below) from where i would like to create a snapshot table which can provide insight on the attribute count everyday
userId | attr1 | attr2 | attr3 |
---|---|---|---|
1 | true | Hello | Dog |
2 | false | Hi | |
3 | Hello | ||
4 | false | Cat | |
5 | true | Hi | Dog |
6 | false | Cat | |
7 | Hi | ||
8 | false | Dog | |
9 | Cat | ||
10 | false | Hello | Cat |
Such as from above
userIdCount | attrName | dateWhenSnapshotInserted |
---|---|---|
7 | attr1 | April 16, 2024 |
6 | attr2 | April 16, 2024 |
7 | attr3 | April 16, 2024 |
In this case, we may have millions of rows in original table with more than 100 columns and we would like to populate snapshot table everyday with count of user with non null values for each attribute and date when this snapshot created.
I am framing my query as below :-
select COUNT(DISTINCT userId) as userIdCount from orig_table where
attr1 IS NOT NULL UNION Select COUNT(DISTINCT userId) as userIdCount
from orig_table where attr2 IS NOT NULL UNION Select COUNT(DISTINCT
userId) as userIdCount from orig_table where attr3 IS NOT NULL;
Is there a optimized way to write multiple select query together. I started with creating multiple select query and inserting it to output table one by one but that seems like a bad pattern. Please suggest how to achieve this so as this can be scaled.
ps: Newbie (who is exploring DB queries first time)
I think you mean a materialized view
. A regular view
would recalculate the query on each select
from it, whereas a materialized view
keeps a snapshot of the result until you refresh
it.
The multiple selects part of the problem can be handled with dynamic SQL in a PL/pgSQL block.
do $f$
declare column_name_ text;
dynamic_sql_query text;
dynamic_sql_query_per_table text:=$$SELECT COUNT(DISTINCT "userId") AS %1$I
,'%2$I' as "attrName"
,now() AS "dateWhenSnapshotInserted"
FROM orig_table
WHERE %2$I IS NOT NULL$$;
begin
for column_name_ in select column_name
from information_schema.columns
where table_name='orig_table'
loop
dynamic_sql_query:=concat_ws(' UNION ALL '
,dynamic_sql_query
,format( dynamic_sql_query_per_table
,column_name_||'Count'
,column_name_));
end loop;
execute format('CREATE MATERIALIZED VIEW orig_table_counts AS %s',dynamic_sql_query);
end $f$;
select * from orig_table_counts;
userIdCount | attrName | dateWhenSnapshotInserted |
---|---|---|
10 | "userId" | 2024-04-16 09:07:44.604273+00 |
7 | attr1 | 2024-04-16 09:07:44.604273+00 |
6 | attr2 | 2024-04-16 09:07:44.604273+00 |
7 | attr3 | 2024-04-16 09:07:44.604273+00 |
concat_ws()
takes care of putting a union all
between each query that collects counts for each column. It skips null
arguments, so having dynamic_sql_query
initially declared but unpopulated prevents a leading/trailing union
separator.$$
are quotes that let you use other types of quotes inside them. %1$I
and %2$I
are format()
placeholders, instructing it to insert the 1st/2nd argument as an optionally double-quoted I
dentifier. %s
tells it to put another one, as a plain string, without any special handling."
double-quotes, they all get folded to lowercase, meaning that userId
might in reality be called userid
.