I build statistical output generated on-demand from data stored in BigQuery tables. Some data is imported daily via stitch using "Append-Only". This results in duplicated observations in the imported tables (around 20kk rows growing 8kk yearly).
I could either schedule a BigQuery query to store deduplicated values in a cleaned table, or build views to do the same, but I don't understand the tradeoffs in terms of:
Am I correct to assume that daily scheduled queries to store deduplicated data is more costly (for re-writing stored tables) but speeds up later queries to the deduplicated data (saving on costs for usage) ?
The deduplicated data will namely in turn be queried hundreds of times daily to produce dashboard output for which responsiveness is a concern.
How should I argue when deciding for the better solution?
Lets go to the facts:
When should I use scheduled queries versus materialized views?
Scheduled queries are a convenient way to run arbitrarily complex calculations periodically. Each time the query runs, it is being run fully. The previous results are not used, and you pay the full price for the query. Scheduled queries are great when you don't need the freshest data and you have a high tolerance for data staleness.
Materialized views are suited for when you need to query the latest data while cutting down latency and cost by reusing the previously computed result. You can use materialized views as pseudo-indexes, accelerating queries to the base table without updating any existing workflows.
As a general guideline, whenever possible and if you are not running arbitrarily complex calculations, use materialized views.