google-bigquerystitch

How to choose between Views or Scheduled Queries for de-duplicating BigQuery tables imported via Stitch?


I build statistical output generated on-demand from data stored in BigQuery tables. Some data is imported daily via stitch using "Append-Only". This results in duplicated observations in the imported tables (around 20kk rows growing 8kk yearly).

I could either schedule a BigQuery query to store deduplicated values in a cleaned table, or build views to do the same, but I don't understand the tradeoffs in terms of:

Am I correct to assume that daily scheduled queries to store deduplicated data is more costly (for re-writing stored tables) but speeds up later queries to the deduplicated data (saving on costs for usage) ?

The deduplicated data will namely in turn be queried hundreds of times daily to produce dashboard output for which responsiveness is a concern.

How should I argue when deciding for the better solution?


Solution

  • Lets go to the facts:

    1. The price you will pay in the query is the same regardless you are using a View or a Scheduled Query
    2. When using a Scheduled Query, you will need to pay for the data you store in the de-duplicated table. As a View will not store any data, you will not have extra charges.
    3. In terms of speed, using the Scheduled Query approach wins because you have your data already de-duplicated and cleaned. If you are going to feed dashboards with this data, the View approach can lead to laziness in the dashboard loading.
    4. Another possible approach for you is using Materialized Views, which are smarter Views that periodically cache results in order to improve performance. In this guide you can find some information about choosing between Scheduled Queries and Materialized Views:

    When should I use scheduled queries versus materialized views?

    Scheduled queries are a convenient way to run arbitrarily complex calculations periodically. Each time the query runs, it is being run fully. The previous results are not used, and you pay the full price for the query. Scheduled queries are great when you don't need the freshest data and you have a high tolerance for data staleness.

    Materialized views are suited for when you need to query the latest data while cutting down latency and cost by reusing the previously computed result. You can use materialized views as pseudo-indexes, accelerating queries to the base table without updating any existing workflows.

    As a general guideline, whenever possible and if you are not running arbitrarily complex calculations, use materialized views.