Tags: sql, google-bigquery, dbt, incremental

DBT incremental models process too much data


I'm currently working on warehousing some Google Search Console data into BigQuery.

Since GSC produces a lot of data, I want to use incremental models with one partition per day. My models look something like this:

with
  raw_data as (
    select * from {{ source('searchconsole', 'searchconsole') }}
    {% if is_incremental() %}
    -- this filter will only be applied on an incremental run
    where data_date > (select max(data_date) from {{ this }})
    {% endif %}
  ),

  further_transformations as (
    select * from ...
  )

select * from further_transformations

And their config then looks like this:

dbt_project.yml:

(...)

models:
  project_name:
    +materialized: incremental
    +incremental_strategy: insert_overwrite

Local properties.yml:

version: 2

models:
  - name: model_xy
    description: Description for model xy
    config:
      partition_by:
        field: 'data_date'
        data_type: 'date'
        granularity: 'day'

The output from dbt shows that on an incremental run, data processed is on the order of tens of GB:

12:35:32  1 of 6 START sql incremental model model_xy .......... [RUN]
12:36:13  1 of 6 OK created sql incremental model model_xy ..... [SCRIPT (32.0 GB processed) in 40.68s]

This seems like way too much for a run where, in the end, no rows should be added (when I run it right now). There seems to be no significant benefit of an incremental run over a full-refresh run; the incremental run even takes longer. From my best understanding of incremental models, this is exactly how it should be done, so am I missing something? Is there a problem with filtering the data this early in a CTE?

I've tried to look for people with similar problems, but it seems like people generally just aren't sure how to use incremental models.


Solution

  • While I'm still confused as to why dbt processes these huge amounts of data, this seems to be correct.
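
    One possible explanation for the large scans (my assumption, not verified): BigQuery cannot prune partitions when the filter value comes from a subquery, so `where data_date > (select max(data_date) from {{ this }})` still scans the full source table. With the `insert_overwrite` strategy, dbt-bigquery declares a scripting variable `_dbt_max_partition` holding the destination's latest partition, and comparing against a scripting variable does allow partition pruning. A sketch:

    ```sql
    select * from {{ source('searchconsole', 'searchconsole') }}
    {% if is_incremental() %}
    -- _dbt_max_partition is declared by dbt's insert_overwrite strategy;
    -- unlike a subquery, a scripting variable lets BigQuery prune partitions
    where data_date > _dbt_max_partition
    {% endif %}
    ```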

    On the other hand, I've found another gotcha that was rather silly on my part:

    Since online articles always told me to filter the new increment as early as possible and my models all depend on each other, I assumed that I could just filter the data in an upstream model and this would also filter data for downstream models.

    This is of course wrong, because a downstream model still reads the entire upstream table, not just the new partition that was added in the current run.

    Adding the `is_incremental()` filter to every model therefore improved performance by a mile.
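
    For illustration, each downstream model then repeats the filter, even though its upstream model was already filtered (a sketch with made-up model names):

    ```sql
    -- downstream_model.sql
    select *
    from {{ ref('upstream_model') }}
    {% if is_incremental() %}
    -- ref() resolves to the whole upstream table, so the new
    -- partition has to be filtered again in every downstream model
    where data_date > (select max(data_date) from {{ this }})
    {% endif %}
    ```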


    One last general tip from my side: create a model that manages the filter condition for the latest partition, e.g. the maximum date across the partitions in your warehouse. Every other model can then use this model to filter on an incremental run. This helped me in cases where a model has no column of its own to look up the latest existing partition, e.g. a relation table that consists only of two foreign keys.

    To make sure that this model runs before every other model without creating a cycle, I configured one of my fact tables as a source (this might cause an error on the first run, when the fact table doesn't exist yet). Then I referenced the model in my most upstream model within a comment like this:

    -- depends_on: {{ ref('incremental_log') }}
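
    Put together, the setup could look something like this (all names and details here are my assumptions, not taken from the original post): the helper model reads the latest partition from the fact table registered as a source, and every other model filters against the helper.

    ```sql
    -- incremental_log.sql
    -- the fact table is registered as a source here to avoid a dbt cycle
    select max(data_date) as max_data_date
    from {{ source('own_project', 'fact_table') }}
    ```

    ```sql
    -- some_other_model.sql
    -- depends_on: {{ ref('incremental_log') }}
    select *
    from {{ source('searchconsole', 'searchconsole') }}
    {% if is_incremental() %}
    where data_date > (select max_data_date from {{ ref('incremental_log') }})
    {% endif %}
    ```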