apache-spark · bigdata · parquet

What is the difference between "predicate pushdown" and "projection pushdown"?


I have come across several sources of information, such as the one found here, which explain "predicate pushdown" as:

… if you can “push down” parts of the query to where the data is stored, and thus filter out most of the data, then you can greatly reduce network traffic.

However, I have also seen the term "projection pushdown" in other documentation, such as here, which appears to describe the same thing, but I am not sure my understanding is correct.

Is there a specific difference between the two terms?


Solution

  • Predicate refers to the where/filter clause, which affects the number of rows returned.

    Projection refers to the selected columns.

    For example:

    If your filters pass only 5% of the rows, only 5% of the table will be passed from storage to Spark, instead of the full table.

    If your projection selects only 3 columns out of 10, then fewer columns will be passed from storage to Spark. And if your storage is columnar (e.g. Parquet, but not Avro) and the non-selected columns are not part of the filter, then those columns won't even have to be read.
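
    The effect of the two pushdowns on the amount of data read can be sketched with a toy columnar scan in plain Python. This is an illustrative model only, not Spark or Parquet code; the `scan` function and its parameters are hypothetical, and real Parquet readers can additionally skip whole row groups using column statistics rather than reading the entire filter column:

    ```python
    # Toy model of a columnar scan, illustrating predicate and projection
    # pushdown. Illustrative sketch only -- not a Spark or Parquet API.

    # A 10-column, 100-row table stored column-wise, as a columnar format
    # like Parquet would store it.
    table = {f"col{i}": list(range(100)) for i in range(10)}

    def scan(table, projection, predicate_col=None, predicate=None):
        """Return the selected rows/columns and count the cells read.

        projection    -- columns to return (projection pushdown)
        predicate_col -- the column the row filter applies to
        predicate     -- row filter on that column (predicate pushdown)
        """
        cells_read = 0
        if predicate_col is not None:
            # Predicate pushdown: read only the filter column to decide
            # which rows to keep.
            cells_read += len(table[predicate_col])
            keep = [i for i, v in enumerate(table[predicate_col])
                    if predicate(v)]
        else:
            keep = range(len(next(iter(table.values()))))
        # Projection pushdown: columns outside the projection are never
        # touched.
        result = {c: [table[c][i] for i in keep] for c in projection}
        cells_read += len(keep) * len(projection)
        return result, cells_read

    # Full scan: all 10 columns, all 100 rows -> 1000 cells read.
    _, full_cost = scan(table, list(table))

    # With pushdown: 3 columns, rows where col0 < 5 (5% of the rows)
    # -> 100 cells for the filter column + 5 rows * 3 columns = 115.
    rows, pushed_cost = scan(table, ["col1", "col2", "col3"],
                             "col0", lambda v: v < 5)
    print(full_cost, pushed_cost)  # 1000 115
    ```

    The numbers mirror the answer above: the predicate cuts the rows to 5%, the projection cuts the columns to 3 of 10, and because the storage is columnar the unselected columns contribute nothing to the read cost.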