There are two ways to select and filter data from a Spark DataFrame df.

First:
df = df.filter("filter definition").select('col1', 'col2', 'col3')
and second:
df = df.select('col1', 'col2', 'col3').filter("filter definition")

Suppose we then call the count action. Which of the two is more performant, given that filter and select can be swapped (i.e., the filter definition only references the selected columns)? Why? Does swapping filter and select make a difference for other actions?
Spark (version 1.6 and above) uses the Catalyst optimiser for queries, so a less performant query is transformed into an efficient one. To confirm this, you can call explain(true) on each DataFrame and inspect its optimised plan: the plans are the same for both queries.

PS: A more recent addition is the cost-based optimiser.