We have two approaches to selecting and filtering data from a Spark DataFrame df. First:
df = df.filter("filter definition").select('col1', 'col2', 'col3')
and second:
df = df.select('col1', 'col2', 'col3').filter("filter definition")
Suppose we then call the count action.
Which one is more performant, given that filter and select can be swapped in Spark (i.e., the filter condition references only the selected columns and nothing more)? Why? Does the filter/select ordering make any difference for other actions?
Spark (version 1.6 and above) uses the Catalyst optimiser for queries, so the less performant query will be transformed into the efficient one.
To confirm this, you can call explain(True) on each DataFrame and inspect its optimised plan, which is the same for both queries, as in the sketch below.
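For example, here is a minimal, self-contained PySpark sketch (assuming a local SparkSession, example data, and a hypothetical filter condition col1 > 1 over the columns col1, col2, col3 from the question) that builds both variants and prints their plans:

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").getOrCreate()

# Hypothetical example data with the three columns from the question.
df = spark.createDataFrame(
    [(1, "a", 10), (2, "b", 20), (3, "c", 30)],
    ["col1", "col2", "col3"],
)

# Approach 1: filter first, then select.
q1 = df.filter("col1 > 1").select("col1", "col2", "col3")
# Approach 2: select first, then filter.
q2 = df.select("col1", "col2", "col3").filter("col1 > 1")

# explain(True) prints the parsed, analysed, optimised, and physical
# plans. The optimised logical plans are identical: Catalyst pushes
# the filter below the projection in both cases, so count() (or any
# other action) executes the same physical plan.
q1.explain(True)
q2.explain(True)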
PS: A more recent change is the introduction of the cost-based optimiser (Spark 2.2+).
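If you want to experiment with the cost-based optimiser, here is a hedged sketch: spark.sql.cbo.enabled is the relevant configuration key (off by default, available from Spark 2.2), and the CBO relies on table statistics, collected here with ANALYZE TABLE over a hypothetical saved table named events:

# Enable the cost-based optimiser (Spark 2.2+); it is off by default.
spark.conf.set("spark.sql.cbo.enabled", "true")

# The CBO needs statistics, which persist only for saved tables, so
# write the DataFrame out as a table first. The table name 'events'
# is a hypothetical example.
df.write.mode("overwrite").saveAsTable("events")
spark.sql("ANALYZE TABLE events COMPUTE STATISTICS FOR COLUMNS col1")

# Plans over the table can now use row counts and column statistics.
spark.table("events").filter("col1 > 1").explain(True)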