apache-spark, apache-spark-sql

Spark - SELECT WHERE or filtering?


What's the difference between selecting with a where clause and filtering in Spark?
Are there any use cases in which one is more appropriate than the other?

When do I use

DataFrame newdf = df.select(df.col("*")).where(df.col("somecol").leq(10))

and when is

DataFrame newdf = df.select(df.col("*")).filter("somecol <= 10")

more appropriate?


Solution

  • According to the Spark documentation, where() is an alias for filter().

    filter(condition) Filters rows using the given condition. where() is an alias for filter().

    Parameters: condition – a Column of types.BooleanType or a string of SQL expression.

    >>> df.filter(df.age > 3).collect()
    [Row(age=5, name=u'Bob')]
    >>> df.where(df.age == 2).collect()
    [Row(age=2, name=u'Alice')]
    
    >>> df.filter("age > 3").collect()
    [Row(age=5, name=u'Bob')]
    >>> df.where("age = 2").collect()
    [Row(age=2, name=u'Alice')]