Tags: scala, apache-spark, apache-spark-sql, duplicates

Remove all records that are duplicated in a Spark DataFrame


I have a Spark DataFrame with multiple columns. I want to find and remove rows that have duplicated values in one column (the other columns can differ).

I tried using dropDuplicates(col_name), but it only drops the extra duplicate entries and still keeps one record of each in the DataFrame. What I need is to remove every record whose value was duplicated in the first place.
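To illustrate with a toy example (hypothetical data; assumes a SQLContext named sqlContext is available):

    import sqlContext.implicits._

    // hypothetical data: id 1 appears twice, id 2 once
    val df = Seq((1, "a"), (1, "b"), (2, "c")).toDF("id", "value")

    df.dropDuplicates(Seq("id")).show()
    // keeps one row per id, so both id 1 and id 2 survive
    // what I want instead: only the id-2 row, because id 1 was duplicated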

I am using Spark 1.6 and Scala 2.10.


Solution

  • I would use window functions for this. Let's say you want to remove rows with a duplicate id:

    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions.count
    // assumes import sqlContext.implicits._ is in scope for the $"..." syntax
    
    df
      .withColumn("cnt", count("*").over(Window.partitionBy($"id")))  // rows per id
      .where($"cnt" === 1)  // keep only ids that occur exactly once
      .drop("cnt")          // remove the helper column
      .show()
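    
    For reference, the same result can be had without window functions by counting per key and joining back; a sketch, assuming the same df and id column:
    
    // ids that occur exactly once; groupBy().count() yields a "count" column
    val singles = df.groupBy("id").count().where($"count" === 1).drop("count")
    
    // an inner join keeps only the rows whose id was never duplicated
    df.join(singles, "id").show()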