apache-spark

Difference between describe() and summary() in Apache Spark


What is the difference between summary() and describe()?

They seem to serve the same purpose, but I couldn't find any difference between them (if there is one).


Solution

  • When arguments are passed, the two functions serve different purposes:

    .describe() takes cols: String* (column names in the df) as optional varargs.

    .summary() takes statistics: String* (count, mean, stddev, min, max, percentiles, etc.) as optional varargs.

    Both accept multiple values at once; see the sketch after the example below.

    Example:

    scala> val df_des=Seq((1,"a"),(2,"b"),(3,"c")).toDF("id","name")
    scala> df_des.describe().show(false) //without args
    //Result:
    //+-------+---+----+
    //|summary|id |name|
    //+-------+---+----+
    //|count  |3  |3   |
    //|mean   |2.0|null|
    //|stddev |1.0|null|
    //|min    |1  |a   |
    //|max    |3  |c   |
    //+-------+---+----+
    scala> df_des.summary().show(false) //without args
    //Result:
    //+-------+---+----+
    //|summary|id |name|
    //+-------+---+----+
    //|count  |3  |3   |
    //|mean   |2.0|null|
    //|stddev |1.0|null|
    //|min    |1  |a   |
    //|25%    |1  |null|
    //|50%    |2  |null|
    //|75%    |3  |null|
    //|max    |3  |c   |
    //+-------+---+----+
    scala> df_des.describe("id").show(false) //descibe on id column only
    //+-------+---+
    //|summary|id |
    //+-------+---+
    //|count  |3  |
    //|mean   |2.0|
    //|stddev |1.0|
    //|min    |1  |
    //|max    |3  |
    //+-------+---+
    scala> df_des.summary("count").show(false) //get count summary only
    //+-------+---+----+
    //|summary|id |name|
    //+-------+---+----+
    //|count  |3  |3   |
    //+-------+---+----+
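
    Since both functions are variadic, several values can be passed in one call. A minimal sketch, assuming the same df_des as above and Spark 2.3+ (where summary() and its percentile arguments are available):

    scala> df_des.describe("id", "name").show(false) //statistics for both columns
    scala> df_des.summary("count", "min", "50%", "max").show(false) //only the requested statistics; arbitrary percentiles like "50%" are allowed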