sqlapache-sparkgroup-bynullapache-spark-sql

spark dataframe groupping does not count nulls


I have a spark DataFrame which is grouped by a column aggregated with a count:

df.groupBy('a').agg(count("a")).show()

+---------+----------------+
|a        |count(a)        |
+---------+----------------+
|     null|               0|
|      -90|           45684|
+---------+----------------+


df.select('a').filter(col('a').isNull()).count()

returns

warning: there was one feature warning; re-run with -feature for details
res9: Long = 26834

which clearly shows that the null values were not counted initially.

What is the reason for this behaviour? I would have expected (if nullat all is contained in the grouping result) to properly see the counts.


Solution

  • Yes, count applied to a specific column does not count the null values. If you want to include the null values, use:

    df.groupBy('a').agg(count("*")).show()