
Do not ignore NULL in MAX


Using this dataframe:

from pyspark.sql import functions as F
df = spark.createDataFrame([(None,), (1,), (2,)], ['col_name'])
df.show()
# +--------+
# |col_name|
# +--------+
# |    null|
# |       1|
# |       2|
# +--------+

calculating the MAX ignores nulls by default:

max = F.max('col_name').alias('col_name')
df.agg(max).show()
# +--------+
# |col_name|
# +--------+
# |       2|
# +--------+

Is there a way to aggregate using MAX, but not ignoring null values? If there's null, it should return null.


Solution

  • We can do this, but it's quite verbose...

    max = F.when(F.expr("every(col_name is not null)"), F.max("col_name")).alias("col_name")
    df.agg(max).show()
    # +--------+
    # |col_name|
    # +--------+
    # |    null|
    # +--------+
    

    Spark 3.5+ (a bit shorter):

    max = F.when(F.every(~F.isnull('col_name')), F.max("col_name")).alias("col_name")
    df.agg(max).show()
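
    This returns the same single null row as the verbose version above. For completeness, here is a count-based sketch (my own variant, not part of the original answer; it reuses df and the F import from the question) that also works on older Spark versions: F.count('col_name') skips nulls, while F.count(F.lit(1)) counts every row, so the two only match when the column contains no nulls.

    # Alternative sketch: only take the max when the non-null count equals the total row count.
    max_or_null = F.when(
        F.count(F.lit(1)) == F.count('col_name'),  # true only if col_name has no nulls
        F.max('col_name')
    ).alias('col_name')
    df.agg(max_or_null).show()
    # +--------+
    # |col_name|
    # +--------+
    # |    null|
    # +--------+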