
Do not ignore NULL in MAX


Using this dataframe:

from pyspark.sql import functions as F
df = spark.createDataFrame([(None,), (1,), (2,)], ['col_name'])
df.show()
# +--------+
# |col_name|
# +--------+
# |    null|
# |       1|
# |       2|
# +--------+

calculating the MAX ignores nulls by default:

max = F.max('col_name').alias('col_name')
df.agg(max).show()
# +--------+
# |col_name|
# +--------+
# |       2|
# +--------+

Is there a way to aggregate using MAX, but not ignoring null values? If there's null, it should return null.


Solution

  • We can do this, but it's quite verbose...

    max = F.when(F.expr("every(col_name is not null)"), F.max("col_name")).alias("col_name")
    df.agg(max).show()
    # +--------+
    # |col_name|
    # +--------+
    # |    null|
    # +--------+
    

    Spark 3.5+ (a bit shorter):

    max = F.when(F.every(~F.isnull('col_name')), F.max("col_name")).alias("col_name")
    df.agg(max).show()
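
    This returns the same single null row as the verbose version above. For completeness, here is a count-based sketch (my own variant, not part of the original answer; it reuses df and the F import from the question) that also works on older Spark versions: F.count('col_name') skips nulls, while F.count(F.lit(1)) counts every row, so the two only match when the column contains no nulls.

    # Alternative sketch: only take the max when the non-null count equals the total row count.
    max_or_null = F.when(
        F.count(F.lit(1)) == F.count('col_name'),  # true only if col_name has no nulls
        F.max('col_name')
    ).alias('col_name')
    df.agg(max_or_null).show()
    # +--------+
    # |col_name|
    # +--------+
    # |    null|
    # +--------+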