Using this dataframe:
from pyspark.sql import functions as F
df = spark.createDataFrame([(None,), (1,), (2,)], ['col_name'])
df.show()
# +--------+
# |col_name|
# +--------+
# |    null|
# |       1|
# |       2|
# +--------+
calculating MAX ignores nulls by default:
max = F.max('col_name').alias('col_name')
df.agg(max).show()
# +--------+
# |col_name|
# +--------+
# |       2|
# +--------+
Is there a way to aggregate using MAX without ignoring null values? If there is a null in the column, it should return null.
We can gate the MAX on a check that no value is null, but it's quite verbose...
max = F.when(F.expr("every(col_name is not null)"), F.max("col_name")).alias("col_name")
df.agg(max).show()
# +--------+
# |col_name|
# +--------+
# |    null|
# +--------+
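An equivalent option that also works on older Spark versions is to compare the non-null count against the total row count (just a sketch; F.count(F.lit(1)) counts every row, nulls included, while F.count('col_name') skips nulls):
max = F.when(F.count('col_name') == F.count(F.lit(1)), F.max('col_name')).alias('col_name')
df.agg(max).show()
# +--------+
# |col_name|
# +--------+
# |    null|
# +--------+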
Spark 3.5+ has F.every, which makes this a bit shorter:
max = F.when(F.every(~F.isnull('col_name')), F.max("col_name")).alias("col_name")
df.agg(max).show()
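The output is the same null as above. If you need this in several places, you could wrap it in a small helper (max_with_nulls is a hypothetical name, not a built-in):
def max_with_nulls(colname):
    # Hypothetical helper: MAX that returns null if any value in the column is null
    return F.when(F.every(~F.isnull(colname)), F.max(colname)).alias(colname)

df.agg(max_with_nulls('col_name')).show()
# +--------+
# |col_name|
# +--------+
# |    null|
# +--------+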