apache-spark, pyspark

Efficient way to find columns that contain ANY null values


I want to return a list of all columns that contain at least one null value. All of the similar questions I have seen on Stack Overflow filter the column where the value is null, but this is sub-optimal since it has to find ALL the null values, while I only need to find ONE null value.

I could filter the column where the value is null, and then if the count of that result is greater than 0, I know the column contains a null value. However, as I said, this is suboptimal since it first finds all the null values.
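
Roughly, the naive version I mean looks something like this (just a sketch, assuming df is my DataFrame):

    from pyspark.sql import functions as F

    # For each column, filter to the rows where it is null and count them.
    # This forces Spark to find every null, even though one would be enough.
    cols_with_nulls = [
        c for c in df.columns
        if df.filter(F.col(c).isNull()).count() > 0
    ]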

Is there a more efficient way to do this?

Furthermore, is there any way to do this without looping over all the columns?


Solution

  • Spark's SQL function any can check if any value of a column meets a condition.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    data = [[1, 2, 3], [None, 5, 6], [7, None, 9]]
    df = spark.createDataFrame(data, schema=["col1", "col2", "col3"])

    # One aggregate expression per column: any(col is null) evaluates to true
    # as soon as that column contains at least one null.
    cols = [f"any({col} is null) as {col}_contains_null" for col in df.columns]
    df.selectExpr(cols).show()
    

    Output:

    +------------------+------------------+------------------+
    |col1_contains_null|col2_contains_null|col3_contains_null|
    +------------------+------------------+------------------+
    |              true|              true|             false|
    +------------------+------------------+------------------+
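
    To turn this into the actual list of column names the question asks for, one option (a small sketch building on the df above) is to collect the single aggregated row and keep the columns whose flag is true:

    # Alias each flag with the plain column name, collect the single row,
    # and keep the columns whose flag is true.
    row = df.selectExpr([f"any({col} is null) as {col}" for col in df.columns]).first()
    cols_with_nulls = [c for c in df.columns if row[c]]
    print(cols_with_nulls)  # ['col1', 'col2']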