pythonapache-spark-sql

Pyspark replace NaN with NULL


I use Spark to perform data transformations that I load into Redshift. Redshift does not support NaN values, so I need to replace all occurrences of NaN with NULL.

I tried something like this:

some_table = sql('SELECT * FROM some_table')
some_table = some_table.na.fill(None)

But I got the following error:

ValueError: value should be a float, int, long, string, bool or dict

So it seems like na.fill() doesn't support None. I specifically need to replace with NULL, not some other value, like 0.


Solution

  • I finally found the answer after Googling around a bit.

    import pyspark.sql.functions as F
    df = spark.createDataFrame([(1, float('nan')), (None, 1.0)], ("a", "b"))
    df.show()
    
    +----+---+
    |   a|  b|
    +----+---+
    |   1|NaN|
    |null|1.0|
    +----+---+
    
    for column in df.columns:
        df = (
             # usage of "isnan()" function to test and replace values
             df.withColumn(column,
                       F.when(F.isnan(F.col(column)), F.lit(None))
                       .otherwise(F.col(column)))
        )
    
    sqlContext.registerDataFrameAsTable(df, "df2")
    sql('select * from df2').show()
    
    +----+----+
    |   a|   b|
    +----+----+
    |   1|null|
    |null| 1.0|
    +----+----+
    

    It doesn't use na.fill(), but it accomplished the same result, so I'm happy.