I use Spark to perform data transformations that I load into Redshift. Redshift does not support NaN values, so I need to replace all occurrences of NaN with NULL.
I tried something like this:
some_table = sql('SELECT * FROM some_table')
some_table = some_table.na.fill(None)
But I got the following error:
ValueError: value should be a float, int, long, string, bool or dict
So it seems like na.fill() doesn't support None. I specifically need to replace with NULL, not some other value, like 0.
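(While Googling I also came across Spark's nanvl() function, which returns its first argument unless it is NaN, in which case it returns the second. As a rough sketch, assuming a single floating-point column named b, something like this should map NaN to NULL for that one column:)
import pyspark.sql.functions as F
# nanvl(col1, col2) returns col1 unless it is NaN, in which case it returns col2;
# casting the NULL literal to double keeps both arguments floating-point
df = df.withColumn("b", F.nanvl(F.col("b"), F.lit(None).cast("double")))
That only handles one column at a time, though.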
I finally found the answer after Googling around a bit.
import pyspark.sql.functions as F
df = spark.createDataFrame([(1, float('nan')), (None, 1.0)], ("a", "b"))
df.show()
+----+---+
| a| b|
+----+---+
| 1|NaN|
|null|1.0|
+----+---+
for column in df.columns:
    df = (
        # use isnan() to find NaN values and replace them with a NULL literal
        df.withColumn(column,
                      F.when(F.isnan(F.col(column)), F.lit(None))
                       .otherwise(F.col(column)))
    )
sqlContext.registerDataFrameAsTable(df, "df2")
sql('select * from df2').show()
+----+----+
| a| b|
+----+----+
| 1|null|
|null| 1.0|
+----+----+
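(If you're on Spark 2.x or later, the registration step can presumably also be done through the DataFrame API instead of sqlContext, assuming your session object is named spark:)
# register the DataFrame as a temporary view, then query it through the session
df.createOrReplaceTempView("df2")
spark.sql('select * from df2').show()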
It doesn't use na.fill(), but it accomplishes the same result, so I'm happy.
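One refinement worth considering: NaN can only occur in float and double columns, so the loop can be restricted to those types via df.dtypes. This is a sketch of that idea, not something the solution above strictly requires:
# only rewrite floating-point columns, since NaN can only appear there
for column, dtype in df.dtypes:
    if dtype in ('float', 'double'):
        df = df.withColumn(column,
                           F.when(F.isnan(F.col(column)), F.lit(None))
                            .otherwise(F.col(column)))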