We were using Spark 2.3 before; now we're on 2.4:
Spark version 2.4.0
Using Scala version 2.11.12 (OpenJDK 64-Bit Server VM, Java 1.8.0_212)
We had a piece of code running in production that converted CSV files to Parquet. One of the options we set on the CSV load is option("nullValue", null). Something is wrong with how it works in Spark 2.4.
Here's an example that reproduces the issue. Given a file /tmp/test.csv with these contents:
C0,C1,C2,C3,C4,C5
1,"1234",0.00,"","D",0.00
2,"",0.00,"","D",0.00
scala> val data1 = spark.read.option("header", "true").option("inferSchema", "true").option("treatEmptyValuesAsNulls","true").option("nullValue", null).csv("file:///tmp/test.csv")
the second row comes back entirely null:
scala> data1.show
+----+----+----+----+----+----+
| C0| C1| C2| C3| C4| C5|
+----+----+----+----+----+----+
| 1|1234| 0.0| | D| 0.0|
|null|null|null|null|null|null|
+----+----+----+----+----+----+
Now change the file so that C3 in the second row is "1" instead of an empty string:

C0,C1,C2,C3,C4,C5
1,"1234",0.00,"","D",0.00
2,"",0.00,"1","D",0.00

With this file the result is even worse:
scala> val data2 = spark.read.option("header", "true").option("inferSchema", "true").option("treatEmptyValuesAsNulls","true").option("nullValue", null).csv("file:///tmp/test.csv")
scala> data2.show
+----+----+----+----+----+----+
| C0| C1| C2| C3| C4| C5|
+----+----+----+----+----+----+
|null|null|null|null|null|null|
|null|null|null|null|null|null|
+----+----+----+----+----+----+
Is this a bug in Spark 2.4.0? Has anybody faced a similar issue?
The Spark option emptyValue solved the issue:
val data2 = spark.read.option("header", "true")
.option("inferSchema", "true")
.option("treatEmptyValuesAsNulls","true")
    .option("nullValue", null)
    .option("emptyValue", null)
.csv("file:///tmp/test.csv")
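Putting it back into the original CSV-to-Parquet job, a minimal sketch looks like the following. This assumes Spark 2.4 and the sample /tmp/test.csv from above; the output path test.parquet is hypothetical, and the only change from the original code is the added emptyValue option:

```scala
// Sketch of the corrected CSV-to-Parquet job (assumes Spark 2.4).
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("csv-to-parquet")
  .getOrCreate()

val df = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .option("treatEmptyValuesAsNulls", "true")
  // In Spark 2.4 empty fields are matched against emptyValue as well
  // as nullValue, so both need to be set to restore the 2.3 behaviour.
  .option("nullValue", null)
  .option("emptyValue", null)
  .csv("file:///tmp/test.csv")

// Hypothetical output path for illustration.
df.write.mode("overwrite").parquet("file:///tmp/test.parquet")
```

With both options set, only genuinely empty fields become null and the populated columns in each row survive the load intact.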