Tags: scala, apache-spark, databricks, spark-csv

Spark 2.4 CSV Load Issue with option "nullvalue"


We were using Spark 2.3 before, now we're on 2.4:

Spark version 2.4.0
Using Scala version 2.11.12 (OpenJDK 64-Bit Server VM, Java 1.8.0_212)

We had a piece of code running in production that converted CSV files to Parquet format. One of the options we set for the CSV load is option("nullValue", null). Something is wrong with how it works in Spark 2.4.

Here's an example to show the issue.

  1. Let's create the following /tmp/test.csv file:
C0,C1,C2,C3,C4,C5
1,"1234",0.00,"","D",0.00
2,"",0.00,"","D",0.00
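If you prefer not to leave the shell, the same file can be written with plain JDK file APIs (a small convenience sketch; no Spark involved, and the path matches the /tmp/test.csv used below):

```scala
import java.nio.file.{Files, Paths}
import java.nio.charset.StandardCharsets

// Write the sample CSV from this question to /tmp/test.csv
val csv =
  """C0,C1,C2,C3,C4,C5
    |1,"1234",0.00,"","D",0.00
    |2,"",0.00,"","D",0.00
    |""".stripMargin

Files.write(Paths.get("/tmp/test.csv"), csv.getBytes(StandardCharsets.UTF_8))
```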

  2. Now, if we load it in spark-shell:
scala> val data1 = spark.read.option("header", "true").option("inferSchema", "true").option("treatEmptyValuesAsNulls","true").option("nullValue", null).csv("file:///tmp/test.csv")

we get a row of nulls:
scala> data1.show
+----+----+----+----+----+----+
| C0| C1| C2| C3| C4| C5|
+----+----+----+----+----+----+
| 1|1234| 0.0| | D| 0.0|
|null|null|null|null|null|null|
+----+----+----+----+----+----+

  3. If we additionally change the CSV slightly (replacing the empty string in C3 of the last row with "1"):
C0,C1,C2,C3,C4,C5
1,"1234",0.00,"","D",0.00
2,"",0.00,"1","D",0.00

the result is even worse:

scala> val data2 = spark.read.option("header", "true").option("inferSchema", "true").option("treatEmptyValuesAsNulls","true").option("nullValue", null).csv("file:///tmp/test.csv")

scala> data2.show
+----+----+----+----+----+----+
| C0| C1| C2| C3| C4| C5|
+----+----+----+----+----+----+
|null|null|null|null|null|null|
|null|null|null|null|null|null|
+----+----+----+----+----+----+

Is this a bug in the new Spark 2.4.0? Has anybody faced a similar issue?


Solution

  • The Spark option emptyValue solved the issue:

    val data2 = spark.read.option("header", "true")
         .option("inferSchema", "true")
         .option("treatEmptyValuesAsNulls", "true")
         .option("nullValue", null)
         .option("emptyValue", null)   // emptyValue was added in Spark 2.4
         .csv("file:///tmp/test.csv")
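The reason emptyValue helps is that Spark 2.4 matches an empty quoted field ("") and a null marker against two separate options (emptyValue and nullValue respectively), so both need to be set consistently. A rough sketch of that token-mapping idea in plain Scala (this is NOT Spark's actual parser code; the function and its behaviour are illustrative only):

```scala
// Simplified illustration of how a CSV reader might map a raw token to a
// value, given separate `nullValue` and `emptyValue` settings. This is not
// Spark's implementation -- just a sketch of the idea behind the two options.
def mapToken(raw: String, nullValue: String, emptyValue: String): Option[String] =
  if (raw == nullValue) None        // token matches the null marker -> null
  else if (raw == emptyValue) None  // token matches the empty marker -> null
  else Some(raw)                    // otherwise keep the value as-is

// When both markers cover the empty string, empty fields become null
// while real values are kept:
mapToken("", "", "")      // -> None
mapToken("1234", "", "")  // -> Some("1234")
```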