I use Spark 2.2.0.
I am reading a CSV file as follows:
val dataFrame = spark.read
  .option("inferSchema", "true")
  .option("header", "true")
  .option("dateFormat", "yyyyMMdd")
  .csv(pathToCSVFile)
There is one date column in this file, and every record has the value 20171001 in that column.
The issue is that Spark infers the type of this column as integer rather than date. When I remove the inferSchema option, the type of that column is string.
There are no null values and no malformed lines in this file.
What is the reason/solution for this issue?
If my understanding of the code is correct, it implies the following order of type inference (earlier types are tried first):
NullType
IntegerType
LongType
DecimalType
DoubleType
TimestampType
BooleanType
StringType
With that, I think the issue is that 20171001 matches IntegerType before TimestampType is even considered (and TimestampType uses the timestampFormat option, not dateFormat).
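This is easy to check empirically. A minimal sketch (the column name date is hypothetical, since the question does not give it): even with timestampFormat matching the data, inference should still stop at IntegerType because 20171001 parses as an Int first.

val inferred = spark.read
  .option("inferSchema", "true")
  .option("header", "true")
  .option("timestampFormat", "yyyyMMdd") // still loses to IntegerType
  .csv(pathToCSVFile)

inferred.printSchema()
// Expected output:
// root
//  |-- date: integer (nullable = true)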
One solution would be to define the schema explicitly and pass it to the schema operator of DataFrameReader, or to let Spark SQL infer the schema and use the cast operator afterwards. I'd choose the former if the number of fields is not high. Both are sketched below.
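A minimal sketch of both approaches, assuming Spark 2.2.0 and a date column hypothetically named date (the actual column name is not given in the question):

import org.apache.spark.sql.types._
import org.apache.spark.sql.functions.{col, to_date}

// 1) Explicit schema: the dateFormat option is honoured when parsing
//    fields declared as DateType.
val schema = StructType(Seq(
  StructField("date", DateType, nullable = true)))

val withSchema = spark.read
  .option("header", "true")
  .option("dateFormat", "yyyyMMdd")
  .schema(schema)
  .csv(pathToCSVFile)

// 2) Infer the schema, then cast: to_date(Column, String) is available
//    since Spark 2.2.0; cast the inferred integer column to string first.
val withCast = spark.read
  .option("inferSchema", "true")
  .option("header", "true")
  .csv(pathToCSVFile)
  .withColumn("date", to_date(col("date").cast("string"), "yyyyMMdd"))

Both should give the column the date type; the explicit schema also skips the extra pass over the data that inferSchema requires.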