apache-sparkdatepysparkschemaspark-csv

DateType column read as StringType from CSV file even when appropriate schema provided


I am trying to read a CSV file using PySpark containing a DateType field in the format "dd/MM/yyyy". I have specified the field as DateType() in schema definition and also provided the option "dateFormat" in DataFrame CSV reader. However, the output dataframe after read is having the field as StringType() instead of DateType().

Sample input data:

"school_id","gender","class","doj"
"1","M","9","01/01/2020" 
"1","M","10","01/03/2018"
"1","F","10","01/04/2018"
"2","M","9","01/01/2019"
"2","F","10","01/01/2018"

My code:

from pyspark.sql.types import StructField, StructType, StringType, DateType
school_students_schema = StructType([StructField("school_id", StringType(),True) ,\
                             StructField("gender", StringType(),True) ,\
                             StructField("class", StringType(),True) ,\
                             StructField("doj", DateType(),True)    
                            ])

school_students_df = spark.read.format("csv") \
                           .option("header", True) \
                           .option("schema", school_students_schema) \
                           .option("dateFormat", "dd/MM/yyyy") \
                           .load("/user/test/school_students.csv")
school_students_df.printSchema()

Actual output after running the above (column doj parsed as string instead of the specified DateType and dateFormat without any exception).

root
|-- school_id: string (nullable = true)
|-- gender: string (nullable = true)
|-- class: string (nullable = true)
|-- doj: string (nullable = true)

Expected output:

root
|-- school_id: string (nullable = true)
|-- gender: string (nullable = true)
|-- class: string (nullable = true)
|-- doj: date (nullable = true)

Runtime environment

Databricks Community Edition
7.3 LTS (includes Apache Spark 3.0.1, Scala 2.12)

Requesting your help to understand:

  1. Why is the column being parsed as StringType even though DateType is mentioned in schema?
  2. What needs to be done in the code so that the column doj is parsed as DateType()?

Solution

  • You should use

    .schema(school_students_schema)
    

    instead of

    .option("schema", school_students_schema)
    

    (There is no "schema" in the available option list.)