I am trying to read a CSV file with PySpark that contains a DateType field in the format "dd/MM/yyyy". I have declared the field as DateType() in the schema definition and also passed the "dateFormat" option to the DataFrame CSV reader. However, after the read, the resulting DataFrame has the field as StringType() instead of DateType().
Sample input data:
"school_id","gender","class","doj"
"1","M","9","01/01/2020"
"1","M","10","01/03/2018"
"1","F","10","01/04/2018"
"2","M","9","01/01/2019"
"2","F","10","01/01/2018"
My code:
from pyspark.sql.types import StructField, StructType, StringType, DateType
school_students_schema = StructType([
    StructField("school_id", StringType(), True),
    StructField("gender", StringType(), True),
    StructField("class", StringType(), True),
    StructField("doj", DateType(), True),
])
school_students_df = spark.read.format("csv") \
    .option("header", True) \
    .option("schema", school_students_schema) \
    .option("dateFormat", "dd/MM/yyyy") \
    .load("/user/test/school_students.csv")
school_students_df.printSchema()
Actual output after running the above (the doj column is parsed as string rather than the specified DateType, and no exception is raised):
root
|-- school_id: string (nullable = true)
|-- gender: string (nullable = true)
|-- class: string (nullable = true)
|-- doj: string (nullable = true)
Expected output:
root
|-- school_id: string (nullable = true)
|-- gender: string (nullable = true)
|-- class: string (nullable = true)
|-- doj: date (nullable = true)
Runtime environment
Databricks Community Edition
7.3 LTS (includes Apache Spark 3.0.1, Scala 2.12)
Requesting your help to understand why doj is read as StringType despite the schema and the dateFormat option.
You should use
.schema(school_students_schema)
instead of
.option("schema", school_students_schema)
(There is no "schema" key in the list of available CSV options; unrecognized options are silently ignored, so your schema never reaches the reader and every column defaults to string.)