I was reading a Spark DataFrame with the options below:
testDF = spark.read.format("parquet").option("header", "true") \
.option("mergeSchema", "true").option("inferSchema", "true").load("folderPath/*/*")
However, this fails because one of the columns (Date) is of type timestamp in some source files and of type string in others. I don't have control over the data producers, so I wanted to know how I can handle this while processing.
The challenge is that it's randomly either timestamp or string across files.
Thanks.
Since you're using Spark to read Parquet files, one advantage is the schema-on-read approach: you can declare the schema yourself at read time instead of letting Spark infer it. For example:
from pyspark.sql import types

schema = types.StructType([
    types.StructField('date', types.TimestampType()),
    ...  # declaration of other columns
])
testDF = spark.read.format("parquet")\
.option('mergeSchema', 'true')\
.schema(schema=schema)\
.load("folderPath/*/*")