apache-spark, dataframe, apache-spark-2.0

Get the latest schema for a partitioned Parquet DataFrame


We are starting to collect data in a Hadoop cluster using Spark and Parquet files, but it is very difficult for us to guarantee that the Parquet schema will not change in the future. We are trying to find the best way to read the Parquet files even if the schema changes.

The rule we want to implement is that the latest Parquet file will be our reference.

We ran different tests, including:

spark.read.parquet("test").filter("year=2017 and month=10 and day>=15")
spark.read.parquet("test/year=2017/month=10/day=17", "test/year=2017/month=10/day=16", "test/year=2017/month=10/day=15")
// also tested with the paths in a different order
spark.read.parquet("test/year=2017/month=10/day={15,16,17}")

etc...

and the schema retained by the read method is always the oldest one (i.e. October 15th's schema).
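For example, a quick way to see which schema was picked (an illustrative check using the same example paths, not one of the tests above) is to compare it against the newest day's schema:

    # Illustrative check: compare the schema inferred for the multi-day read
    # with the schema of the newest day.
    inferred_schema = (spark.read.parquet("test")
                            .filter("year=2017 and month=10 and day>=15")
                            .schema)
    newest_schema = spark.read.parquet("test/year=2017/month=10/day=17").schema
    print(inferred_schema == newest_schema)  # False here: the oldest schema wins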

Does someone know how to get the latest schema (i.e. October 17th's)?

Of course spark.read.option("mergeSchema", "true") does not work, because it does not remove a column if we dropped one in the latest Parquet files. We tested over 3 days here, but it could potentially span a very large range of partitions.
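For reference, the mergeSchema attempt looked roughly like the sketch below (same example paths as above); the merged result is the union of all columns seen across the files, so a column dropped on October 17th still shows up:

    # Sketch of the mergeSchema read: the resulting schema is the union of all
    # columns across the partitions, so dropped columns are still present.
    df_merged = (spark.read
                      .option("mergeSchema", "true")
                      .parquet("test")
                      .filter("year=2017 and month=10 and day>=15"))
    df_merged.printSchema()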

Thanks in advance

Regards


Solution

  • I am writing this in PySpark. It should be applicable to other languages.

    # Take the schema from the newest partition only...
    schema = spark.read.parquet("test/year=2017/month=10/day=17/").schema
    # ...and force it on the full read, so dropped columns are simply not read.
    df = spark.read.schema(schema).parquet("test/*/*/*/")
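
    If you don't want to hardcode the newest partition path, one possible extension (a sketch only; it assumes the year=/month=/day= layout from the question and that the partition values sort numerically) is to discover the latest partition from the partition columns first, then reuse the same trick:

    # Sketch: find the newest day partition instead of hardcoding it.
    # Partition columns come from the directory names, so they are
    # available no matter which file's schema Spark inferred.
    latest = (spark.read.parquet("test")
                   .select("year", "month", "day")
                   .distinct()
                   .orderBy("year", "month", "day", ascending=False)
                   .first())

    latest_path = "test/year={}/month={}/day={}".format(
        latest.year, latest.month, latest.day)

    # Same approach as above, with the discovered path.
    schema = spark.read.parquet(latest_path).schema
    df = spark.read.schema(schema).parquet("test/*/*/*/")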