apache-spark, pyspark, parquet, hadoop-partitioning

Hand selecting parquet partitions vs filtering them in pyspark


This might be a dumb question, but is there any difference between manually specifying the partition directories to read from a parquet dataset, as opposed to loading the whole dataset and then filtering it?

For example: I have a parquet file that is partitioned by DATE. If I want the last 10 days, I can typically do the following (assuming today is 2020-10-26):

from pyspark.sql.functions import col

df = spark.read.parquet("s3://bucket/path/file.parquet") \
    .filter(col('DATE') > '2020-10-15')

Or I can use the S3 file system to load only the required partitions into the Spark DataFrame, like so:

import s3fs

inpath = "s3://bucket/path/file.parquet/"
fs = s3fs.S3FileSystem()
datepath = fs.ls(inpath)  # one entry per DATE=... partition directory
dates = []
for a in range(len(datepath) - 10, len(datepath)):
    path = "s3://" + datepath[a] + "/"
    dates.append(path)
df = spark.read.option("basePath", inpath).parquet(*dates)

The reason for doing it the elaborate way in the second method (in my thinking) was to avoid loading the entire parquet file, with all the dates, into memory and then filtering it. I wanted to find out whether my assumption is true.

Please advise. Thank You


Solution

  • You can do both, but Spark predicate push-down has been around for a while and makes your life easier, e.g. partition pruning and the use of parquet statistics such as min/max. See https://db-blog.web.cern.ch/blog/luca-canali/2017-06-diving-spark-and-parquet-workloads-example from 2017; it also applies to pyspark.

    Not everything can be pushed down, but operators such as "<, <=, >, >=" can be, as the link from 2017 shows. You can use .explain to check whether the Optimizer does or does not apply predicate push-down; see the sketch below.
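
    For instance, a minimal sketch of that check, reusing the `spark` session and the (illustrative) bucket path from the question, could look like this:

    from pyspark.sql.functions import col

    # Read the partitioned dataset and filter on the partition column;
    # Spark should prune partitions rather than scanning every DATE directory.
    df = spark.read.parquet("s3://bucket/path/file.parquet") \
        .filter(col('DATE') > '2020-10-15')

    # Print the plans; in the physical plan, look at the file scan node.
    df.explain(True)

    If the date predicate shows up under PartitionFilters in the scan node rather than only as a post-scan Filter, Spark is reading just the matching partitions.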