This might be a dumb question, but is there any difference between manually specifying which partitions of a parquet dataset to read, as opposed to loading the whole thing and then filtering on the partition column?
For Example:
I have a parquet dataset that is partitioned by DATE. If I want the last 10 days, I can typically do the following (assuming today is 2020-10-26):
from pyspark.sql.functions import col

df = spark.read.parquet("s3://bucket/path/file.parquet") \
    .filter(col('DATE') > '2020-10-15')
Or I can use the S3 filesystem to load only the required partitions into the Spark DataFrame, like so:
import s3fs

inpath = "s3://bucket/path/file.parquet/"
fs = s3fs.S3FileSystem()
# one entry per DATE=... partition directory; sort so the last 10 are the most recent dates
datepath = sorted(fs.ls(inpath))

dates = []
for a in range(len(datepath) - 10, len(datepath)):
    path = "s3://" + datepath[a] + "/"
    dates.append(path)

df = spark.read.option("basePath", inpath).parquet(*dates)
My reasoning for doing it the elaborate way in the second method was that it avoids loading the entire parquet dataset, with all of its dates, into memory and then filtering. I wanted to find out if my assumption is true.
Please advise. Thank you.
You can do both, but Spark predicate push-down has been around for a while and makes your life easier, e.g. partition pruning and the use of parquet statistics such as min/max. See https://db-blog.web.cern.ch/blog/luca-canali/2017-06-diving-spark-and-parquet-workloads-example from 2017; it also applies to pyspark.
Not everything can be pushed down, but operators such as "<, <=, >, >=" can, as that link from 2017 shows. You can use .explain to check whether the optimizer does or does not apply predicate push-down.
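As a minimal sketch (reusing the same s3://bucket/path/file.parquet layout and DATE filter from the question), you can inspect the physical plan and look for the partition and parquet filters in the FileScan node:

from pyspark.sql.functions import col

df = spark.read.parquet("s3://bucket/path/file.parquet") \
    .filter(col('DATE') > '2020-10-15')

# In the FileScan node of the printed plan:
#   - "PartitionFilters" shows the DATE predicate used for partition pruning,
#     so only the matching DATE=... directories are listed and read from S3.
#   - "PushedFilters" shows predicates on non-partition columns that can be
#     evaluated against parquet min/max statistics.
df.explain(True)

If the DATE predicate appears under PartitionFilters, the non-matching directories are never scanned, which is effectively what the manual ls approach in the question does by hand.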