I have a requirement where I have a huge dataset of over 2 trillion records. It comes as the result of a join. After the join, I need to aggregate on a column ('id') and get a list of distinct names (collect_set('name')).
Now, while saving the join result in step 1, if I repartition it on the 'id' field, will I get any benefit? i.e. joined_df.repartition('id').write.parquet(path)
If I later read back the repartitioned data, will Spark understand that it is already partitioned on the 'id' field, so that when I group by id the performance is hugely improved?
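For reference, a minimal sketch of the full pipeline I have in mind (joined_df, path, and the spark session are placeholders from above):

from pyspark.sql import functions as F

# Step 1: save the join result, repartitioned by 'id'
joined_df.repartition('id').write.parquet(path)

# Step 2: read it back and aggregate the distinct names per id
result = (
    spark.read.parquet(path)
    .groupBy('id')
    .agg(F.collect_set('name').alias('names'))
)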
If the id column is unique, then partitioning by it just adds a huge overhead, since each partition would contain a single record, so I'll assume that is not the case!
Calling repartition('id') will create partitions based on the id column, but it will not tell Spark that the data is already partitioned by id when it is read back.
If the data for each id can fit in one partition, I'd say you could try to:

- repartition the data so that each id ends up in one partition only.
- since all rows for an id are then in a single partition (logically), you can avoid the extra group by and map the partitions directly (a sketch of the mapping function follows the example).

Example:
joined_df.repartition(1, 'id').write.parquet(path)
...
spark.read.parquet(path).rdd.mapPartitions(FUN).toDF(['id', ...])
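FUN is left as a placeholder above; one possible implementation (collect_names is a hypothetical name, and it assumes each partition holds all rows for the ids it contains) could look like this:

from pyspark.sql import Row

# Because all rows for an id live in the same partition, the distinct names
# per id can be collected locally inside mapPartitions, with no further shuffle.
def collect_names(rows):
    names_by_id = {}
    for row in rows:
        names_by_id.setdefault(row['id'], set()).add(row['name'])
    for key, names in names_by_id.items():
        yield Row(id=key, names=list(names))

result = spark.read.parquet(path).rdd.mapPartitions(collect_names).toDF()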