When writing a file to a Data Lake, specifically through Databricks, we have the option of specifying a partition column. This will save the data in separate folders (partitions) based on the values available in that column of the dataset.
At the same time, when we talk about Spark optimizations, we talk about partitioning the data.
What is the difference between these two? Is there a relationship between them?
From what I can understand, saving the data in the distributed file system in partitions will help when we want to read in only a certain portion of the data (based on the partition column, of course). For example, if we partition by color and we are only interested in the 'red' records, we can read in only that partition and ignore the rest. This results in some level of optimization when reading in the data.
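For concreteness, here is a minimal PySpark sketch of that idea; the paths and the 'color' column are hypothetical:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()

# Hypothetical source dataset that contains a 'color' column.
df = spark.read.parquet("/mnt/datalake/items")

# Writing with partitionBy creates one folder per distinct value,
# e.g. .../items_by_color/color=red/, .../items_by_color/color=blue/, ...
df.write.partitionBy("color").mode("overwrite").parquet("/mnt/datalake/items_by_color")

# Filtering on the partition column lets Spark skip every folder
# except color=red (partition pruning).
red_only = spark.read.parquet("/mnt/datalake/items_by_color").filter(col("color") == "red")
```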
Then, for Spark to perform parallel processing, this 'red' partition (from the file system) will be divided into partitions (Spark) based on the number of cores available in the cluster?
Is this correct? How does Spark decide the number of partitions? Is this number always equal to the number of cores in the cluster?
What is the idea of re-partitioning? I believe this involves the use of the coalesce() and repartition() functions. What causes Spark to re-partition data?
Saving into partitions (folders) and Spark partitions both partition data, but that's about where their similarities end.
If you will often query the data by filtering on a particular column, then it makes sense to save the data in partitions (folders) on that column. If you are going to roll data up across columns, then it's actually better not to partition (folder) the data by a column.
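If you want to check that a filter is really being pushed down to the folder level, one quick way (path hypothetical) is to look at the physical plan:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# If pruning works, the physical plan should list 'color' under PartitionFilters,
# i.e. only the color=red folder is scanned.
spark.read.parquet("/mnt/datalake/items_by_color").filter("color = 'red'").explain()
```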
Spark partitions are generally an internal detail of how Spark splits your data for processing. Said another way: Spark partitions are the measure of parallelism in your data. Ideally, your most expensive action divides the work evenly across partitions so that each core is busy at the same time and no executor is lagging behind (a lagging executor is often a sign of skew).
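A quick way to see both numbers side by side, sketched here with a hypothetical path, is to compare the DataFrame's partition count with the parallelism the cluster offers:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.read.parquet("/mnt/datalake/items_by_color")  # hypothetical path

# How many Spark partitions the DataFrame is currently split into.
print(df.rdd.getNumPartitions())

# Total cores Spark sees across the executors (its default parallelism).
print(spark.sparkContext.defaultParallelism)
```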
200 is the default number of shuffle partitions in Spark. Ideally you want to set this, as I said, to some multiple of the number of cores that will be working on your data. In general, don't mess with the number of partitions until you need to tune performance. Generally speaking, more tuning can be accomplished by revisiting your algorithms and the Spark features you use than by tuning the number of partitions (but there are edge cases, like skew, where you really can get better performance). If every second counts, then yeah, maybe look at partitions after you have already tuned everything else.
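If you do decide to tune it, one common approach (the multiplier here is just an illustration) is to derive the setting from the core count:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# spark.sql.shuffle.partitions controls how many partitions come out of
# wide operations (joins, groupBy, ...). The default is 200.
cores = spark.sparkContext.defaultParallelism
spark.conf.set("spark.sql.shuffle.partitions", cores * 2)  # a small multiple of the core count
```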
Repartitioning can be done at any time to speed up calculations. You can specify a smaller number (repartition/coalesce) and it will collapse your partitions: coalesce merges existing partitions without a full shuffle, while repartition performs a full shuffle to redistribute the data. You might do this to write out one file to disk, or to condense your data to maximize throughput. (In the ideal world that we don't live in, you want each partition to be just smaller than the block size of the file system you're on, so that you maximize each block read. In HDFS this is commonly 128 MB, but you can tune it to your needs.)
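For example, the "write out one file" case typically looks like this (paths hypothetical; only do this when the data fits comfortably in a single task):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.read.parquet("/mnt/datalake/items_by_color")  # hypothetical path

# coalesce(1) collapses everything into a single partition without a full
# shuffle, so the write below produces one output file.
df.coalesce(1).write.mode("overwrite").parquet("/mnt/datalake/single_file_output")
```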
You may also wish to grow the number of partitions (using repartition), or simply redistribute the data in a manner that is advantageous. repartition accepts the column(s) you want to partition on, so you can control how the data is distributed. You might use this to get around data skew, but Spark can now handle that for you if you tell it to (via adaptive query execution). As discussed above, you may wish to grow/shrink the number of partitions to match a multiple of your cores; for shuffles, the relevant setting is spark.sql.shuffle.partitions, which is briefly explained in the Spark documentation.
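As a sketch of both points (the column name and partition count are placeholders, and the adaptive settings need Spark 3.x):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.read.parquet("/mnt/datalake/items_by_color")  # hypothetical path

# Hash-redistribute into 64 partitions keyed on 'customer_id', so rows with
# the same key land in the same partition before an expensive join or groupBy.
balanced = df.repartition(64, "customer_id")

# Since Spark 3.0, adaptive query execution can split skewed shuffle
# partitions for you instead of you hand-tuning the counts.
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")
```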