repartition() creates partitions in memory and is used as a read operation; partitionBy() creates partitions on disk and is used as a write operation. So when running
articles.repartition(1).write.saveAsTable("articles_table", format='orc', mode='overwrite')
why does this operation create only one file, and how is this different from partitionBy()?
partitionBy indeed has an effect on how your files will look on disk, and it is indeed used when writing a file (it is a method of the DataFrameWriter class).
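The two operations live on different objects, which is easy to see in code (a minimal sketch; df stands for any DataFrame):
df.repartition(4)              # method of DataFrame: changes the number of in-memory partitions
df.write.partitionBy("colA")   # method of DataFrameWriter: controls the directory layout on disk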
That, however, does not mean that repartition has no effect at all on what will be written to disk.
Let's take the following example:
df = spark.createDataFrame([
(1,2,3),
(2,2,3),
(3,20,300),
(1,24,299),
(5,26,312),
(5,28,322),
(5,9,2)
], ["colA", "colB", "colC"])
df.write.partitionBy("colA").parquet("using_partitionBy.parquet")
df.repartition(4).write.parquet("using_repartition.parquet")
Here we create a simple DataFrame and write it out using two methods:
partitionBy
The output file structure on disk looks like this:
tree using_partitionBy.parquet/
using_partitionBy.parquet/
├── colA=1
│ ├── part-00000-17e47d08-2bbe-4ad6-809e-c466b396de4e.c000.snappy.parquet
│ └── part-00002-17e47d08-2bbe-4ad6-809e-c466b396de4e.c000.snappy.parquet
├── colA=2
│ └── part-00001-17e47d08-2bbe-4ad6-809e-c466b396de4e.c000.snappy.parquet
├── colA=3
│ └── part-00001-17e47d08-2bbe-4ad6-809e-c466b396de4e.c000.snappy.parquet
├── colA=5
│ ├── part-00002-17e47d08-2bbe-4ad6-809e-c466b396de4e.c000.snappy.parquet
│ └── part-00003-17e47d08-2bbe-4ad6-809e-c466b396de4e.c000.snappy.parquet
└── _SUCCESS
We see that this created 6 "subfiles" in 4 "subdirectories". Information about the data values (like colA=1) is actually stored on disk, which can bring big improvements to subsequent queries that need to read this data. Imagine you need to read all the rows where colA=1: that becomes a trivial task, since the other subdirectories can simply be ignored.
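For example, a later read that filters on the partition column can skip the other subdirectories entirely; a minimal sketch, reusing the path written above:
only_ones = spark.read.parquet("using_partitionBy.parquet").filter("colA = 1")
only_ones.show()   # only the files under colA=1 are scanned (partition pruning)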
repartition(4)
The output file structure on disk looks like this:
tree using_repartition.parquet/
using_repartition.parquet/
├── part-00000-6dafa06d-cea6-4af0-ba0d-9c761de0fd50-c000.snappy.parquet
├── part-00001-6dafa06d-cea6-4af0-ba0d-9c761de0fd50-c000.snappy.parquet
├── part-00002-6dafa06d-cea6-4af0-ba0d-9c761de0fd50-c000.snappy.parquet
├── part-00003-6dafa06d-cea6-4af0-ba0d-9c761de0fd50-c000.snappy.parquet
└── _SUCCESS
We see that 4 "subfiles" were created and NO "subdirectories" were made. These "subfiles" actually represent your partitions inside Spark. Since you typically work with very big data in Spark, all of your data has to be partitioned in some way.
Each partition is processed by one task, which can be picked up by one core of your cluster. Once a core has picked up a task and finished all the necessary processing, it writes the output to disk as one of these "subfiles". When it has finished writing that "subfile", the core is ready to pick up another partition.
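You can see this mapping from partitions to output files by checking the number of partitions before writing; a quick sketch using the df from the example:
df.rdd.getNumPartitions()                  # partitioning of the original DataFrame
df.repartition(4).rdd.getNumPartitions()   # 4, matching the 4 "subfiles" written above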
partitionBy and repartition
This is a bit opinionated and surely not exhaustive, but it might give you some insight into what to use.
partitionBy and repartition can be used for different goals:
Use partitionBy if:
- you want subsequent reads of the written data to benefit from the on-disk layout: queries that filter on the partition column (like colA=1 above) can skip all other subdirectories.
Use repartition if:
- you want to control how many partitions, and therefore how many tasks and output files, Spark uses when processing and writing the data;
- partitionBy on any column would have much too high a cardinality (imagine time series data on sensors); see the sketch after this list.
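As a rough sketch of that last point (the sensor data, column names, and path here are made up for illustration): partitionBy on a high-cardinality column like a sensor id would create one subdirectory per value, while repartition just bounds the number of output files:
# Hypothetical time series data with many distinct sensor ids.
sensor_df = spark.createDataFrame(
    [(t, t % 10000, float(t)) for t in range(100000)],
    ["ts", "sensor_id", "reading"]
)
# sensor_df.write.partitionBy("sensor_id").parquet(...) would create ~10000 subdirectories.
# Repartitioning to a fixed number keeps the output manageable instead:
sensor_df.repartition(8).write.parquet("sensor_readings.parquet", mode="overwrite")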