I am working with a data frame of 100m rows that I would like to partition into 100 Parquet files of 1m rows each. I do not want to partition on any particular column value: I just want 100 chunks of 1m rows.
I know that this is possible by adding a "dummy" column and passing that to partition_cols:
import numpy as np

data_size = len(data)
partition_size = 1_000_000
n_partitions, remainder = divmod(data_size, partition_size)

# Label each block of 1m rows with a partition number; any leftover
# rows go into one final, smaller partition.
data["partition_id"] = np.concatenate([
    np.repeat(np.arange(n_partitions), partition_size),
    np.repeat(n_partitions, remainder),
])
data.to_parquet("out", partition_cols=["partition_id"])
But it feels wasteful to write an extra 100m 64-bit integers!
Parquet files are also typically compressed, very often using the Snappy algorithm (occasionally GZip or Brotli). And these are long runs of identical integers, so in principle they should compress extremely well.
However, I don't know how the Parquet file format and underlying Arrow array format interact with various compression algorithms. Assuming that I'm using Snappy, will my millions of extra integers be compressed to a handful of bytes? Or will this partition_id
column actually inflate the size of my dataset by some appreciable amount?
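For reference, this is roughly how I would check how much space the extra column actually takes on disk, by looking at the per-column compressed sizes in the Parquet metadata with pyarrow (a small sketch on sample data rather than my real 100m rows; file and column names are just placeholders):

import numpy as np
import pandas as pd
import pyarrow.parquet as pq

# A small sample with the same kind of constant dummy column, just to see
# how many compressed bytes the id column ends up occupying.
sample = pd.DataFrame({"a": np.random.randn(1_000_000)})
sample["partition_id"] = 0
sample.to_parquet("sample.parquet", compression="snappy")

# Per-column compressed/uncompressed sizes for the first row group
# (loop over metadata.num_row_groups for larger files).
rg = pq.ParquetFile("sample.parquet").metadata.row_group(0)
for i in range(rg.num_columns):
    col = rg.column(i)
    print(col.path_in_schema, col.total_compressed_size, col.total_uncompressed_size)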
To answer your actual question: yes, since there are only 100 distinct values they compress very well; Parquet writers dictionary-encode a column like this and run-length-encode the long runs of identical values, so the extra column costs almost nothing. As you can see below, there is no significant influence on the overall size. Partitioning introduces some overhead, of course, but that happens with and without the ids.
(This is R, as I am more familiar with it, but it uses the same Arrow C++ backend, so the results are the same.)
library(arrow)
library(fs)
# create some dummy data
n <- 100000000
data <- data.frame(a = rnorm(n), b = rnorm(n))
partition_id <- rep(1:100, each = 1000000)
data_ids <- cbind(data, partition_id)
# Save data: a plain file, a file with the extra id column, a dataset
# partitioned on the id, and a dataset split purely by row count
file <- file_temp("data", ext = ".parquet")
file_with_id <- file_temp("data", ext = ".parquet")
path <- path_temp("data_partitioned")
path_nrow <- path_temp("data_partitioned_nrow")
write_parquet(data, file, compression = "snappy")
# (plus one partition's worth of ids on their own, for reference)
just_ids <- file_temp("just_ids", ext = ".parquet")
write_parquet(data.frame(partition_id = rep(1, 1000000)), just_ids, compression = "snappy")
write_parquet(data_ids, file_with_id, compression = "snappy")
write_dataset(data_ids, path, format = "parquet", partitioning = "partition_id")
write_dataset(data, path_nrow, format = "parquet", max_rows_per_file = 1000000)
file_size(file)
# 1.49G
file_size(file_with_id)
# 1.49G
dir_ls(path, recurse = TRUE) |> file_size() |> sum()
# 1.84G
dir_ls(path_nrow, recurse = TRUE) |> file_size() |> sum()
# 1.84G
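For completeness: if you want the last variant (splitting purely by row count, no dummy column at all) from Python, the same option is exposed through pyarrow.dataset.write_dataset. A minimal sketch, assuming a reasonably recent pyarrow and that data is your pandas data frame (the output directory name is a placeholder):

import pyarrow as pa
import pyarrow.dataset as ds

table = pa.Table.from_pandas(data)
ds.write_dataset(
    table,
    "out_nrow",                     # output directory
    format="parquet",
    max_rows_per_file=1_000_000,
    max_rows_per_group=1_000_000,   # row groups must not exceed the per-file limit
)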