parquet, pyarrow, snappy

Snappy compression and runs of repeated values in a Parquet column


I am working with a data frame of 100m rows that I would like to partition into 100 Parquet files of 1m rows each. I do not want to partition on any particular column value: I just want 100 chunks of 1m rows.

I know that this is possible by adding a "dummy" column, and passing that to partition_cols:

import numpy as np

data_size = len(data)
partition_size = 1_000_000
n_partitions, remainder = divmod(data_size, partition_size)
# label each row with the index of the 1m-row chunk it belongs to;
# any leftover rows go into one final, smaller partition
data["partition_id"] = np.concatenate([
    np.repeat(list(range(n_partitions)), partition_size),
    np.repeat(n_partitions, remainder),
])
data.to_parquet("out", partition_cols=["partition_id"])

But it feels wasteful to write an extra 100m 64-bit integers!

Parquet files are also typically compressed, very often with the Snappy algorithm (occasionally gzip or Brotli). These would be long runs of identical integers, so in principle they should compress extremely well.

However, I don't know how the Parquet file format and underlying Arrow array format interact with various compression algorithms. Assuming that I'm using Snappy, will my millions of extra integers be compressed to a handful of bytes? Or will this partition_id column actually inflate the size of my dataset by some appreciable amount?
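
One way I could sanity-check this on a small sample is to write a file with a constant id column and inspect the per-column chunk sizes that Parquet records in its own metadata (a sketch assuming pyarrow is installed; "sample.parquet" is just a scratch file), but I would still like to understand the general behavior:

import numpy as np
import pandas as pd
import pyarrow.parquet as pq

# incompressible noise next to one long run of a single id value
sample = pd.DataFrame({
    "value": np.random.rand(1_000_000),
    "partition_id": np.zeros(1_000_000, dtype="int64"),
})
sample.to_parquet("sample.parquet", compression="snappy")

# Parquet stores compressed and uncompressed sizes per column chunk
meta = pq.ParquetFile("sample.parquet").metadata
for rg in range(meta.num_row_groups):
    for col in range(meta.num_columns):
        chunk = meta.row_group(rg).column(col)
        print(chunk.path_in_schema,
              chunk.total_compressed_size,
              chunk.total_uncompressed_size)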


Solution

  • To answer your actual question: yes. Since there are only 100 distinct values, they compress very well, and as you can see below there is no significant influence on the overall size. Partitioning itself introduces some overhead, of course, but that happens with and without the ids.

    (This is R, as I am more familiar with it, but it uses the same C++ Arrow backend, so the results are the same.)

    library(arrow)
    library(fs)
    
    # create some dummy data
    n <- 100000000
    
    data <- data.frame(a = rnorm(n), b = rnorm(n))
    partition_id <- rep(1:100, each = 1000000)
    data_ids <- cbind(data, partition_id)
    
    # output locations
    file <- file_temp("data", ext = ".parquet")
    file_with_id <- file_temp("data", ext = ".parquet")
    path <- path_temp("data_partitioned")
    path_nrow <- path_temp("data_partitioned_nrow")
    
    # single file, without and with the id column
    write_parquet(data, file, compression = "snappy")
    write_parquet(data_ids, file_with_id, compression = "snappy")
    
    # partitioned datasets: by id column vs. purely by row count
    write_dataset(data_ids, path, format = "parquet", partitioning = "partition_id")
    write_dataset(data, path_nrow, format = "parquet", max_rows_per_file = 1000000)
    
    
    file_size(file)
    # 1.49G
    file_size(file_with_id)
    # 1.49G
    dir_ls(path, recurse = TRUE) |> file_size() |> sum()
    # 1.84G
    dir_ls(path_nrow, recurse = TRUE) |> file_size() |> sum()
    # 1.84G