Tags: r, parquet, apache-arrow

Error in R package arrow: "TProtocolException: Exceeded size limit" - Is it possible to read the parquet file?


Following this question on SO relating to this error:

Error in `open_dataset()`:
! IOError: Error creating dataset. Could not read schema from 'path/example.parquet'. 
Is this a 'parquet' file?: 
Could not open Parquet input source 'path/example.parquet':
Couldn't deserialize thrift: TProtocolException: Exceeded size limit

I'm wondering if there is any way to still use the parquet file, even though the schema is oversaturated with attributes. This situation typically arises when the script that produced the parquet file has been deleted, or when asking for / producing a cleaner parquet file would take a lot of time.

I already tried the partitioning option, but the error still occurred, no matter how many variables I partitioned on.
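By the partitioning option I mean writing the data as a partitioned dataset with write_dataset(), roughly like the sketch below (illustrative only: the group column and the modulo split are made up, and dt is the same table as in the reprex that follows, after the filter):

library(arrow)
library(data.table)
# Illustrative only: dt is the data.table from the reprex below, after the
# filter that adds the oversized attribute; "group" is a made-up column just
# to have something to partition on.
dt[, group := x %% 10L]
write_dataset(dt, "example_partitioned", format = "parquet", partitioning = "group")
# In my case, opening the partitioned dataset still raised the same
# TProtocolException.
open_dataset("example_partitioned")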

Reprex:

library(arrow)
library(data.table)
# Seed
set.seed(1L)
# Big enough data.table
dt <- data.table(x = sample(1e5L, 1e7L, TRUE), y = runif(100L))
# Save in parquet format
write_parquet(dt, "example_ok.parquet")
# Readable
dt_ok <- open_dataset("example_ok.parquet")
# Simple filter on x
dt[x == 989L]
# Save the same data.table in parquet format again
write_parquet(dt, "example_error.parquet")
# Error: Couldn't deserialize thrift: TProtocolException: Exceeded size limit
dt_error <- open_dataset("example_error.parquet")
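For context on the reprex: the only difference between the two writes is the filter dt[x == 989L]. My assumption is that data.table's auto-indexing attaches a large secondary index to dt as an attribute, and that write_parquet() stores R attributes in the file's schema metadata, which is what exceeds the thrift limit. A quick check on the writing side, assuming that is indeed the cause:

library(arrow)
library(data.table)
# Continuing from the reprex above: after the filter, dt carries a secondary
# index created by data.table's auto-indexing.
indices(dt)            # should report "x"
names(attributes(dt))  # now includes "index"
# Dropping the index before writing keeps the schema metadata small, and the
# resulting file opens with the default limits.
setindex(dt, NULL)
write_parquet(dt, "example_fixed.parquet")
open_dataset("example_fixed.parquet")

This of course only helps when the file can still be regenerated, which is not the case described above.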

Solution

  • As mentioned in the GitLab issue, adding the parameter thrift_string_size_limit makes it possible to read the parquet file. Note that this only raises the limit on the reading side; the file's schema metadata is still just as bloated.

    dt_error <- open_dataset("example_error.parquet", thrift_string_size_limit = 1000000000)
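
  • A follow-up sketch (my addition, not from the issue): once the file can be read with the raised limit, collect it and write a clean copy so the workaround is only needed once. The attribute name "index" is an assumption based on the reprex (data.table's secondary index).

    library(arrow)
    library(dplyr)
    # Read the problematic file once with the raised thrift limit.
    dt_fixed <- open_dataset("example_error.parquet",
                             thrift_string_size_limit = 1000000000) |>
      collect()
    # Drop the oversized attribute in case arrow restored it from the stored
    # R metadata, then write a clean copy.
    attr(dt_fixed, "index") <- NULL
    write_parquet(dt_fixed, "example_clean.parquet")
    # The clean copy opens with the default limits.
    open_dataset("example_clean.parquet")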