Following this question on SO relating to this error:
Error in `open_dataset()`:
! IOError: Error creating dataset. Could not read schema from 'path/example.parquet'.
Is this a 'parquet' file?:
Could not open Parquet input source 'path/example.parquet':
Couldn't deserialize thrift: TProtocolException: Exceeded size limit
I'm wondering whether there is any way to still use the parquet file, even though its schema is oversaturated with attributes. This situation can arise when the script that produced the parquet file has been deleted, or when asking for (or regenerating) a cleaner parquet file would cost a lot of time.
I already tried the partitioning option, but the error still occurred, no matter how many variables I partitioned by (see the sketch after the reprex below).
library(arrow)
library(data.table)
# Seed
set.seed(1L)
# Big enough data.table
dt = data.table(x = sample(1e5L, 1e7L, TRUE), y = runif(100L))
# Save in parquet format
write_parquet(dt, "example_ok.parquet")
# Readable
dt_ok <- open_dataset("example_ok.parquet")
# Simple filter: as a side effect, data.table auto-creates a secondary index on x,
# a large attribute that arrow then serialises into the parquet metadata
dt[x == 989L]
# Save in parquet format again (now carrying the oversized metadata)
write_parquet(dt, "example_error.parquet")
# Error
dt_error <- open_dataset("example_error.parquet")
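For reference, this is roughly how I tried the partitioning workaround (a sketch; the partition column part and the output directory name are just illustrative), and open_dataset() still fails with the same thrift size-limit error:
# Attempted workaround: write a partitioned dataset instead of a single file
# (illustrative partition column; any choice gave the same result for me)
dt[, part := x %% 10L]
write_dataset(dt, "example_partitioned", format = "parquet", partitioning = "part")
# Still errors with "Couldn't deserialize thrift: TProtocolException: Exceeded size limit"
dt_part <- open_dataset("example_partitioned")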
As mentioned in the GitHub issue, passing the thrift_string_size_limit parameter makes it possible to read the parquet file. Note, however, that the file itself is unchanged: its metadata is still just as bloated; the higher limit only allows it to be deserialised.
dt_error <- open_dataset("example_error.parquet", thrift_string_size_limit = 1000000000)
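If raising the limit is the only way in, I assume I could then rebuild a clean copy roughly like this (a sketch, assuming the bloat comes from the data.table index being restored on collect; setindex(NULL) is harmless if no index is present):
# Sketch: rebuild a clean copy now that the original can be opened
library(dplyr)  # for collect()
clean <- as.data.table(collect(dt_error))
setindex(clean, NULL)  # drop any secondary indices carried over as attributes
write_parquet(clean, "example_clean.parquet")
# The clean copy should then open without touching thrift_string_size_limit
open_dataset("example_clean.parquet")
Is that the recommended approach, or is there a way to drop the oversized metadata without collecting the whole table into memory?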