[EDIT : Please see the answer below to understand the error and this SO question if you need to open the .parquet file]
My institution is slowly transiting from SAS to R, most of the code is written in arrow/dplyr or data.table, using the .parquet format as its main storing format. In my own work I am usually dealing with storing and analysing data from 1 million to 10 million rows and up to 150-200 columns. Parquet format is great for this kind of usage, but an unusual error has been occuring recently, and I couldn't find any ressource on the internet :
library(arrow)
library(tidyverse)
open_dataset(data_error)
Error in `open_dataset()`:
! IOError: Error creating dataset. Could not read schema from 'path/example.parquet'.
Is this a 'parquet' file?:
Could not open Parquet input source 'path/example.parquet':
Couldn't deserialize thrift: TProtocolException: Exceeded size limit
The same would happen with the function read_parquet
.
data_error
?data_error
is just a typical data.frame, extracted from a bigger data source (let's call it data_clean
), through a few data.table
processes and saved by write_parquet
unpartitionned. Please note that this error does not occur if the parquet file is partionnized.
This error first happened on a program on data.table that I didn't write, and I'm not familiar enough with data.table to understand the underlying issue.
library(arrow)
library(data.table)
# Seed
set.seed(1L)
# Big enough data.table
dt = data.table(x = sample(1e5L, 1e7L, TRUE), y = runif(100L))
# Save in parquet format
write_parquet(dt, "example_ok.parquet")
# Readable
dt_ok <- open_dataset("example_ok.parquet")
# Simple filter
dt[x == 989L]
# Save in parquet format
write_parquet(dt, "example_error.parquet")
# Error
dt_error <- open_dataset("example_error.parquet")
Thank you all for your help !
The culprit is that once you call dt[x == 989L]
, an index is created in the data.table
.
set.seed(1L)
dt = data.table(x = sample(1e5L, 1e7L, TRUE), y = runif(100L))
attr1 <- attributes(dt)
dt[x == 989L]
attr2 <- attributes(dt)
str(attr1)
# List of 4
# $ names : chr [1:2] "x" "y"
# $ row.names : int [1:10000000] 1 2 3 4 5 6 7 8 9 10 ...
# $ class : chr [1:2] "data.table" "data.frame"
# $ .internal.selfref:<externalptr>
str(attr2)
# List of 5
# $ names : chr [1:2] "x" "y"
# $ row.names : int [1:10000000] 1 2 3 4 5 6 7 8 9 10 ...
# $ class : chr [1:2] "data.table" "data.frame"
# $ .internal.selfref:<externalptr>
# $ index : int(0)
# ..- attr(*, "__x")= int [1:10000000] 17660 25871 28519 270694 275019 419020 437190 615578 628622 739696 ...
Notice the addition of the index
attribute.
The default action of arrow
is to store attributes; one nice side-effect of this is that dt_ok
will actually be of class data.table
:
head(dt_ok) |> collect() # assuming dplyr::collect is visible?
# x y
# <int> <num>
# 1: 24388 0.4023457
# 2: 59521 0.9142361
# 3: 43307 0.2847435
# 4: 69586 0.3440578
# 5: 11571 0.1822614
# 6: 25173 0.8130521
The file size is also adversely affected (not sure if you are aware of this):
file.info(list.files(pattern = "*parquet"))
# size isdir mode mtime ctime atime uid gid uname grname
# example_error.parquet 209818297 FALSE 664 2024-02-12 11:34:05 2024-02-12 11:34:05 2024-02-12 11:34:05 1000 1000 r2 r2
# example_ok.parquet 25744071 FALSE 664 2024-02-12 11:33:58 2024-02-12 11:33:58 2024-02-12 11:33:58 1000 1000 r2 r2
Clearly the _error
file has something more. The normal efficiency of binary-data-storage in parquet
files is not afforded to R attributes, so it makes sense that 10Mi values in a vector stored less-efficiently would take up that space.
If we remove the index
, the problem goes away. One way to remove the index is to manually set the order:
write_parquet(dt, "example_error.parquet")
setorder(dt)
names(attributes(dt))
# [1] "names" "row.names" "class" ".internal.selfref"
write_parquet(dt, "example_ok2.parquet")
open_dataset("example_error.parquet")
# Error in open_dataset("example_error.parquet") :
# IOError: Error creating dataset. Could not read schema from '.../example_error.parquet'. Is this a 'parquet' file?: Could not open Parquet input source '.../example_error.parquet': Couldn't deserialize thrift: TProtocolException: Exceeded size limit
open_dataset("example_ok2.parquet")
# FileSystemDataset with 1 Parquet file
# x: int32
# y: double
# See $metadata for additional Schema metadata
My immediate thought is that this is a bug, perhaps due to the size of the attribute. For demonstration, if we instead repeat this with 100 rows, we have no problem.
set.seed(1L)
dt = data.table(x = sample(1e5L, 1e2L, TRUE), y = runif(100L))
write_parquet(dt, "small1.parquet")
open_dataset("small1.parquet")
# FileSystemDataset with 1 Parquet file
# x: int32
# y: double
# See $metadata for additional Schema metadata
dt[x == 8229L]
# x y
# <int> <num>
# 1: 8229 0.62041
str(attributes(dt))
# List of 5
# $ names : chr [1:2] "x" "y"
# $ row.names : int [1:100] 1 2 3 4 5 6 7 8 9 10 ...
# $ class : chr [1:2] "data.table" "data.frame"
# $ .internal.selfref:<externalptr>
# $ index : int(0)
# ..- attr(*, "__x")= int [1:100] 35 50 21 78 98 33 9 77 88 94 ...
write_parquet(dt, "small2.parquet")
open_dataset("small2.parquet")
# FileSystemDataset with 1 Parquet file
# x: int32
# y: double
# See $metadata for additional Schema metadata
I suggest (request, even) that you submit a bug report.