rdata.tableapache-arrow

Error package arrowR : read_parquet/open_dataset "Couldn't deserialize thrift: TProtocolException: Exceeded size limit"


[EDIT : Please see the answer below to understand the error and this SO question if you need to open the .parquet file]

My institution is slowly transiting from SAS to R, most of the code is written in arrow/dplyr or data.table, using the .parquet format as its main storing format. In my own work I am usually dealing with storing and analysing data from 1 million to 10 million rows and up to 150-200 columns. Parquet format is great for this kind of usage, but an unusual error has been occuring recently, and I couldn't find any ressource on the internet :

library(arrow)
library(tidyverse)

open_dataset(data_error)
Error in `open_dataset()`:
! IOError: Error creating dataset. Could not read schema from 'path/example.parquet'. 
Is this a 'parquet' file?: 
Could not open Parquet input source 'path/example.parquet':
Couldn't deserialize thrift: TProtocolException: Exceeded size limit

The same would happen with the function read_parquet.

What is data_error ?

data_error is just a typical data.frame, extracted from a bigger data source (let's call it data_clean), through a few data.table processes and saved by write_parquet unpartitionned. Please note that this error does not occur if the parquet file is partionnized.

This error first happened on a program on data.table that I didn't write, and I'm not familiar enough with data.table to understand the underlying issue.

repex :

library(arrow)
library(data.table)
# Seed
set.seed(1L)
# Big enough data.table 
dt = data.table(x = sample(1e5L, 1e7L, TRUE), y = runif(100L)) 
# Save in parquet format
write_parquet(dt, "example_ok.parquet")
# Readable
dt_ok <- open_dataset("example_ok.parquet")
# Simple filter 
dt[x == 989L]
# Save in parquet format
write_parquet(dt, "example_error.parquet")
# Error
dt_error <- open_dataset("example_error.parquet")

Thank you all for your help !


Solution

  • The culprit is that once you call dt[x == 989L], an index is created in the data.table.

    set.seed(1L)
    dt = data.table(x = sample(1e5L, 1e7L, TRUE), y = runif(100L))
    attr1 <- attributes(dt)
    dt[x == 989L]
    attr2 <- attributes(dt)
    str(attr1)
    # List of 4
    #  $ names            : chr [1:2] "x" "y"
    #  $ row.names        : int [1:10000000] 1 2 3 4 5 6 7 8 9 10 ...
    #  $ class            : chr [1:2] "data.table" "data.frame"
    #  $ .internal.selfref:<externalptr> 
    str(attr2)
    # List of 5
    #  $ names            : chr [1:2] "x" "y"
    #  $ row.names        : int [1:10000000] 1 2 3 4 5 6 7 8 9 10 ...
    #  $ class            : chr [1:2] "data.table" "data.frame"
    #  $ .internal.selfref:<externalptr> 
    #  $ index            : int(0) 
    #   ..- attr(*, "__x")= int [1:10000000] 17660 25871 28519 270694 275019 419020 437190 615578 628622 739696 ...
    

    Notice the addition of the index attribute.

    The default action of arrow is to store attributes; one nice side-effect of this is that dt_ok will actually be of class data.table:

    head(dt_ok) |> collect() # assuming dplyr::collect is visible?
    #        x         y
    #    <int>     <num>
    # 1: 24388 0.4023457
    # 2: 59521 0.9142361
    # 3: 43307 0.2847435
    # 4: 69586 0.3440578
    # 5: 11571 0.1822614
    # 6: 25173 0.8130521
    

    The file size is also adversely affected (not sure if you are aware of this):

    file.info(list.files(pattern = "*parquet"))
    #                            size isdir mode               mtime               ctime               atime  uid  gid uname grname
    # example_error.parquet 209818297 FALSE  664 2024-02-12 11:34:05 2024-02-12 11:34:05 2024-02-12 11:34:05 1000 1000    r2     r2
    # example_ok.parquet     25744071 FALSE  664 2024-02-12 11:33:58 2024-02-12 11:33:58 2024-02-12 11:33:58 1000 1000    r2     r2
    

    Clearly the _error file has something more. The normal efficiency of binary-data-storage in parquet files is not afforded to R attributes, so it makes sense that 10Mi values in a vector stored less-efficiently would take up that space.

    If we remove the index, the problem goes away. One way to remove the index is to manually set the order:

    write_parquet(dt, "example_error.parquet")
    setorder(dt)
    names(attributes(dt))
    # [1] "names"             "row.names"         "class"             ".internal.selfref"
    write_parquet(dt, "example_ok2.parquet")
    open_dataset("example_error.parquet")
    # Error in open_dataset("example_error.parquet") : 
    #   IOError: Error creating dataset. Could not read schema from '.../example_error.parquet'. Is this a 'parquet' file?: Could not open Parquet input source '.../example_error.parquet': Couldn't deserialize thrift: TProtocolException: Exceeded size limit
    open_dataset("example_ok2.parquet")
    # FileSystemDataset with 1 Parquet file
    # x: int32
    # y: double
    # See $metadata for additional Schema metadata
    

    My immediate thought is that this is a bug, perhaps due to the size of the attribute. For demonstration, if we instead repeat this with 100 rows, we have no problem.

    set.seed(1L)
    dt = data.table(x = sample(1e5L, 1e2L, TRUE), y = runif(100L))
    write_parquet(dt, "small1.parquet")
    open_dataset("small1.parquet")
    # FileSystemDataset with 1 Parquet file
    # x: int32
    # y: double
    # See $metadata for additional Schema metadata
    dt[x == 8229L]
    #        x       y
    #    <int>   <num>
    # 1:  8229 0.62041
    str(attributes(dt))
    # List of 5
    #  $ names            : chr [1:2] "x" "y"
    #  $ row.names        : int [1:100] 1 2 3 4 5 6 7 8 9 10 ...
    #  $ class            : chr [1:2] "data.table" "data.frame"
    #  $ .internal.selfref:<externalptr> 
    #  $ index            : int(0) 
    #   ..- attr(*, "__x")= int [1:100] 35 50 21 78 98 33 9 77 88 94 ...
    write_parquet(dt, "small2.parquet")
    open_dataset("small2.parquet")
    # FileSystemDataset with 1 Parquet file
    # x: int32
    # y: double
    # See $metadata for additional Schema metadata
    

    I suggest (request, even) that you submit a bug report.