python, r, parquet, pyarrow, apache-arrow

Reading an R data.table stored as Parquet file metadata from Python


With R, I created a Parquet file containing a data.table as main data, and another data.table as metadata.

library(data.table)
library(arrow)
dt = data.table(x = c(1, 2, 3), y = c("a", "b", "c"))
dt2 = data.table(a = 22222, b = 45555)

attr(dt, "dt_meta") = dt2
tb = arrow_table(dt)
tb$metadata

write_parquet(tb, "file.parquet")

Attributes/metadata can be accessed easily when loading the Parquet file in R:

dt = open_dataset("file.parquet")
dt$metadata$r$attributes$dt_meta

dt2 = read_parquet("file.parquet")
attributes(dt2)$dt_meta

Now I wonder whether it is also possible to retrieve the data.table (or data.frame) from the metadata of the Parquet file in Python.

The metadata can be accessed in Python with the pyarrow library, and the r field is there, but it is returned as raw bytes rather than a decoded object.

import pyarrow.parquet as pq
mt = pq.read_metadata("file.parquet")
metadata = mt.metadata[b'r']
metadata

Result:

b'A\n3\n263169\n197888\n5\nUTF-8\n531\n2\n531\n3\n16\n2\n262153\n10\ndata.table\n262153\n10\ndata.frame\n22\n22\n254\n254\n16\n2\n262153\n1\nx\n262153\n1\ny\n787\n2\n14\n1\n22222\n14\n1\n45555\n1026\n1\n262153\n5\nnames\n16\n2\n262153\n1\na\n262153\n1\nb\n1026\n1\n262153\n9\nrow.names\n13\n2\nNA\n-1\n1026\n1\n262153\n5\nclass\n16\n2\n262153\n10\ndata.table\n262153\n10\ndata.frame\n1026\n1\n262153\n17\n.internal.selfref\n22\n22\n254\n254\n16\n2\n262153\n1\na\n262153\n1\nb\n254\n1026\n1023\n16\n3\n262153\n5\nclass\n262153\n17\n.internal.selfref\n262153\n7\ndt_meta\n254\n531\n2\n254\n254\n1026\n1023\n16\n2\n262153\n1\nx\n262153\n1\ny\n254\n1026\n1023\n16\n2\n262153\n10\nattributes\n262153\n7\ncolumns\n254\n'

Is it still an R attribute object, or another encoded object?

The names of the different attributes (e.g. dt_meta) can be read in the resulting byte string, but is it possible to fully decode and parse it to retrieve the dt_meta table as a DataFrame?
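On the first question, the leading fields of the byte string shown above look like the header of R's ASCII serialization format, i.e. what `serialize(x, NULL, ascii = TRUE)` writes: a format marker `A`, the serialization version, the R version that wrote it, the minimum R version able to read it, and the native encoding. The sketch below decodes only that header using the Python standard library. The function names are made up for illustration, and it does not attempt to decode the object that follows, which would need a full implementation of R's `unserialize()`.

```python
# Header portion of the b'r' metadata value from the output above.
raw = b"A\n3\n263169\n197888\n5\nUTF-8\n531\n2\n"


def unpack_r_version(packed: int) -> str:
    """R packs a version number as major * 65536 + minor * 256 + patch."""
    return f"{packed >> 16}.{(packed >> 8) & 0xFF}.{packed & 0xFF}"


def parse_r_ascii_header(data: bytes) -> dict:
    """Parse the header fields of an R ASCII serialization stream."""
    fields = data.decode("ascii").split("\n")
    if fields[0] != "A":
        raise ValueError("not an R ASCII serialization stream")
    header = {
        "serialization_version": int(fields[1]),
        "written_by_r": unpack_r_version(int(fields[2])),
        "min_reader_r": unpack_r_version(int(fields[3])),
    }
    if header["serialization_version"] >= 3:
        # Version 3 streams also record the native encoding:
        # first its length, then its name.
        length = int(fields[4])
        header["native_encoding"] = fields[5][:length]
    return header


header = parse_r_ascii_header(raw)
print(header)
```

So the value is still a serialized R object (a list holding `attributes` and `columns`), just in R's own text encoding rather than anything Arrow or pandas defines.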


Solution

  • See https://arrow.apache.org/docs/r/articles/metadata.html

    Note that the attributes stored in $metadata$r are only understood by R. If you write a data.frame with haven columns to a Feather file and read that in Pandas, the haven metadata won’t be recognized there. Similarly, Pandas writes its own custom metadata, which the R package does not consume. You are free, however, to define custom metadata conventions for your application and assign any (string) values you want to other metadata keys.

    So what you are looking for is not supported: pyarrow hands you the raw bytes, and nothing on the Python side of Arrow interprets them. It is also not advisable for another reason: saving actual data (as you want to do) in the metadata loses all the benefits that the columnar format of Arrow/Parquet gives you. You won't notice this with this small example, but I am guessing that you want to work with more than 3 values :D

    I am not sure what you are aiming to achieve. Why not just cbind the data.tables and save them in one file?
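If a small side table really has to travel inside the file, the "custom metadata conventions" mentioned in the quoted documentation could look roughly like the sketch below: the side table is stored as a JSON string under its own metadata key, which any language can decode. This is only an illustration under assumptions. The key name `dt_meta_json`, the output file name, and all function names are hypothetical, and the R writer would have to store the same JSON string under the same key for Python to find it.

```python
import json

# Hypothetical key name; any key other than the reserved b"r" would do.
META_KEY = b"dt_meta_json"


def encode_side_table(columns: dict) -> bytes:
    """Serialize a column-oriented dict, e.g. {"a": [22222]}, to JSON bytes."""
    return json.dumps(columns).encode("utf-8")


def decode_side_table(raw: bytes) -> dict:
    """Inverse of encode_side_table."""
    return json.loads(raw.decode("utf-8"))


def write_with_side_table(path, main_columns: dict, side_columns: dict) -> None:
    """Write main_columns as a Parquet table, with side_columns in its metadata."""
    # Imported here so the JSON helpers above work without pyarrow installed.
    import pyarrow as pa
    import pyarrow.parquet as pq

    table = pa.table(main_columns)
    existing = table.schema.metadata or {}
    table = table.replace_schema_metadata(
        {**existing, META_KEY: encode_side_table(side_columns)}
    )
    pq.write_table(table, path)


def read_side_table(path) -> dict:
    """Read the side table back from the Parquet schema metadata."""
    import pyarrow.parquet as pq

    metadata = pq.read_schema(path).metadata or {}
    return decode_side_table(metadata[META_KEY])
```

With pyarrow installed, `write_with_side_table("file_py.parquet", {"x": [1, 2, 3], "y": ["a", "b", "c"]}, {"a": [22222], "b": [45555]})` followed by `read_side_table("file_py.parquet")` should return the side table as a plain dict, which `pandas.DataFrame(...)` accepts directly. The caveat from the answer still applies: this is only sensible for small tables, and anything larger is better written as its own Parquet file.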