I'd like to process Apache Parquet files (in my case, generated in Spark) in the R programming language.
Is an R reader available? Or is work being done on one?
If not, what would be the most expedient way to get there? Note: there are Java (https://github.com/apache/parquet-mr) and C++ (https://github.com/apache/parquet-cpp) implementations.
The simplest way to do this is with the arrow package, which is available on CRAN.
install.packages("arrow")
library(arrow)
read_parquet("somefile.parquet")
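Writing works the same way via write_parquet. A minimal round-trip sketch (reusing the example file name from above):

```r
library(arrow)

# write a data frame to a Parquet file, then read it back
df <- data.frame(x = 1:10, y = rnorm(10))
write_parquet(df, "somefile.parquet")
df2 <- read_parquet("somefile.parquet")
```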
Previously this required going through Python with pyarrow, but arrow is nowadays also packaged for R, with no Python dependency.
If you do not want to install from CRAN you can build directly, or install from GitHub:
git clone https://github.com/apache/arrow.git
cd arrow/cpp && mkdir release && cd release
# It is important to statically link to boost libraries
cmake .. -DARROW_PARQUET=ON -DCMAKE_BUILD_TYPE=Release -DARROW_BOOST_USE_SHARED:BOOL=Off
make install
Then you can install the R arrow package:
devtools::install_github("apache/arrow/r")
And use it to load a Parquet file:
library(arrow)
#>
#> Attaching package: 'arrow'
#> The following object is masked from 'package:utils':
#>
#> timestamp
#> The following objects are masked from 'package:base':
#>
#> array, table
read_parquet("somefile.parquet", as_tibble = TRUE)
#> # A tibble: 10 x 2
#> x y
#> <int> <dbl>
#> …