I'm wondering how efficiently reticulate handles memory with Python objects.
Suppose I have a 5GB pandas DataFrame called data_pandas in reticulate::python, and I'd like to analyze it in R.
When I access the object from R as py$data_pandas, does reticulate copy the DataFrame into an R data.frame internally (i.e. create another 5GB data.frame in R)?
And what about the reverse (accessing an R data.frame from Python)?
The answer is no, provided you use the reticulate package carefully.
Thanks to the arrow project, passing in-memory data between R and Python (and other languages) is inexpensive.
To quote the relevant passage from a blog post I found:
if your data are stored as an Arrow Table, and you use the reticulate package to pass it from R to Python (or vice versa), only the metadata changes hands. Because an Arrow Table has the same structure in-memory when accessed from Python as it does in R, the data set itself does not need to be touched at all. The only thing that needs to happen is the language on the receiving end needs to be told where the data are stored. Or, to put it another way, we just pass a pointer across. This all happens invisibly, so if you know how to use reticulate, you already know almost everything you need to know and can skip straight to the section on passing Arrow objects.
https://voltrondata.com/resources/passing-arrow-data-between-r-and-python-with-reticulate