python r analytics rpy2

How to load an RTYPES.NILSXP data object when using rpy2?


rpy2 lets me use some, but not all, of the values returned by a function (dea()) from library(Benchmarking) in R when calling it from Python, because some of them come back as RTYPES.NILSXP instead of an ndarray or int. How do I get the data out of an RTYPES.NILSXP object?

Set-up:

#imports 
import pandas as pd

import rpy2
import rpy2.robjects as robjects
import rpy2.robjects.packages as rpackages
from rpy2.robjects.vectors import StrVector
from rpy2.robjects.packages import importr
utils = rpackages.importr('utils')
utils.chooseCRANmirror(ind=1)

#helps work with pd.DataFrame objects
from rpy2.robjects import pandas2ri

from rpy2.robjects.conversion import localconverter


#import the R library into Python
packnames = ('Benchmarking',)  # trailing comma makes this a tuple rather than a plain string
utils.install_packages(StrVector(packnames))

Benchmarking = importr('Benchmarking')
base = importr('base')

data = pd.read_csv("path/to_data.csv")

#call the function I want and store it in a variable
with localconverter(robjects.default_converter + pandas2ri.converter):
  crs = Benchmarking.dea(data['Age'], data['CO2'], RTS='crs', ORIENTATION='in')

crs['eff'] or crs['lambda'] work fine and return ndarrays

crs
____________________________________________________________________
o{'eff': [1.    0.625 0.5  ], 'lambda': [[1.   0.   0.  ]
 [1.25 0.   0.  ]
 [1.5  0.   0.  ]], 'objval': [1.    0.625 0.5  ], 'RTS': [1] "crs"
, 'primal': <rpy2.rinterface_lib.sexp.NULLType object at 0x00000220BCB0D1C0> [RTYPES.NILSXP], 'dual': <rpy2.rinterface_lib.sexp.NULLType object at 0x00000220BCB0D1C0> [RTYPES.NILSXP], 'ux': <rpy2.rinterface_lib.sexp.NULLType object at 0x00000220BCB0D1C0> [RTYPES.NILSXP], 'vy': <rpy2.rinterface_lib.sexp.NULLType object at 0x00000220BCB0D1C0> [RTYPES.NILSXP], 'gamma': function (x)  .Primitive("gamma")
, 'ORIENTATION': [1] "in"
, 'TRANSPOSE': [1] FALSE
, 'param': <rpy2.rinterface_lib.sexp.NULLType object at 0x00000220BCB0D1C0> [RTYPES.NILSXP], }

So far so good.

However, there is more useful data that I would like to extract, e.g.

crs['dual']
_______________________________________________________________
<rpy2.rinterface_lib.sexp.NULLType object at 0x00000220BCB0D1C0> [RTYPES.NILSXP]

What kind of object is this?

Searching for RTYPES.NILSXP in the rpy2 3.5.3 docs takes me to a single page, which is the only mention I have found.

I have no idea how to read this. The docs explain that datasets can be serialised R objects or serialised R code that produces the dataset. rpy2 employs 'lazy loading', and to load the data one must use the method fetch(), but I don't seem to be able to use it correctly to load the rest of the outputs from dea(x, y, *args).

Failed attempts to load data


rpy2.robjects.packages.PackageData.fetch(crs['dual'])
_______________________________________________________________
TypeError: PackageData.fetch() missing 1 required positional argument: 'name'

I've found fetch() method belongs to PackageData. I've tried to call it but now it asks me for the 'name' of this dataset?? I thought crs['dual'] was enough information. When I pass in 'dual' as the name parameter I get

rpy2.robjects.packages.PackageData.fetch(r_from_df_crs['dual'], 'dual')

File ~\anaconda3\envs\UROP_buildings_env\lib\site-packages\rpy2\robjects\packages.py:143, in PackageData.fetch(self, name)
    136 def fetch(self, name):
    137     """ Fetch the dataset (loads it or evaluates the R associated
    138     with it.
    139 
    140     In R, datasets are loaded into the global environment by default
    141     but this function returns an environment that contains the dataset(s).
    142     """
--> 143     if self._datasets is None:
    144         self._init_setlist()
    146     if name not in self._datasets:

AttributeError: 'NULLType' object has no attribute '_datasets'

So I am stuck. How can I deserialise this <RTYPES.NILSXP> object from memory?

Edit:

So, taking into account @igautier's answer, I've used the rpackages.data() function to instantiate the PackageData class and enable access to .fetch():

my_crs = rpackages.data(crs)
_________________________________
AttributeError                            Traceback (most recent call last)
Input In [7], in <cell line: 1>()
----> 1 my_crs = rpackages.data(crs)

File ~\anaconda3\envs\UROP_buildings_env\lib\site-packages\rpy2\robjects\packages.py:522, in data(package)
    520 def data(package):
    521     """ Return the PackageData for the given package."""
--> 522     return package.__rdata__

AttributeError: 'OrdDict' object has no attribute '__rdata__'

How can I now convert my 'OrdDict' into PackageData?


In a nutshell, the solution was that I had omitted the SLACK=True and DUAL=True arguments in the call to dea(), and hence the dual and slack results were returned as NULL.
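
A minimal sketch of that corrected call, reusing the column names and options from the question. The function name run_dea_with_duals and the DEA_OPTIONS dictionary are introduced here for illustration only and are not part of rpy2 or Benchmarking. The imports sit inside the function so that defining it does not start R; calling it assumes R, rpy2 and the Benchmarking package are installed.

```python
# Options passed through to Benchmarking::dea(); SLACK and DUAL are the
# two flags that were missing from the original call.
DEA_OPTIONS = {'RTS': 'crs', 'ORIENTATION': 'in', 'SLACK': True, 'DUAL': True}


def run_dea_with_duals(data):
    """Call dea() on a pandas DataFrame, requesting slacks and dual values.

    `data` is assumed to have the 'Age' and 'CO2' columns used in the question.
    """
    # Local imports: rpy2 (and therefore R) is only needed when this is called.
    import rpy2.robjects as robjects
    from rpy2.robjects import pandas2ri
    from rpy2.robjects.conversion import localconverter
    from rpy2.robjects.packages import importr

    benchmarking = importr('Benchmarking')
    with localconverter(robjects.default_converter + pandas2ri.converter):
        return benchmarking.dea(data['Age'], data['CO2'], **DEA_OPTIONS)
```

Called as crs = run_dea_with_duals(data), the dual and slack elements of the result should then be populated rather than R NULL; the documentation of dea() lists which elements each flag fills.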


Solution

  • However, there is more useful data that I would like to extract, e.g.

    crs['dual']
    _______________________________________________________________
    <rpy2.rinterface_lib.sexp.NULLType object at 0x00000220BCB0D1C0> [RTYPES.NILSXP]
    

    What kind of object is this?

    This is R's NULL object. It is pretty much like a NULL in C, or a null in Java or JavaScript. With the increasing use of Python for data science, None came to mean either the equivalent of a NULL, or a missing value (which is an NA or variant in R). The default conversion returns an R NULL rather than converting it to None, to avoid confusion. This was an early design decision, though. If there is a need to revisit it, an issue should be opened on the project's page (on GitHub).

    Searching for RTYPES.NILSXP in the rpy2 3.5.3 docs takes me to a single page, which is the only mention I have found.

    The mention on that page refers to the default value for a named argument (https://rpy2.github.io/doc/latest/html/robjects_rpackages.html#rpy2.robjects.packages.PackageData). I see that the documentation is incomplete: lib_loc is an optional path for the class constructor indicating where the R package is installed.

    I have no idea how to read this. The docs explain that datasets can be serialised R objects or serialised R code that produces the dataset. rpy2 employs 'lazy loading', and to load the data one must use the method fetch(), but I don't seem to be able to use it correctly to load the rest of the outputs from dea(x, y, *args).

    What is meant here is that "data" objects in R packages are not necessarily serialized R data structures like the ones the R functions save(), dump(), or dput() can help produce. They can also be R scripts. For example, an R package can have a data object "myrandnorm100" that is an R script data/myrandnorm100.R in the installed package's directory, and that script will be evaluated using the R function source() (see https://rdrr.io/r/utils/data.html). That script can define an arbitrary number of variables. Note that serialized R data (for example in an .RData file) can also contain several named objects. The design choice for rpy2 was to try to make things a little safer and more predictable by keeping those names within a namespace. Silent name clashes can be at the root of challenging bugs in code.

    (...)

    rpy2.robjects.packages.PackageData.fetch(crs['dual'])


    TypeError: PackageData.fetch() missing 1 required positional argument: 'name'

    I've found fetch() method belongs to PackageData. I've tried to call it but now it asks me for the 'name' of this dataset??

    Yes. The PackageData object is like a namespace with as many named objects as the author of the R package wanted to include.

    I thought crs['dual'] was enough information. When I pass in 'dual' as the name parameter I get rpy2.robjects.packages.PackageData.fetch(r_from_df_crs['dual'], 'dual')

    File ~\anaconda3\envs\UROP_buildings_env\lib\site-packages\rpy2\robjects\packages.py:143, in PackageData.fetch(self, name)
        136 def fetch(self, name):
        137     """ Fetch the dataset (loads it or evaluates the R associated
        138     with it.
        139 
        140     In R, datasets are loaded into the global environment by default
        141     but this function returns an environment that contains the dataset(s).
        142     """
    --> 143     if self._datasets is None:
        144         self._init_setlist()
        146     if name not in self._datasets:
    
    AttributeError: 'NULLType' object has no attribute '_datasets'
    

    Well, this is not how PackageData objects can be instantiated. I'll use the R package datasets as an example since it is part of the R standard library.

    # Import the R package "datasets"
    datasets = importr('datasets')
    # That package only contains datasets. The Python object will look
    # like it has no (useful) attributes. We can create an instance
    # for the data in the package with:
    datasets_data = rpackages.data(datasets)
    # All dataset names are available through `datasets_data.names()`.
    # We know the name of the one we want.
    mtcars_env = datasets_data.fetch('mtcars')
    # mtcars_env is an R "environment", wrapped as an `rpy2.robjects.Environment`
    

    Note: I am seeing now that the doc has that information, but it is only available scattered across examples rather than also on the page about packages (see https://rpy2.github.io/doc/latest/html/search.html?q=data+fetch).

    However, what you have here is not an R package dataset. Your object crs is the result of calling a function:

    crs = Benchmarking.dea(data['Age'], data['CO2'],
                           RTS='crs', ORIENTATION='in')
    

    There is likely no element named "dual" in crs, which a look at the documentation for that function seems to confirm.
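
To check from Python which elements of a result such as crs are R NULL, one option is to compare each value by identity against the NULL object. The sketch below is illustrative only: null_fields and null_to_none are hypothetical helper names, not rpy2 functions. It assumes the result exposes items() like a dict, and that rpy2 represents R NULL as one shared object (as the identical addresses in the output above suggest), available as rpy2.rinterface.NULL. A stand-in sentinel is used so that the example runs without R.

```python
def null_fields(result, null):
    """Return the names in `result` whose value is the `null` sentinel."""
    # Identity (`is`) is used instead of `==`, which would be ambiguous
    # for array-valued entries such as 'eff' or 'lambda'.
    return [name for name, value in result.items() if value is null]


def null_to_none(result, null):
    """Copy `result` into a plain dict, mapping the `null` sentinel to None."""
    return {name: (None if value is null else value)
            for name, value in result.items()}


# Stand-in data so the sketch runs without R. With rpy2 one would pass the
# real result and rpy2.rinterface.NULL instead of these placeholders.
FAKE_NULL = object()
fake_crs = {'eff': [1.0, 0.625, 0.5], 'dual': FAKE_NULL, 'ux': FAKE_NULL}

print(null_fields(fake_crs, FAKE_NULL))           # ['dual', 'ux']
print(null_to_none(fake_crs, FAKE_NULL)['dual'])  # None
```

Mapping NULL to None this way is an explicit, local choice; as explained above, rpy2 does not do it by default because None can also be read as a missing value.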