After reading in a .csv file using pandas, and then converting it into an R dataframe using the rpy2 package, I created a model using some R functions (also via rpy2), and now want to take the summary of the model and convert it into a Pandas dataframe (so that I can either save it as a .csv file or use it for other purposes).
I have followed out the instructions on the pandas site (source: https://pandas.pydata.org/pandas-docs/stable/r_interface.html) in order to figure it out:
import pandas as pd
from rpy2.robjects import r
import sys
import rpy2.robjects.packages as rpackages
from rpy2.robjects.vectors import StrVector
from rpy2.robjects import r, pandas2ri
pandas2ri.activate()
caret = rpackages.importr('caret')
broom= rpackages.importr('broom')
my_data= pd.read_csv("my_data.csv")
r_dataframe= pandas2ri.py2ri(my_data)
preprocessing= ["center", "scale"]
center_scale= StrVector(preprocessing)
#these are the columns in my data frame that will consist of my predictors in the model
predictors= ['predictor1','predictor2','predictor3']
predictors_vector= StrVector(predictors)
#this column from the dataframe consists of the outcome of the model
outcome= ['fluorescence']
outcome_vector= StrVector(outcome)
#this line extracts the columns of the predictors from the dataframe
columns_predictors= r_dataframe.rx(True, columns_vector)
#this line extracts the column of the outcome from the dataframe
column_response= r_dataframe.rx(True, column_response)
cvCtrl = caret.trainControl(method = "repeatedcv", number= 20, repeats = 100)
model_R= caret.train(columns_predictors, columns_response, method = "glmStepAIC", preProc = center_scale, trControl = cvCtrl)
summary_model= base.summary(model_R)
coefficients= stats.coef(summary_model)
pd_dataframe = pandas2ri.ri2py(coefficients)
pd_dataframe.to_csv("coefficents.csv")
Although this workflow is ostensibly correct, the output .csv file did not meet my needs, as the names of the columns and rows were removed. When I ran the command type(pd_dataframe)
, I find that it is a <type 'numpy.ndarray'>
. Although the information of the table is still present, the new formatting has removed the names of the columns and rows.
So I ran the command type(coefficients)
and found that it was a <class 'rpy2.robjects.vectors.Matrix'>
. Since this Matrix object still retained the names of my columns and rows, I tried to convert it into an R objects DataFrame, but my efforts proved to be futile. Furthermore, I don't know why the line pd_dataframe = pandas2ri.ri2py(coefficients)
did not yield a pandas DataFrame object, nor why it did not retain the names of my columns and rows.
Can anybody recommend an approach so I can get some kind of pandas DataFrame that retains the names of my columns and rows?
UPDATE
A new method was mentioned in the documents of a slightly older version of the package called pandas2ri.ri2py_dataframe
(source: https://rpy2.readthedocs.io/en/version_2.7.x/changes.html), and now I have a proper data frame instead of the numpy array. However, I still can't get the names of the rows and columns to be transferred properly. Any suggestions?
May be it should happen automatically during conversion, but in the meantime row and column names can easily be obtained from the R object and added to the pandas DataFrame
. For example the column names for the R matrix should be at: https://rpy2.github.io/doc/v2.9.x/html/vector.html#rpy2.robjects.vectors.Matrix.colnames