I am trying to turn an .rds file into a .feather file for reading with Pandas in Python.
library(feather)
# Set working directory
data = readRDS("file.rds")
data_year = data[["1986"]]
# Try 1
write_feather(
data_year,
"data_year.feather"
)
# Try 2
write_feather(
as.data.frame(as.matrix(data_year)),
"data_year.feather"
)
Try 1 returns Error: 'x' must be a data frame. Try 2 does write a *.feather file, but that file is 4.5GB for a single year, whereas the original *.rds file is only 0.055GB for several years.
How can I turn the file into *.feather files (either one per year or a single combined file) whilst keeping the file size reasonable?
data looks like this:
data_year looks like this:
Update:
I am open to any suggestions for making the data available for use in NumPy/Pandas whilst maintaining a modest file size!
With scipy and rpy2, you can read each dgCMatrix object directly into Python as a scipy.sparse.csc_matrix object. Both use compressed sparse column (CSC) format, so no preprocessing is needed. All you need to do is pass the slots of the dgCMatrix object (x, i, p, and Dim) as arguments to the csc_matrix constructor.
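To make the correspondence concrete, here is a minimal pure-Python sketch of how the three slots encode a matrix. The helper name csc_to_triplets is hypothetical (it is not part of scipy or rpy2), and the slot values are those of the 6 x 6 example matrix printed further down, written out by hand.

```python
def csc_to_triplets(data, indices, indptr):
    """Expand CSC arrays into (row, column, value) triplets.

    data    -- nonzero values, stored column by column (R slot: x)
    indices -- 0-based row index of each value (R slot: i)
    indptr  -- column j occupies positions indptr[j] up to,
               but not including, indptr[j + 1] (R slot: p)
    """
    triplets = []
    n_cols = len(indptr) - 1
    for col in range(n_cols):
        for k in range(indptr[col], indptr[col + 1]):
            triplets.append((indices[k], col, data[k]))
    return triplets


# Slot values of the example matrix l[["1986"]], written out by hand.
x = [3.0, 2.0, 5.0, 4.0, 1.0, 6.0]
i = [1, 3, 0, 5, 4, 2]
p = [0, 1, 2, 3, 4, 5, 6]

triplets = csc_to_triplets(x, i, p)
for row, col, value in triplets:
    print((row, col), value)
```

Because R stores i and p 0-based, just as scipy does, the arrays can be handed to csc_matrix unchanged. Only the nonzero values and their positions are ever stored, which is why the sparse representation stays small.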
To test it out, I used R to create an RDS file storing a list of dgCMatrix objects:
library("Matrix")
set.seed(1L)
d <- 6L
n <- 10L
l <- replicate(n, sparseMatrix(i = sample(d), j = sample(d), x = sample(d), repr = "C"), simplify = FALSE)
names(l) <- as.character(seq(1986L, length.out = n))
l[["1986"]]
## 6 x 6 sparse Matrix of class "dgCMatrix"
##
## [1,] . . 5 . . .
## [2,] 3 . . . . .
## [3,] . . . . . 6
## [4,] . 2 . . . .
## [5,] . . . . 1 .
## [6,] . . . 4 . .
saveRDS(l, file = "list_of_dgCMatrix.rds")
Then, in Python:
from scipy import sparse
from rpy2 import robjects
readRDS = robjects.r['readRDS']
l = readRDS('list_of_dgCMatrix.rds')
x = l.rx2('1986') # in R: l[["1986"]]
x
## <rpy2.robjects.methods.RS4 object at 0x120db7b00> [RTYPES.S4SXP]
## R classes: ('dgCMatrix',)
print(x)
## 6 x 6 sparse Matrix of class "dgCMatrix"
##
## [1,] . . 5 . . .
## [2,] 3 . . . . .
## [3,] . . . . . 6
## [4,] . 2 . . . .
## [5,] . . . . 1 .
## [6,] . . . 4 . .
data = x.do_slot('x') # in R: x@x
indices = x.do_slot('i') # in R: x@i
indptr = x.do_slot('p') # in R: x@p
shape = x.do_slot('Dim') # in R: x@Dim or dim(x)
y = sparse.csc_matrix((data, indices, indptr), tuple(shape))
y
## <6x6 sparse matrix of type '<class 'numpy.float64'>'
## with 6 stored elements in Compressed Sparse Column format>
print(y)
## (1, 0) 3.0
## (3, 1) 2.0
## (0, 2) 5.0
## (5, 3) 4.0
## (4, 4) 1.0
## (2, 5) 6.0
Here, y is an object of class scipy.sparse.csc_matrix. You should not need to use the toarray method to coerce y to an array with dense storage. scipy.sparse implements all of the matrix operations that I can imagine needing. For example, here are the row and column sums of y:
y.sum(1) # in R: as.matrix(rowSums(x))
## matrix([[5.],
## [3.],
## [6.],
## [2.],
## [1.],
## [4.]])
y.sum(0) # in R: t(as.matrix(colSums(x)))
## matrix([[3., 2., 5., 4., 1., 6.]])
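If you would rather not depend on rpy2 every time you load the data, one option is to convert once and save each matrix in scipy's own .npz format. Below is a minimal sketch, assuming scipy and numpy are installed. The file name and the stand-in matrix are made up for illustration; in practice y would be the matrix built from the dgCMatrix slots. scipy.sparse.save_npz writes only the stored values and their index arrays, so the file size should scale with the number of nonzeros rather than with rows times columns.

```python
import os
import tempfile

import numpy as np
from scipy import sparse

# Stand-in for the matrix built from the dgCMatrix slots
# (same values as the 6 x 6 example).
data = np.array([3.0, 2.0, 5.0, 4.0, 1.0, 6.0])
indices = np.array([1, 3, 0, 5, 4, 2])
indptr = np.array([0, 1, 2, 3, 4, 5, 6])
y = sparse.csc_matrix((data, indices, indptr), shape=(6, 6))

# Hypothetical output path; writing one file per year keeps the years separate.
path = os.path.join(tempfile.mkdtemp(), "data_1986.npz")
sparse.save_npz(path, y)   # compressed by default
z = sparse.load_npz(path)  # restored in the same sparse format

# Memory held by the sparse arrays versus a dense float64 array of the same shape.
sparse_bytes = y.data.nbytes + y.indices.nbytes + y.indptr.nbytes
dense_bytes = y.shape[0] * y.shape[1] * 8
print(sparse_bytes, dense_bytes)
```

The gap between the two byte counts grows with the share of zeros in the matrix, which is the likely reason the dense as.matrix conversion in the question produced such a large file. If a pandas object is needed later, pandas.DataFrame.sparse.from_spmatrix(z) should wrap the matrix in a DataFrame with sparse columns without densifying it.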