For Revolution R Enterprise users: is there a way to apply a function within each factor level of an .xdf file, in, say, rxCube()? I know transforms let you operate on the data before tabulation, but it seems to me you can only get count, sum, and mean.
For example, I want to find the row that has the minimum value of a particular variable within each industry * year combination.
The only solution I can think of is to rxSplit() the data, sort by the variables you want, and then do what you will. I assume this isn't supported directly because there are too many integrity conditions, or because the supported tabulation functions are optimized in C and a user-supplied function would be more complicated and terribly slow.
It would be amazing to basically have an out-of-memory data.table.
What you describe is not easily doable with a single function from RevoScaleR. The rxSplit approach you describe is one way. Here is a comparison of its results with those of an in-memory aggregate, to show they are the same.
# Build a small example data set and write it to an .xdf file
set.seed(1234)
myData <- data.frame(year = factor(sample(2000:2015, size = 100, replace = TRUE)),
                     x = rnorm(100))
xdfFile <- rxDataStep(inData = myData, outFile = "test.xdf", rowsPerRead = 10)

# Split the .xdf into one file per level of 'year'
newDir <- file.path(getwd(), "splits")
dir.create(newDir)
splitFiles <- rxSplit(inData = xdfFile,
                      outFilesBase = paste0(newDir, "/", gsub("\\.xdf$", "",
                                                             basename(xdfFile@file))),
                      splitByFactor = "year")

# Read one split into memory and return the position of its minimum
minFun <- function(xdf) {
  dat <- rxDataStep(inData = xdf, reportProgress = 0)
  data.frame(year = dat$year[1], minPos = which.min(dat$x))
}
minPos <- do.call(rbind, lapply(splitFiles, minFun))
row.names(minPos) <- NULL
minPos

# In-memory result for comparison
aggregate(x ~ year, data = myData, FUN = which.min)
The above does assume that the data in each group can fit into RAM. If that is not the case, some tweaking would be required.
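One possible form of that tweaking is to keep only the best row per group while walking through the data chunk by chunk, so that no group ever has to be loaded whole. The following is a minimal base R sketch of the idea, not RevoScaleR code: the chunks are simulated by slicing an in-memory data frame, and `rowId` and `updateMin` are hypothetical names introduced for illustration. With an .xdf file each chunk would instead have to come from a block-wise read.

```r
# Same example data as above, plus a hypothetical global row identifier
set.seed(1234)
myData <- data.frame(year = factor(sample(2000:2015, size = 100, replace = TRUE)),
                     x = rnorm(100))
myData$rowId <- seq_len(nrow(myData))

# Hypothetical helper: combine the best rows seen so far with a new chunk
# and keep, for each year, the row with the smallest x.
updateMin <- function(best, chunk) {
  candidates <- rbind(best, chunk)
  candidates <- candidates[order(candidates$year, candidates$x), ]
  candidates[!duplicated(candidates$year), ]
}

# Simulate block-wise reading: 10 rows per chunk (an assumption for the demo)
chunks <- split(myData, ceiling(seq_len(nrow(myData)) / 10))

# Start from an empty data frame and fold the chunks in one at a time
runningMin <- Reduce(updateMin, chunks, myData[0, ])
row.names(runningMin) <- NULL
runningMin
```

At any point only one chunk plus one row per group is held in memory. Note that `rowId` is the position in the whole data set, whereas `which.min` in the split-based code returns the position within each group.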
There is one other solution, again assuming that each individual group fits into RAM: the RevoPemaR package.
library("RevoPemaR")

# Sort by the grouping variable before the by-group computation
rxSort(inData = xdfFile, outFile = xdfFile, sortByVars = "year", overwrite = TRUE)

byGroupPemaObj <- PemaByGroup()
minByYear <- pemaCompute(pemaObj = byGroupPemaObj, data = xdfFile,
                         groupByVar = "year", computeVars = "x",
                         fnList = list(
                           minPos = list(FUN = which.min, x = NULL)))
minByYear
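The question asks for the whole row rather than its position, and for groups defined by industry * year. As a rough in-memory illustration of what the per-group function could look like, here is a base R sketch on made-up data. The column names `industry` and `value` and the helper `minRow` are hypothetical, and nothing here touches an .xdf file. For the rxSplit approach, which splits on a single factor, a combined industry-by-year factor would presumably have to be created first.

```r
# Made-up data with two grouping factors (hypothetical column names)
set.seed(42)
toy <- data.frame(industry = factor(sample(c("A", "B", "C"), 60, replace = TRUE)),
                  year = factor(sample(2013:2015, 60, replace = TRUE)),
                  value = rnorm(60))

# Hypothetical per-group function: return the entire row holding the minimum
minRow <- function(dat, var) dat[which.min(dat[[var]]), , drop = FALSE]

# Split by every observed industry/year combination and apply the function
groups <- split(toy, list(toy$industry, toy$year), drop = TRUE)
minRows <- do.call(rbind, lapply(groups, minRow, var = "value"))
row.names(minRows) <- NULL
minRows
```

Because `minRow` returns a one-row data frame, any additional columns in the data are carried along with the minimum.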