r revoscaler

RevoScaleR rxDataStep rowSelection fails when using a variable


I am trying to select rows from an XDF file with rxDataStep. Using rowSelection works when I compare against an explicit value, but not when I compare against a variable. For example, this works:

tmp <- rxDataStep(alias.Xdf, transforms = list(TT_AMOUNT = DC_AMOUNT * RT_AMOUNT, UNIT_PRICE = RT_VALUE / TT_AMOUNT), varsToKeep = c('DC_AMOUNT', 'RT_AMOUNT', 'RT_VALUE'), 
            rowSelection = (A_ID == 1646041))

but this does not:

x <- 1646041
tmp <- rxDataStep(alias.Xdf, transforms = list(TT_AMOUNT = DC_AMOUNT * RT_AMOUNT, UNIT_PRICE = RT_VALUE / TT_AMOUNT), varsToKeep = c('DC_AMOUNT', 'RT_AMOUNT', 'RT_VALUE'), 
             rowSelection = (A_ID == x))

It fails with the message:

ERROR: The sample data set for the analysis has no variables.
Caught exception in file: CxAnalysis.cpp, line: 3848. ThreadID: 31156 Rethrowing.
Caught exception in file: CxAnalysis.cpp, line: 5375. ThreadID: 31156 Rethrowing.

What is wrong here? I've been struggling with this for hours and have tried every syntax I could find on the web. Thanks.


Solution

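  • The variable most likely needs to be passed in through the transformObjects argument, a named list whose entries can be referenced inside rowSelection. Here x is passed under the name x1: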

    library(RevoScaleR)
    x <- 1646041
    # x1 is the name visible inside rowSelection; its value comes from x
    tmp <- rxDataStep(alias.Xdf,
                      transforms = list(TT_AMOUNT = DC_AMOUNT * RT_AMOUNT,
                                        UNIT_PRICE = RT_VALUE / TT_AMOUNT),
                      varsToKeep = c('DC_AMOUNT', 'RT_AMOUNT', 'RT_VALUE'),
                      rowSelection = (A_ID == x1),
                      transformObjects = list(x1 = x))
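
    One way to picture why naming the value in transformObjects helps is the base R sketch below. It is only an illustration under the assumption that the selection expression is evaluated where workspace variables are not visible; it is not RevoScaleR's actual implementation, and select_rows and the toy data frame are hypothetical.

    ```r
    # Toy data standing in for the XDF file (hypothetical values).
    df <- data.frame(A_ID = c(1646040, 1646041, 1646042),
                     RT_VALUE = c(10, 20, 30))
    x <- 1646041

    # Hypothetical helper: evaluate a selection expression in an environment
    # that holds only the data columns plus explicitly supplied objects.
    # Its parent is baseenv(), so workspace variables are not visible.
    select_rows <- function(data, expr, objects = list()) {
      env <- list2env(c(as.list(data), objects), parent = baseenv())
      data[eval(expr, env), , drop = FALSE]
    }

    # Referring to x directly fails, because x exists only in the workspace.
    direct <- tryCatch(select_rows(df, quote(A_ID == x)),
                       error = function(e) conditionMessage(e))

    # Passing the value under a name makes it available to the expression.
    passed <- select_rows(df, quote(A_ID == x1), objects = list(x1 = x))
    ```

    In this sketch direct ends up holding an error message rather than a data frame, while passed holds the single matching row.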
    

    Here is a reproducible example on an in-memory data frame:

    set.seed(100)
    myData <- data.frame(x = 1:100, y = rep(c("a", "b", "c", "d"), 25),
                         z = rnorm(100), w = runif(100))

    z1 <- 2

    myDataSubset <- rxDataStep(inData = myData,
                               varsToKeep = c("x", "w", "z"),
                               rowSelection = z > zNew,
                               transformObjects = list(zNew = z1))
    #Rows Read: 100, Total Rows Processed: 100, Total Chunk Time: 0.007 seconds 
    myDataSubset
    #   x          w        z
    #1 20 0.03609544 2.310297
    #2 64 0.79408518 2.581959
    #3 96 0.07123327 2.445683
    

    For a data frame that fits in memory, the same filter can also be done with dplyr:

    library(dplyr)
    myData %>%
        select(x, w, z) %>%
        filter(z > z1)
    #   x          w        z
    #1 20 0.03609544 2.310297
    #2 64 0.79408518 2.581959
    #3 96 0.07123327 2.445683