revolution-r

rxDataStep using lagged values


In SAS it's possible to step through a dataset and use lagged values.

With rxDataStep, the way I would do it is to use a transform function that computes a "lag", but this would presumably produce a wrong value at the beginning of each chunk. For example, if a chunk starts at row 200,000, the function will produce an NA for the lagged value that should instead come from row 199,999.
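To make the problem concrete, here is a minimal base R sketch (no RevoScaleR needed). It assumes a naive lag helper, `lag1`, which is a hypothetical name made up for this illustration, and it mimics chunked processing by splitting a small vector into two pieces.

```r
# Naive lag: shift a vector down by one position, padding with NA.
# (lag1 is a hypothetical helper, not a RevoScaleR function.)
lag1 <- function(x) c(NA, head(x, -1))

x <- 1:6

# Lag computed over the whole vector at once
whole <- lag1(x)

# Lag computed independently on two chunks of three rows,
# mimicking a transform that is applied chunk by chunk
chunks <- split(x, rep(1:2, each = 3))
chunked <- unlist(lapply(chunks, lag1), use.names = FALSE)

print(whole)    # NA 1 2 3 4 5
print(chunked)  # NA 1 2 NA 4 5
```

The fourth element is the first row of the second chunk. It gets NA instead of 3, which is exactly the chunk-boundary error I'm worried about.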

Is there a solution for this?


Solution

  • Here's another approach for lagging: self-merging using a shifted date. This is dramatically simpler to code and can lag several variables at once. The downsides are that it takes 2-3 times longer to run than my answer using transformFunc, and that it requires a second copy of the dataset.

    # Get a sample dataset
    sourcePath <- file.path(rxGetOption("sampleDataDir"), "DJIAdaily.xdf")
    
    # Set up paths for two copies of it
    xdfPath <- tempfile(fileext = ".xdf")
    xdfPathShifted <- tempfile(fileext = ".xdf")
    
    
    # Convert "Date" to be Date-classed
    rxDataStep(inData = sourcePath,
               outFile = xdfPath,
               transforms = list(Date = as.Date(Date)),
               overwrite = TRUE
    )
    
    
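    # Then make the second copy, but shift all the dates forward by
    # one day (or however much you want to lag).
    # Note: this lags by calendar day, so a row only gets a lagged
    # value if a row dated exactly one day earlier exists in the data.
    # Use varsToKeep to subset to just the date and 
    # the variables you want to lag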
    rxDataStep(inData = xdfPath,
               outFile = xdfPathShifted,
               varsToKeep = c("Date", "Open", "Close"),
               transforms = list(Date = as.Date(Date) + 1),
               overwrite = TRUE
    )
    
    # Create an output XDF (or just overwrite xdfPath)
    xdfLagged2 <- tempfile(fileext = ".xdf")
    
    # Use that incremented date to merge variables back on.
    # duplicateVarExt will automatically tag variables from the 
    # second dataset as "Lagged".
    # Note that there's no need to sort manually in this one - 
    # rxMerge does it automatically.
    rxMerge(inData1 = xdfPath,
            inData2 = xdfPathShifted,
            outFile = xdfLagged2,
            matchVars = "Date",
            type = "left",
            duplicateVarExt = c("", "Lagged")
    )
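If it helps to see the idea without XDF files, here is a rough base R sketch of the same shift-the-key-and-merge technique on a toy data frame. The data values and the `CloseLagged` column name are made up for illustration, and `merge()` stands in for `rxMerge`; this is not the RevoScaleR code path, just the same logic in memory.

```r
# Toy data: three consecutive days plus one date after a gap
# (hypothetical values)
df <- data.frame(
  Date  = as.Date(c("2020-01-01", "2020-01-02", "2020-01-03", "2020-01-06")),
  Close = c(10, 11, 12, 13)
)

# Second copy: keep only the key and the variable to lag,
# then shift the key forward by one day
shifted <- data.frame(Date = df$Date + 1, CloseLagged = df$Close)

# Left join on Date: each row picks up the value whose
# original date was exactly one day earlier
lagged <- merge(df, shifted, by = "Date", all.x = TRUE)
lagged <- lagged[order(lagged$Date), ]

print(lagged)
```

Because the join is on the date itself rather than on row position, chunk boundaries don't matter. The flip side shows up in the last row: 2020-01-06 gets NA because there is no row for 2020-01-05, so with gappy dates you would need to shift on a row index or similar key instead.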