rff

vlookup method for ffdf object in R


I have an ffdf object called 'data' with over 26 million rows that looks like this:

       Location                  DateandTime     Value
1             1          01/01/2012 00:00:00     0.8              
2            42          01/01/2012 00:00:00     0.4             
3            14          01/01/2012 00:00:00     0.7              
4            21          01/01/2012 00:00:00     0.2   

I would like to add a fourth column of data called 'Group' based on the values in a smaller normal data frame 'lookup' that looks like this :

       Location      Group
1             1          1             
2             2          2         
3             3          8          
4             4          7  

So I want the new column in 'data' to have the lookup$Group values. I know this can be done with vlookup in excel, and I have found functions that can work with normal df in R such as addNewData.r. But how can this be done specifically for ffdf objects?


Solution

  • There are lots of ways that you can do this type of join.

    In R you can use merge or you can use SQL via the package sqldf, just to name a couple. Here's an example:

    require(ff)
    
    mydf <- data.frame(Location = seq(1:10), 
                     DateandTime =  seq(as.Date(Sys.Date()), by="days", 
                                        length=10),
                     Value = rnorm(10))
    
    
    lookup <- data.frame(Location = seq(1:10),
                      Group = seq(20,29))
    lookup
    
    mydf <- as.ffdf(mydf) # you can make them both ffdf or just one and it still works
    
    df2 <- merge(mydf,lookup, by = "Location")
    df2
    
       Location DateandTime      Value Group
    1         1  2016-06-26  0.6229381    20
    2         2  2016-06-27  1.0009087    21
    3         3  2016-06-28  1.1993809    22
    4         4  2016-06-29  0.8809430    23
    5         5  2016-06-30 -0.4233689    24
    6         6  2016-07-01 -0.7101273    25
    7         7  2016-07-02  0.4404004    26
    8         8  2016-07-03  1.5120004    27
    9         9  2016-07-04  0.5564032    28
    10       10  2016-07-05  0.4839012    29
    

    On a minor side note it's a best practice not to name your data "data" because, asides from being confusing, there's a function named data that's loaded to the global environment by default.