Tags: r, dplyr, standard-evaluation

Why do I get different results using SE or NSE dplyr functions?


Hi, I get different results from dplyr functions when I use standard evaluation through the lazyeval package.

Here is how to reproduce something close to my real data, which has 250k rows and about 230k groups. I would like to group by id1 and id2, then keep the row with the max(datetime) in each group.

library(dplyr)
# random datetime generation function by Dirk Eddelbuettel
# http://stackoverflow.com/questions/14720983/efficiently-generate-a-random-sample-of-times-and-dates-between-two-dates
rand.datetime <- function(N, st = "2012/01/01", et = "2015/08/13") {
  st <- as.POSIXct(as.Date(st))
  et <- as.POSIXct(as.Date(et))
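  # total span in seconds between the start and end dates
  dt <- as.numeric(difftime(et, st, units = "secs"))
  # sorted random offsets, in seconds, from the start date
  ev <- sort(runif(N, 0, dt))
  st + ev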
}

set.seed(42)
# Create 230000 (id1, id2) pairs
ids <- data_frame(id1 = stringi::stri_rand_strings(23e4, 9, pattern = "[0-9]"), 
                  id2 = stringi::stri_rand_strings(23e4, 9, pattern = "[0-9]"))
# Randomly repeat rows of ids[1:2000, ] so that some groups have several rows
ids <- rbind(ids, ids[sample(1:2000, 20000, replace = TRUE), ])
datas <- mutate(ids, datetime = rand.datetime(25e4))

When I use the NSE version, I get 230000 rows:

df1 <- 
  datas %>% 
  group_by(id1, id2) %>% 
  filter(datetime == max(datetime))
nrow(df1) #230000

But when I use the SE version, I get only 229977 rows:

ids <- c("id1", "id2")
filterVar <- "datetime"
filterFun <- "max"
df2 <- 
  datas %>% 
  group_by_(ids) %>% 
  filter_(.dots = lazyeval::interp(~var == fun(var), 
                                   var = as.name(filterVar), 
                                   fun = as.name(filterFun)))
nrow(df2) #229977

My two pieces of code are equivalent, right? Why do I get different results? Thanks.


Solution

  • You'll need to specify the .dots argument in group_by_ when giving a vector of column names.

    df2 <- datas %>% 
        group_by_(.dots = ids) %>% 
        filter_(.dots = lazyeval::interp(~var == fun(var), 
                                   var = as.name(filterVar), 
                                   fun = as.name(filterFun)))
    nrow(df2)
    [1] 230000
    

    It looks like group_by_ takes only the first column name from the vector as the grouping variable when you don't pass the vector through the .dots argument. You can check this by grouping on id1 only, which gives the same 229977 rows as your SE code.

    df1 <- datas %>% 
        group_by(id1) %>% 
        filter(datetime == max(datetime))
    nrow(df1)
    [1] 229977
    

    (If you group on id2 only, the number of rows is 229976.)
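
To make the effect easier to see, here is a minimal sketch on a hypothetical four-row data frame (`toy` and its values are made up for illustration and are not part of the question's data). It shows that grouping on the first key alone merges groups that share an id1, which drops rows from the filtered result. The last part is one possible way to group from a character vector without the underscored verbs, assuming dplyr 1.0 or later where `across()` and `all_of()` are available. As a rough sanity check on the original numbers: with 230000 random 9-digit id1 values, about 230000^2 / (2 * 10^9), or roughly 26, colliding pairs would be expected, which is in line with the 23 missing rows.

```r
library(dplyr)

# Toy data (hypothetical): id1 "a" is shared by two different id2 values.
toy <- tibble(
  id1 = c("a", "a", "b", "b"),
  id2 = c("x", "y", "z", "z"),
  datetime = as.POSIXct("2015-01-01", tz = "UTC") + c(1, 2, 3, 4)
)

# Grouping on both keys gives three groups: (a, x), (a, y), (b, z).
both <- toy %>%
  group_by(id1, id2) %>%
  filter(datetime == max(datetime))

# Grouping on id1 alone merges (a, x) and (a, y), so one row is lost.
first_only <- toy %>%
  group_by(id1) %>%
  filter(datetime == max(datetime))

# Programmatic grouping from a character vector (assumes dplyr >= 1.0).
ids <- c("id1", "id2")
filterVar <- "datetime"
filterFun <- "max"
programmatic <- toy %>%
  group_by(across(all_of(ids))) %>%
  filter(.data[[filterVar]] == match.fun(filterFun)(.data[[filterVar]]))

group_vars(programmatic)  # "id1" "id2"
nrow(both)                # 3
nrow(first_only)          # 2
nrow(programmatic)        # 3
```

Calling `group_vars()` on the grouped data is a quick way to confirm which columns were actually used for grouping before trusting the row count.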