Tags: r, data.table, custom-function

Data.table with a custom function


I am new to data.table, coming from dplyr. I have the following custom function tabs:

tabs <- function(dt, x) {
tab2 <- dt[!is.na(x), ][, .(Freq = sum(nwgt0)), by = .(inc_cat, year, x)][, Prop := Freq / sum(Freq), by= .(inc_cat, year)][order(inc_cat, year)][x == 1 & !is.na(inc_cat), ] %>%
   ggplot(., aes(x= year, y = Prop, color = factor(inc_cat, levels = c(1,2,3,4),labels = c("0% to 100% FPL", "101-138% FPL", "139-200% FPL", ">200% FPL")))) +
    labs(color = "Income Categories") +
    geom_line() +
    theme_minimal() +
  ylab("Weighted proportion") +
   theme(
  panel.border = element_blank(),
  panel.grid.major = element_blank(),
  panel.grid.minor = element_blank(),
  )
return(tab2)
}

I now wish to call the function tabs.

I have tried the following (does not work):

result <- hints_dt[ , tabs(.SD, x='internet_use')]

And receive the following error:

Error in `[.data.table`(dt[!is.na(x), ], , .(Freq = sum(nwgt0)), by = .(inc_cat,  : 
  The items in the 'by' or 'keyby' list are length(s) (22344,22344,1). Each must be length 22344; the same length as there are rows in x (after subsetting if i is provided).

Should I be using .SDcols to specify the column internet_use? If so, how do I modify my function?

Thanks,

Felippe

EDIT: per comments below, I include a reprex here. Using data from NHANES data("nhanes") I adapted the function tabs:

tabs <- function(dt, x) {
tab2 <- dt[!is.na(x), ][, .(Freq = sum(WTMEC2YR)), by = .(race, agecat, x)][, Prop := Freq / sum(Freq), by= .(race, agecat)][order(race, agecat)][x == 1 & !is.na(race), ] %>%
   ggplot(., aes(x= year, y = Prop, color = factor(race, levels = c(1,2,3,4),labels = c("hispanic", "white", "black", "other")))) +
    labs(color = "Race") +
    geom_line() +
    theme_minimal() +
  ylab("Weighted proportion") +
   theme(
  panel.border = element_blank(),
  panel.grid.major = element_blank(),
  panel.grid.minor = element_blank(),
  )
return(tab2)
}

When I run result <- nhanes[ , tabs(.SD, x="RIAGENDR")], I can reproduce my error:

Error in `[.data.table`(dt[!is.na(get(x)), ], , .(Freq = sum(WTMEC2YR)),  : 
  The items in the 'by' or 'keyby' list are length(s) (8591,8591,1). Each must be length 8591; the same length as there are rows in x (after subsetting if i is provided).

Solution

  • get(x) works fine for referring to a column by name inside i and j, for example in a row filter:

    MT <- as.data.table(mtcars)
    fun <- function(DT, v) DT[!(get(v) == 4),]
    fun(MT, "cyl") # WORKS
    

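    But a bare x inside the non-standard evaluation (NSE) form by=.(...) will not work: it is evaluated as the length-1 character string you passed in, not as the column it names, which is what the "length(s) (22344,22344,1)" part of the error is reporting.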

    Note: for the sake of this argument, I'm mimicking your code by hard-coding part of the by-grouping inside the function. If the function will only ever be used with one specific dataset, this is often fine, but if you try to generalize it, you should "never" assume that particular fields exist in other data.

    fun2 <- function(DT, v, by) DT[, lapply(.SD, sum), .SDcols = v, by = .(gear, by)][]
    fun2(MT, v="disp", by="cyl")
    # Error in `[.data.table`(DT, , lapply(.SD, sum), .SDcols = v, by = .(gear,  : 
    #   The items in the 'by' or 'keyby' list are length(s) (32,1). Each must be length 32; the same length as there are rows in x (after subsetting if i is provided).
    

    We can use get(by) within the NSE by= as well:

    fun2 <- function(DT, v, by) DT[, lapply(.SD, sum), .SDcols = v, by = .(gear, get(by))][]
    fun2(MT, v="disp", by="cyl") # works
    

    But this may not always work. In these situations it helps to recall that by= accepts either the NSE form that you're using or a character vector:

    fun2 <- function(DT, v, by) DT[, lapply(.SD, sum), .SDcols = v, by = c("gear", by)][]
    fun2(MT, v="disp", by="cyl") # works
    

    This uses by=c(..) instead of by=.(..). A similar character-vector form is accepted by on= for inequality (non-equi) joins, where data.table parses and evaluates the strings internally, such as on=c("gear", paste(v, ">", otherv)) (assuming we have another variable otherv for the join-comparison).
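For the join case, the string form of the condition goes in on=. Below is a minimal sketch with two made-up tables; A, B, lo, and val are hypothetical names used only for illustration, and v and otherv hold column names as strings.

```r
library(data.table)

# Made-up tables for illustration only.
A <- data.table(id = 1:3, lo = c(1, 5, 10))
B <- data.table(val = c(2, 7))

v      <- "lo"   # column in A (the x table)
otherv <- "val"  # column in B (the i table)

# Build the condition as a string: "lo<val".
# For each row of B, keep the rows of A whose lo is below that val.
cond <- paste0(v, "<", otherv)
res  <- A[B, on = cond, nomatch = NULL]

nrow(res)  # one match for val = 2, two matches for val = 7
```

Note that in a non-equi join the joined column in the result carries the values from i, so it is usually easier to check the result by row count or by the other columns.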

    From here, whatever else you do in the rest of the function should attempt to do the same thing: use v as a character vector.
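As a rough sketch of how this could carry over to the aggregation part of tabs: the toy table below only borrows the column names from the question (its values are made up), tab_props is a hypothetical name, and the plotting step is left out. It assumes the data has no column literally named x.

```r
library(data.table)

# Toy stand-in for the asker's data; values are invented for illustration.
toy <- data.table(
  inc_cat      = c(1, 1, 1, 2, 2, 2),
  year         = c(2019, 2019, 2019, 2019, 2019, 2019),
  internet_use = c(1, 0, NA, 1, 1, 0),
  nwgt0        = c(10, 30, 5, 20, 20, 40)
)

# x is a column name passed as a string.
tab_props <- function(dt, x) {
  # get(x) resolves the string to the column inside i;
  # by= takes a character vector, so x can simply be appended.
  out <- dt[!is.na(get(x)), .(Freq = sum(nwgt0)),
            by = c("inc_cat", "year", x)]
  # Weighted share of each x level within inc_cat/year.
  out[, Prop := Freq / sum(Freq), by = c("inc_cat", "year")]
  out[get(x) == 1 & !is.na(inc_cat)][order(inc_cat, year)]
}

res <- tab_props(toy, "internet_use")
res
```

The returned table can then be handed to ggplot as in the question; calling tab_props(hints_dt, "internet_use") directly should be enough, without wrapping it in hints_dt[, ...].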

    Note that I have set up fun2 so that v can be length-1 or more.
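To illustrate the length-2 case, here is the same character-vector version of fun2 called with two value columns (a small sketch on mtcars):

```r
library(data.table)

MT <- as.data.table(mtcars)

# Same definition as the character-vector version above.
fun2 <- function(DT, v, by) {
  DT[, lapply(.SD, sum), .SDcols = v, by = c("gear", by)][]
}

# v now names two columns; both are summed within each gear/cyl group.
res <- fun2(MT, v = c("disp", "hp"), by = "cyl")
names(res)  # "gear" "cyl" "disp" "hp"
```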