rdataframedata.table

Find columns that are constant by groups


Given a data frame (or a data.table object), say data, I want to find the columns which are constant in each of the groups defined by a factor given as a column of data. For numeric columns, a "nearly constant" one should be regarded as constant. A naive implementation is in the constant_by function below

is_constant <- function(x, eps = 1e-3) {
    if (any(is.na(x))) return(all(is.na(x)))
    else if (is.numeric(x)) return(diff(range(x)) < eps) 
    else return(length(unique(x)) == 1) 
}
constant_by <- function(data, by, eps = 1e-3) {
    is_constant_by <- function(x) {
        tapply(x, data[ , by], is_constant, eps = eps)
    }
    sapply(data, function(col) all(is_constant_by(col)))
}

The function is_constant could be made faster, but more importantly the package data.table could make things much faster.

Here is an example involving seismic data(caution: about 12MO). The data contains variables related to the event (earthquake) and variables related to the seismic station.

library(data.table)
temp <- tempfile()
download.file("https://shake.mi.ingv.it/ita18-flatfile/ITA18_SA_flatfile.zip", temp)
Ita18 <- fread(unzip(temp, "ITA18_SA_flatfile.csv"))
unlink(temp)
## 'station_code' refers to a specific network, so make a proper 'id'
Ita18 <- Ita18[ , station_id := paste(network_code, station_code, sep = "_")]
## naive solution
st <- system.time({
    res <- constant_by(as.data.frame(Ita18), by = "station_id")
})
## attempt with data.table. Could be transformed into a function
st2 <- system.time({
    res2 <- list()
    for (col in names(Ita18)) {
        res2[[col]] <- all(Ita18[ , .(test = is_constant(get(col))),
                                 by = "station_id"]$test)
    }
})
all.equal(res, unlist(res2))
rbind(st, st2)

My data.table attempt is disappointingly slow. I believe that the loop should be "inside" the data table by giving a list to the j index, but I could not make it work.


Solution

  • Minor modifications and your task runs in less than a second:

    is_constant2 <- function(x, eps = 1e-3) {
      if (anyNA(x))           all(is.na(x))
      else if (is.numeric(x)) (max(x) - min(x)) < eps
      else                    length(unique(x)) == 1
    }
    
    res3 <- Ita18[, lapply(.SD, is_constant2), by = station_id] |>
      _[, lapply(.SD, all), .SDcols = !"station_id"]
    
    identical(res2[names(res2) != "station_id"], as.list(res3))
    # [1] TRUE