Given a data frame (or a data.table
object), say data
, I want to
find the columns which are constant in each of the groups defined by a
factor given as a column of data
. For numeric columns, a "nearly
constant" one should be regarded as constant. A naive implementation
is in the constant_by
function below
is_constant <- function(x, eps = 1e-3) {
if (any(is.na(x))) return(all(is.na(x)))
else if (is.numeric(x)) return(diff(range(x)) < eps)
else return(length(unique(x)) == 1)
}
constant_by <- function(data, by, eps = 1e-3) {
is_constant_by <- function(x) {
tapply(x, data[ , by], is_constant, eps = eps)
}
sapply(data, function(col) all(is_constant_by(col)))
}
The function is_constant
could be made faster, but more importantly
the package data.table could make things much faster.
Here is an example involving seismic data(caution: about 12MO). The data contains variables related to the event (earthquake) and variables related to the seismic station.
library(data.table)
temp <- tempfile()
download.file("https://shake.mi.ingv.it/ita18-flatfile/ITA18_SA_flatfile.zip", temp)
Ita18 <- fread(unzip(temp, "ITA18_SA_flatfile.csv"))
unlink(temp)
## 'station_code' refers to a specific network, so make a proper 'id'
Ita18 <- Ita18[ , station_id := paste(network_code, station_code, sep = "_")]
## naive solution
st <- system.time({
res <- constant_by(as.data.frame(Ita18), by = "station_id")
})
## attempt with data.table. Could be transformed into a function
st2 <- system.time({
res2 <- list()
for (col in names(Ita18)) {
res2[[col]] <- all(Ita18[ , .(test = is_constant(get(col))),
by = "station_id"]$test)
}
})
all.equal(res, unlist(res2))
rbind(st, st2)
My data.table attempt is disappointingly slow. I believe that the
loop should be "inside" the data table by giving a list to the j
index,
but I could not make it work.
Minor modifications and your task runs in less than a second:
is_constant2 <- function(x, eps = 1e-3) {
if (anyNA(x)) all(is.na(x))
else if (is.numeric(x)) (max(x) - min(x)) < eps
else length(unique(x)) == 1
}
res3 <- Ita18[, lapply(.SD, is_constant2), by = station_id] |>
_[, lapply(.SD, all), .SDcols = !"station_id"]
identical(res2[names(res2) != "station_id"], as.list(res3))
# [1] TRUE