I am dealing with time-series data for different individuals in a wide format. The number of time points differs between individuals. What I need is the last element for each individual.
I was thinking of using a list column in my tibble to store the time-series sequences. (Putting each time point into a separate column is probably not a good idea: there can be hundreds of possible time points, while an individual may have data for only a handful of them. The data per individual are, however, always measured at consecutive time points.)
Let's call it column1, i.e.:
library(tibble)

# Create an example dataframe
df <- tibble(
  column1 = list(1:3, 1:4, 4:8)
)
Now, I would like to use vectorization for the sake of efficiency and speed, but is that even possible with the given data structure? There is the function map() in the purrr package, with which the operation would go like this:
library(purrr)
# Use the map function to select the last element of each vector
last_elements <- map(df$column1, ~ .x[length(.x)])
But this is not vectorization, but rather looping through the elements of the list (stored as column1), right?
Would there be a better (i.e. faster / more efficient) choice for a data structure than a list as a column? Or is this in general the best way to handle this kind of situation?
It is important to understand that vectorization in R means that you do not need to write a loop, because somebody else has already written the loop for you. The important part is that, eventually, there will be some sort of loop.
The reason why relying on off-the-shelf vectorization is often faster than writing the loop yourself is that in the former case the loop runs at C level, which can be faster. But sometimes the difference is not even noticeable:
library(microbenchmark)

x <- seq_len(10000000L)
microbenchmark(
  vec = sum(x),                                                 # loop happens at C level
  lp  = {sm <- 0; for (i in seq_along(x)) sm <- sm + x[i]; sm}  # explicit R-level loop
)
# Unit: nanoseconds
#  expr       min        lq      mean    median        uq       max neval cld
#   vec       100       200      2745       350      4450     19300   100  a
#    lp 327515400 340921350 367242236 351582100 383396050 603997400   100   b
Looks like a lot, but be aware of the unit: the explicit loop finishes in about 0.32 sec, while the vectorized sum takes about 0.0000001 sec. (The gap is so extreme here partly because sum() on a seq_len() result can exploit R's ALTREP compact-sequence representation; even without that shortcut, the C-level loop would still win by orders of magnitude.) In relative terms this is a huge improvement, but in absolute terms you will barely see a difference.
The bottom line is that you should not be afraid of R-level loops, because they are far better than their reputation. If there is a natural replacement for a loop, go for it by all means, but do not hesitate to use a loop when the price for avoiding it is an over-bloated data structure.
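For instance, the task from the question can be handled with an explicit loop just fine; a minimal sketch, using the df defined above:
last_elements <- integer(nrow(df))       # preallocate the result
for (i in seq_along(df$column1)) {
  v <- df$column1[[i]]
  last_elements[i] <- v[length(v)]       # last element of the i-th series
}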
Now to your problem at hand.
You could transform your data into long format and take the last row per individual. You can go even further and use data.table with an appropriate key:
library(microbenchmark)
library(tibble)
library(tidyr)
library(dplyr)
library(purrr)
library(data.table)

# Simulate individuals with a varying number of measurements each
create_data <- function(nr_indiv, max_per_indiv) {
  l <- replicate(nr_indiv, sample(10000L, sample(max_per_indiv, 1L)))
  tibble(id = seq_len(nr_indiv), load = l)
}

set.seed(20022024)
list_data <- create_data(10000, 1000)
lng_data <- unnest(list_data, cols = load)
# Reverse row index within each id (ord == 1 marks the last row);
# the `flt` variant in the benchmark below relies on this column
lng_data <- mutate(lng_data, ord = rev(row_number()), .by = id)
dt_data <- copy(lng_data)
setDT(dt_data)
setkey(dt_data, "id")
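As an aside, the key can also be exploited directly via a join: with mult = "last", data.table returns only the last matching row per id. A minimal sketch, using the dt_data just created:
# Keyed join on the distinct ids; mult = "last" keeps only the final row per id
dt_data[.(unique(dt_data$id)), mult = "last"]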
# Sanity check: all benchmark variants must return the same last elements
my_check <- function(x) {
  isTRUE(all.equal(x$lst, x$lng %>% pull(load), check.attributes = FALSE)) &&
    isTRUE(all.equal(x$lst, x$dt[, load], check.attributes = FALSE)) &&
    isTRUE(all.equal(x$lst, c(x$ta), check.attributes = FALSE)) &&
    isTRUE(all.equal(x$lst, x$base, check.attributes = FALSE)) &&
    isTRUE(all.equal(x$lst, x$flt %>% pull(load), check.attributes = FALSE))
}
microbenchmark(
  lst  = map_int(list_data$load, tail, n = 1L),             # purrr loop over the list column
  lng  = slice_tail(lng_data, n = 1L, by = id),             # dplyr on the long format
  dt   = dt_data[, .(load = tail(load, 1L)), "id"],         # data.table grouping
  ta   = tapply(lng_data$load, lng_data$id, tail, n = 1L),  # base R on the long format
  base = vapply(list_data$load, tail, integer(1L), n = 1L), # base R loop over the list column
  flt  = lng_data %>% filter(ord == 1L),                    # filter on the precomputed index
  check = my_check
)
# Unit: milliseconds
#  expr      min        lq       mean     median         uq       max neval cld
#   lst  70.5403  77.40405   93.87981   87.87985  101.53635  280.5060   100  a
#   lng 916.2937 971.14420 1053.53934 1004.19125 1133.12665 1434.2933   100   b
#    dt  18.4768  19.47870   28.12581   20.64565   37.49055   64.8045   100    c
#    ta 315.0710 357.06000  407.04557  368.03110  453.28390  687.4518   100     d
#  base  65.5528  75.40010   85.31279   85.87180   94.14265  116.5428   100  a
#   flt  44.4467  47.46310   55.71809   50.85525   57.62820  244.5270   100      e
The tidyverse long-format solution is worse than the purrr loop; the data.table long-format solution is better (not accounting for the overhead of creating the data structure); the filter method lies in between; and the base loop is on par with purrr, so even the slight overhead of purrr is negligible.
There will be even more solutions, but from this first quick test it seems to me that the list solution is not that bad overall and, in my opinion equally (if not more) importantly, quite easy to read and understand.
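Applied back to the df from the question, the list-column variants boil down to one-liners, for instance:
# purrr: the last element of each vector in the list column
last_elements <- map_int(df$column1, tail, n = 1L)

# base R equivalent (no extra package needed)
last_elements <- vapply(df$column1, tail, integer(1L), n = 1L)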
TL;DR: Don't be afraid of loops; if there is no straightforward, natural vectorization method, use them by all means.