I've got timestamped data recording how many observations of a certain type were made at a certain time. For the sake of illustration, suppose I'm a birdwatcher observing flocks of duck and geese, recording -- each time I observe a flock -- the time of observation, the number of birds in the flock, and the type of bird.
What I would like to do is aggregate this data and record how many birds of each type were observed, in total, in each 30-minute period. I've found the tsibble
package from tidyverts
(not to be confused with tidyverse
); tsibbles are (basically) time-series tibbles and can be grouped (e.g. by bird) as well as re-indexed.
Indices in tsibbles must be unique (that's timestamps in my case); that is understandable. However, this is enforced before it's possible to use group_by()
. If I happen to observe both ducks and geese at exactly the same time, I'm apparently out of luck.
Here is some sample code showing the problem:
library(tidyverse) # includes lubridate
library(tsibble)
N <- 100
set.seed(42)
# suppose we're observing ducks and geese between 8:00 and 18:00.
d <- as_datetime("2023-03-08 08:00:00")
times <- d + seconds(unique(round(sort(runif(N, min = 0, max = 36e3)))))
nObs <- 1 + rpois(length(times), lambda = 1)
birdIdx <- 1 + round(runif(length(times)))
birds <- c("Duck", "Goose")[birdIdx]
# Tibble of observations
waterfowl <- tibble(Timestamp = times, Count = nObs, Bird = as_factor(birds))
# Convert to tsibble (time series tibble) and aggregate on a 30-minute basis
waterfowl |>
as_tsibble(index = Timestamp) |>
group_by(Bird) |>
index_by(~ floor_date(., "30 minute")) |>
summarize(`Total birds` = sum(Count)) |>
print(n = Inf)
# Let's create a collision by observing both ducks and geese at the same time.
waterfowl |>
bind_rows(tibble(Timestamp = times[[1]], Count = 1, Bird = c("Duck", "Goose")[3 - birdIdx[[1]]])) |>
as_tsibble(index = Timestamp) |>
group_by(Bird) |>
index_by(~ floor_date(., "30 minute")) |>
summarize(`Total birds` = sum(Count)) |>
print(n = Inf)
The first conversion runs fine since there are, in fact, no duplicates in the observation timestamps; the unique()
when creating the times
vector guarantees this.
The second conversion where I deliberately introduce a duplicate timestamp does not, and instead produces
Error in `validate_tsibble()`:
! A valid tsibble must have distinct rows identified by key and index.
ℹ Please use `duplicates()` to check the duplicated rows.
Is there a way around this? How do I handle this in an efficient manner, one that I can use when I have tens of millions of rows and several grouping variables with many levels (and with many combinations of levels not occurring)?
Thank you!
The {tsibble}
package is strict when it comes to measurements at points in time. Each series should be uniquely identified in time, if there are duplicate time points it is like saying we at exactly 2023-03-08 08:04:44
there were 2 geese and 3 geese. It is unclear which is accurate, but it isn't both 2 and 3 geese at that point in time. Your specific example is relatively easy to fix, as you are observing two different series - the number of geese and the number of ducks. You can identify each series by key
variables.
# as above from MRE
# Let's create a collision by observing both ducks and geese at the same time.
waterfowl_tsbl <- waterfowl |>
bind_rows(tibble(Timestamp = times[[1]], Count = 1, Bird = c("Duck", "Goose")[3 - birdIdx[[1]]])) |>
as_tsibble(index = Timestamp, regular = FALSE, key = Bird)
waterfowl_tsbl
#> # A tsibble: 100 x 3 [!] <UTC>
#> # Key: Bird [2]
#> Timestamp Count Bird
#> <dttm> <dbl> <chr>
#> 1 2023-03-08 08:00:09 1 Duck
#> 2 2023-03-08 08:04:44 4 Duck
#> 3 2023-03-08 08:22:28 4 Duck
#> 4 2023-03-08 08:23:22 3 Duck
#> 5 2023-03-08 08:49:28 2 Duck
#> 6 2023-03-08 08:51:22 1 Duck
#> 7 2023-03-08 08:53:59 2 Duck
#> 8 2023-03-08 09:23:14 2 Duck
#> 9 2023-03-08 09:24:17 2 Duck
#> 10 2023-03-08 09:34:45 2 Duck
#> # ... with 90 more rows
Created on 2023-03-09 with reprex v2.0.2
Note that I've used the Bird
column as the key
variable, and I've set the time interval to be irregular as you have irregular observations (important for handling implicit missing values and producing forecasts).
If you were more interested in the total geese and ducks over 30 minute intervals, you can now summarise the tsibble as you have done above:
waterfowl_tsbl |>
group_by(Bird) |>
index_by(Time = floor_date(Timestamp, "30 minute")) |>
summarize(`Total birds` = sum(Count)) |>
print(n = Inf)
#> # A tsibble: 36 x 3 [30m] <UTC>
#> # Key: Bird [2]
#> Bird Time `Total birds`
#> <chr> <dttm> <dbl>
#> 1 Duck 2023-03-08 08:00:00 12
#> 2 Duck 2023-03-08 08:30:00 5
#> 3 Duck 2023-03-08 09:00:00 4
#> 4 Duck 2023-03-08 09:30:00 4
#> 5 Duck 2023-03-08 10:00:00 4
#> 6 Duck 2023-03-08 10:30:00 6
#> 7 Duck 2023-03-08 11:00:00 8
#> 8 Duck 2023-03-08 11:30:00 10
#> 9 Duck 2023-03-08 12:00:00 1
#> 10 Duck 2023-03-08 12:30:00 6
#> 11 Duck 2023-03-08 13:00:00 6
#> 12 Duck 2023-03-08 13:30:00 5
#> 13 Duck 2023-03-08 14:00:00 6
#> 14 Duck 2023-03-08 14:30:00 3
#> 15 Duck 2023-03-08 15:00:00 5
#> 16 Duck 2023-03-08 15:30:00 4
#> 17 Duck 2023-03-08 16:00:00 17
#> 18 Duck 2023-03-08 16:30:00 1
#> 19 Duck 2023-03-08 17:00:00 13
#> 20 Duck 2023-03-08 17:30:00 5
#> 21 Goose 2023-03-08 08:00:00 8
#> 22 Goose 2023-03-08 09:00:00 6
#> 23 Goose 2023-03-08 09:30:00 2
#> 24 Goose 2023-03-08 10:00:00 6
#> 25 Goose 2023-03-08 10:30:00 5
#> 26 Goose 2023-03-08 11:30:00 5
#> 27 Goose 2023-03-08 12:00:00 5
#> 28 Goose 2023-03-08 12:30:00 2
#> 29 Goose 2023-03-08 13:00:00 3
#> 30 Goose 2023-03-08 13:30:00 2
#> 31 Goose 2023-03-08 14:00:00 4
#> 32 Goose 2023-03-08 14:30:00 10
#> 33 Goose 2023-03-08 15:00:00 7
#> 34 Goose 2023-03-08 15:30:00 1
#> 35 Goose 2023-03-08 17:00:00 7
#> 36 Goose 2023-03-08 17:30:00 4
Created on 2023-03-09 with reprex v2.0.2
If you do indeed have an invalid tsibble (different counts of geese at the same point in time for instance), you should carefully fix this issue before converting the data to a tsibble.