[SOLVED] how to aggregate data that occured within the same time window?

how to aggregate data that occured within the same time window?

I have the following dataset:

tier	value	begin_ms	end_ms	reaction
ortho	is new	262432	362232	5
words	is	262432	263000	30
metric	A	262432	263000	30
words	new	263000	362232	25
metric	B	263000	362232	25

I was trying to create a new data frame in a more tidy fashion, in which I would have each occurrence of the column ortho and the occurences within the same begin_ms and end_ms as columns. I tried to use

data_spread <- spread(dfgs_final, key = tier, value = value)

but it only partially worked, looking like this:

ortho	begin_ms	end_ms	words	metric	reaction
is new	262432	362232			5
	262432	263000	is	A	30
	263000	362232	new	B	25

Is there a way to group everything that is within the begin_ms and end_ms of the ortho column? I have something like this in mind:

ortho	begin_ms	end_ms	words	metric	reaction
is new	262432	362232			5
is new	262432	263000	is	A	30
is new	263000	362232	new	B	25

Solution

FYI, spread has been retired/superseded since Aug 2019 (4.5 years ago) and its replacement pivot_wider is much more powerful, I suggest you migrate to that. The equivalent code is here, plus I'm adding an id field for the second code block.

library(dplyr)
library(tidyr)
wide <- pivot_wider(quux, id_cols = c("begin_ms", "end_ms", "reaction"),
                    names_from = "tier", values_from = "value") |>
  mutate(id = row_number())
wide
# # A tibble: 3 × 7
#   begin_ms end_ms reaction ortho  words metric    id
#      <int>  <int>    <int> <chr>  <chr> <chr>  <int>
# 1   262432 362232        5 is new NA    NA         1
# 2   262432 263000       30 NA     is    A          2
# 3   263000 362232       25 NA     new   B          3

From here, we can do a range-based join on the rows with is.na(ortho) on those that are not NA, reassign ortho, and combine back with the data.

filter(wide, is.na(ortho)) |>
  right_join(filter(wide, !is.na(ortho)),
             join_by(between(x$begin_ms, y$begin_ms, y$end_ms),
                     between(x$end_ms, y$begin_ms, y$end_ms)),
             suffix = c("", ".y")) |>
  mutate(ortho = ortho.y[id.y]) |>
  select(id, ortho) |>
  right_join(wide, by = "id", suffix = c(".y", "")) |>
  mutate(ortho = coalesce(ortho, ortho.y)) |>
  select(-ortho.y) |>
  arrange(id)
# # A tibble: 3 × 7
#      id begin_ms end_ms reaction ortho  words metric
#   <int>    <int>  <int>    <int> <chr>  <chr> <chr> 
# 1     1   262432 362232        5 is new NA    NA    
# 2     2   262432 263000       30 is new is    A     
# 3     3   263000 362232       25 is new new   B

In the end, it might be preferred for you to have unique id fields for each row; if you want to retain the "parent" id, we can do that with little change.

filter(wide, is.na(ortho)) |>
  right_join(filter(wide, !is.na(ortho)),
             join_by(between(x$begin_ms, y$begin_ms, y$end_ms),
                     between(x$end_ms, y$begin_ms, y$end_ms)),
             suffix = c("", ".y")) |>
  mutate(ortho = ortho.y[id.y]) |>
  select(id, parentid = id.y, ortho) |>               # THE ONLY CHANGE
  right_join(wide, by = "id", suffix = c(".y", "")) |>
  mutate(ortho = coalesce(ortho, ortho.y)) |>
  select(-ortho.y) |>
  arrange(id)
# # A tibble: 3 × 8
#      id parentid begin_ms end_ms reaction ortho  words metric
#   <int>    <int>    <int>  <int>    <int> <chr>  <chr> <chr> 
# 1     1       NA   262432 362232        5 is new NA    NA    
# 2     2        1   262432 263000       30 is new is    A     
# 3     3        1   263000 362232       25 is new new   B

Data

quux <- structure(list(tier = c("ortho", "words", "metric", "words", "metric"), value = c("is new", "is", "A", "new", "B"), begin_ms = c(262432L, 262432L, 262432L, 263000L, 263000L), end_ms = c(362232L, 263000L, 263000L, 362232L, 362232L), reaction = c(5L, 30L, 30L, 25L, 25L)), class = "data.frame", row.names = c(NA, -5L))