rsplitaggregatespread

how to aggregate data that occured within the same time window?


I have the following dataset:

tier value begin_ms end_ms reaction
ortho is new 262432 362232 5
words is 262432 263000 30
metric A 262432 263000 30
words new 263000 362232 25
metric B 263000 362232 25

I was trying to create a new data frame in a more tidy fashion, in which I would have each occurrence of the column ortho and the occurences within the same begin_ms and end_ms as columns. I tried to use

data_spread <- spread(dfgs_final, key = tier, value = value)

but it only partially worked, looking like this:

ortho begin_ms end_ms words metric reaction
is new 262432 362232 5
262432 263000 is A 30
263000 362232 new B 25

Is there a way to group everything that is within the begin_ms and end_ms of the ortho column? I have something like this in mind:

ortho begin_ms end_ms words metric reaction
is new 262432 362232 5
is new 262432 263000 is A 30
is new 263000 362232 new B 25

Solution

  • FYI, spread has been retired/superseded since Aug 2019 (4.5 years ago) and its replacement pivot_wider is much more powerful, I suggest you migrate to that. The equivalent code is here, plus I'm adding an id field for the second code block.

    library(dplyr)
    library(tidyr)
    wide <- pivot_wider(quux, id_cols = c("begin_ms", "end_ms", "reaction"),
                        names_from = "tier", values_from = "value") |>
      mutate(id = row_number())
    wide
    # # A tibble: 3 × 7
    #   begin_ms end_ms reaction ortho  words metric    id
    #      <int>  <int>    <int> <chr>  <chr> <chr>  <int>
    # 1   262432 362232        5 is new NA    NA         1
    # 2   262432 263000       30 NA     is    A          2
    # 3   263000 362232       25 NA     new   B          3
    

    From here, we can do a range-based join on the rows with is.na(ortho) on those that are not NA, reassign ortho, and combine back with the data.

    filter(wide, is.na(ortho)) |>
      right_join(filter(wide, !is.na(ortho)),
                 join_by(between(x$begin_ms, y$begin_ms, y$end_ms),
                         between(x$end_ms, y$begin_ms, y$end_ms)),
                 suffix = c("", ".y")) |>
      mutate(ortho = ortho.y[id.y]) |>
      select(id, ortho) |>
      right_join(wide, by = "id", suffix = c(".y", "")) |>
      mutate(ortho = coalesce(ortho, ortho.y)) |>
      select(-ortho.y) |>
      arrange(id)
    # # A tibble: 3 × 7
    #      id begin_ms end_ms reaction ortho  words metric
    #   <int>    <int>  <int>    <int> <chr>  <chr> <chr> 
    # 1     1   262432 362232        5 is new NA    NA    
    # 2     2   262432 263000       30 is new is    A     
    # 3     3   263000 362232       25 is new new   B     
    

    In the end, it might be preferred for you to have unique id fields for each row; if you want to retain the "parent" id, we can do that with little change.

    filter(wide, is.na(ortho)) |>
      right_join(filter(wide, !is.na(ortho)),
                 join_by(between(x$begin_ms, y$begin_ms, y$end_ms),
                         between(x$end_ms, y$begin_ms, y$end_ms)),
                 suffix = c("", ".y")) |>
      mutate(ortho = ortho.y[id.y]) |>
      select(id, parentid = id.y, ortho) |>               # THE ONLY CHANGE
      right_join(wide, by = "id", suffix = c(".y", "")) |>
      mutate(ortho = coalesce(ortho, ortho.y)) |>
      select(-ortho.y) |>
      arrange(id)
    # # A tibble: 3 × 8
    #      id parentid begin_ms end_ms reaction ortho  words metric
    #   <int>    <int>    <int>  <int>    <int> <chr>  <chr> <chr> 
    # 1     1       NA   262432 362232        5 is new NA    NA    
    # 2     2        1   262432 263000       30 is new is    A     
    # 3     3        1   263000 362232       25 is new new   B     
    

    Data

    quux <- structure(list(tier = c("ortho", "words", "metric", "words", "metric"), value = c("is new", "is", "A", "new", "B"), begin_ms = c(262432L, 262432L, 262432L, 263000L, 263000L), end_ms = c(362232L, 263000L, 263000L, 362232L, 362232L), reaction = c(5L, 30L, 30L, 25L, 25L)), class = "data.frame", row.names = c(NA, -5L))