I have the following dataset:
tier | value | begin_ms | end_ms | reaction |
---|---|---|---|---|
ortho | is new | 262432 | 362232 | 5 |
words | is | 262432 | 263000 | 30 |
metric | A | 262432 | 263000 | 30 |
words | new | 263000 | 362232 | 25 |
metric | B | 263000 | 362232 | 25 |
I was trying to create a new data frame in a more tidy fashion, in which I would have each occurrence of the column ortho and the occurences within the same begin_ms and end_ms as columns. I tried to use
data_spread <- spread(dfgs_final, key = tier, value = value)
but it only partially worked, looking like this:
ortho | begin_ms | end_ms | words | metric | reaction |
---|---|---|---|---|---|
is new | 262432 | 362232 | 5 | ||
262432 | 263000 | is | A | 30 | |
263000 | 362232 | new | B | 25 |
Is there a way to group everything that is within the begin_ms and end_ms of the ortho column? I have something like this in mind:
ortho | begin_ms | end_ms | words | metric | reaction |
---|---|---|---|---|---|
is new | 262432 | 362232 | 5 | ||
is new | 262432 | 263000 | is | A | 30 |
is new | 263000 | 362232 | new | B | 25 |
FYI, spread
has been retired/superseded since Aug 2019 (4.5 years ago) and its replacement pivot_wider
is much more powerful, I suggest you migrate to that. The equivalent code is here, plus I'm adding an id
field for the second code block.
library(dplyr)
library(tidyr)
wide <- pivot_wider(quux, id_cols = c("begin_ms", "end_ms", "reaction"),
names_from = "tier", values_from = "value") |>
mutate(id = row_number())
wide
# # A tibble: 3 × 7
# begin_ms end_ms reaction ortho words metric id
# <int> <int> <int> <chr> <chr> <chr> <int>
# 1 262432 362232 5 is new NA NA 1
# 2 262432 263000 30 NA is A 2
# 3 263000 362232 25 NA new B 3
From here, we can do a range-based join on the rows with is.na(ortho)
on those that are not NA
, reassign ortho
, and combine back with the data.
filter(wide, is.na(ortho)) |>
right_join(filter(wide, !is.na(ortho)),
join_by(between(x$begin_ms, y$begin_ms, y$end_ms),
between(x$end_ms, y$begin_ms, y$end_ms)),
suffix = c("", ".y")) |>
mutate(ortho = ortho.y[id.y]) |>
select(id, ortho) |>
right_join(wide, by = "id", suffix = c(".y", "")) |>
mutate(ortho = coalesce(ortho, ortho.y)) |>
select(-ortho.y) |>
arrange(id)
# # A tibble: 3 × 7
# id begin_ms end_ms reaction ortho words metric
# <int> <int> <int> <int> <chr> <chr> <chr>
# 1 1 262432 362232 5 is new NA NA
# 2 2 262432 263000 30 is new is A
# 3 3 263000 362232 25 is new new B
In the end, it might be preferred for you to have unique id
fields for each row; if you want to retain the "parent" id
, we can do that with little change.
filter(wide, is.na(ortho)) |>
right_join(filter(wide, !is.na(ortho)),
join_by(between(x$begin_ms, y$begin_ms, y$end_ms),
between(x$end_ms, y$begin_ms, y$end_ms)),
suffix = c("", ".y")) |>
mutate(ortho = ortho.y[id.y]) |>
select(id, parentid = id.y, ortho) |> # THE ONLY CHANGE
right_join(wide, by = "id", suffix = c(".y", "")) |>
mutate(ortho = coalesce(ortho, ortho.y)) |>
select(-ortho.y) |>
arrange(id)
# # A tibble: 3 × 8
# id parentid begin_ms end_ms reaction ortho words metric
# <int> <int> <int> <int> <int> <chr> <chr> <chr>
# 1 1 NA 262432 362232 5 is new NA NA
# 2 2 1 262432 263000 30 is new is A
# 3 3 1 263000 362232 25 is new new B
Data
quux <- structure(list(tier = c("ortho", "words", "metric", "words", "metric"), value = c("is new", "is", "A", "new", "B"), begin_ms = c(262432L, 262432L, 262432L, 263000L, 263000L), end_ms = c(362232L, 263000L, 263000L, 362232L, 362232L), reaction = c(5L, 30L, 30L, 25L, 25L)), class = "data.frame", row.names = c(NA, -5L))