I have a table with approximately 2000 entries that contains names
, positions
, field of expertise
, and addresses of professors
. The table is quite messy, and I’m struggling to find a programmatic way to transform and pivot it into a tidy format.
My goal is to create a tidy table that includes at least the following columns: names
, positions
, field of expertise
, and email address
.
(one recurring pattern throughout the data is that each person is separated by a NA
from the next one.)
Here is a sample of the data:
data | field |
---|---|
Person 1, MD | A |
Associate Professor of A | A |
Program Associate, A | A |
Senior Medical Director, FGP Operations | A |
UMMG Ambulatory Surgery | A |
Residency Program | A |
Core Educational Lead | A |
Email1@email.edu | A |
NA | A |
Person 2, MD | B |
Person 2 | B |
Clinical Assistant Professor of B | B |
Medical Student Clerkship and Core Educational Lead | B |
Email2@email.edu | B |
NA | B |
labcoat | B |
Person 3, MD | B |
Clinical Assistant Professor of B | B |
Email3@email.edu | B |
NA | B |
labcoat | B |
Person 4, MD | B |
Professor of B | B |
Professor of B | B |
Email4@email.edu | B |
NA | B |
Person 5, MD | C |
Person 5 | C |
Professor of C | C |
Professor of C | C |
Associate Chair, Quality | C |
Department of Urology and Service Chief | C |
Email5@email.edu | C |
132-547-1321 | C |
NA | C |
Here is the tibble
code (to reproduce):
tibble::tribble(
~data, ~field,
"Person 1, MD", "A",
"Associate Professor of A", "A",
"Program Associate, A", "A",
"Senior Medical Director, FGP Operations", "A",
"UMMG Ambulatory Surgery", "A",
"Residency Program", "A",
"Core Educational Lead", "A",
"Email1@email.edu", "A",
NA, "A",
"Person 2, MD", "B",
"Person 2", "B",
"Clinical Assistant Professor of B", "B",
"Medical Student Clerkship and Core Educational Lead", "B",
"Email2@email.edu", "B",
NA, "B",
"labcoat", "B",
"Person 3, MD", "B",
"Clinical Assistant Professor of B", "B",
"Email3@email.edu", "B",
NA, "B",
"labcoat", "B",
"Person 4, MD", "B",
"Professor of B", "B",
"Professor of B", "B",
"Email4@email.edu", "B",
NA, "B",
"Person 5, MD", "C",
"Person 5", "C",
"Professor of C", "C",
"Professor of C", "C",
"Associate Chair, Quality", "C",
"Department of Urology and Service Chief", "C",
"Email5@email.edu", "C",
"132-547-1321", "C",
NA, "C"
)
Answered before your comment regarding each person being separated by NA so it does not retain any rows after an email address (but you implied this was not important in your question anyway). Here's a stepwise approach. Highlight up to each pipe "|>" and run to see how each step works. Comment below if you need further explanation.
library(dplyr)
library(tidyr)
library(stringr)
df_wide <- df |>
mutate(names = case_when(str_detect(data, ", MD$") ~ "name",
str_detect(data, "@") ~ "email",
.default = NA),
data = str_remove(data, ", MD$"),
tmp_start = cumsum(!is.na(names) & names == "name"),
tmp_end = lag(cumsum(!is.na(names) & names == "email"), default = 0),
id = if_else(tmp_start == tmp_end, NA, tmp_start)) |>
group_by(id) |>
fill(id, .direction = "downup") |>
select(-starts_with("tmp")) |>
filter(!is.na(id)) |>
filter(!duplicated(data, fromFirst = TRUE)) |>
mutate(names = if_else(is.na(names), paste0("position", 1:n() - 1), names)) |>
pivot_wider(names_from = names,
values_from = data) |>
ungroup() |>
select(-id)
data.frame(df_wide)
# field name position1 position2 position3 position4 position5 position6 email
# A Person 1 Associate Professor of A Program Associate, A Senior Medical Director, FGP Operations UMMG Ambulatory Surgery Residency Program Core Educational Lead Email1@email.edu
# B Person 2 Clinical Assistant Professor of B Medical Student Clerkship and Core Educational Lead <NA> <NA> <NA> <NA> Email2@email.edu
# B Person 3 Clinical Assistant Professor of B <NA> <NA> <NA> <NA> <NA> Email3@email.edu
# B Person 4 Professor of B <NA> <NA> <NA> <NA> <NA> Email4@email.edu
# C Person 5 Professor of C Associate Chair, Quality Department of Urology and Service Chief <NA> <NA> <NA> Email5@email.edu