I have a messy data set, which generally resembles the output of the following
schools_messy <- tibble::tribble(
~data,
"state:maryland",
"location:bowie||name:bowie state university",
"grade:freshman||count:100",
"grade:sophomore||count:200",
"grade:junior||count:300",
"grade:senior||count:400",
"location:baltimore||name:coppin state university",
"grade:freshman||count:100",
"grade:sophomore||count:200",
"grade:junior||count:300",
"grade:senior||count:400",
"state:virginia",
"location:williamsburg||name:college of william and mary",
"grade:freshman||count:100",
"grade:sophomore||count:200",
"grade:junior||count:300",
"grade:senior||count:400",
"location:fairfax||name:george mason university",
"grade:freshman||count:100",
"grade:sophomore||count:200",
"grade:junior||count:300",
"grade:senior||count:400",
)
My desired end state is to have it resemble the output of the following
schools_tidy <- tribble(
~state, ~location, ~name, ~grade, ~count,
"maryland", "bowie", "bowie state university", "freshman", 100,
"maryland", "bowie", "bowie state university", "sophomore", 200,
"maryland", "bowie", "bowie state university", "junior", 300,
"maryland", "bowie", "bowie state university", "senior", 400,
"maryland", "baltimore", "coppin state university", "freshman", 100,
"maryland", "baltimore", "coppin state university", "sophomore", 200,
"maryland", "baltimore", "coppin state university", "junior", 300,
"maryland", "baltimore", "coppin state university", "senior", 400,
"virginia", "williamsburg", "college of william and mary", "freshman", 100,
"virginia", "williamsburg", "college of william and mary", "sophomore", 200,
"virginia", "williamsburg", "college of william and mary", "junior", 300,
"virginia", "williamsburg", "college of william and mary", "senior", 400,
"virginia", "fairfax", "george mason university", "freshman", 100,
"virginia", "fairfax", "george mason university", "sophomore", 200,
"virginia", "fairfax", "george mason university", "junior", 300,
"virginia", "fairfax", "george mason university", "senior", 400,
)
I'm afraid that I'm quite at a loss for how to go about cleaning up the data set. It almost resembles a mangled json data set. I tried to do some manipulation to turn it into a proper json data set, then convert it to the desired tibble end state from there, but I was unsuccessful.
In base R you can do the following:
a <- sub("(count:\\d+$)", "\\1\n\n", schools_messy$data)
b <- textConnection(sub("||", "\n", a, fixed = TRUE))
d <- read.dcf(b, all = TRUE)
d[] <- lapply(d, \(x)x[cummax(seq_along(x) * !is.na(x))])
d
state location name grade count
1 maryland bowie bowie state university freshman 100
2 maryland bowie bowie state university sophomore 200
3 maryland bowie bowie state university junior 300
4 maryland bowie bowie state university senior 400
5 maryland baltimore coppin state university freshman 100
6 maryland baltimore coppin state university sophomore 200
7 maryland baltimore coppin state university junior 300
8 maryland baltimore coppin state university senior 400
9 virginia williamsburg college of william and mary freshman 100
10 virginia williamsburg college of william and mary sophomore 200
11 virginia williamsburg college of william and mary junior 300
12 virginia williamsburg college of william and mary senior 400
13 virginia fairfax george mason university freshman 100
14 virginia fairfax george mason university sophomore 200
15 virginia fairfax george mason university junior 300
16 virginia fairfax george mason university senior 400