rtidyversedata-wrangling

How to tidy messy data


I have a messy data set, which generally resembles the output of the following

schools_messy <- tibble::tribble(
  ~data,
  "state:maryland",
  "location:bowie||name:bowie state university",
  "grade:freshman||count:100",
  "grade:sophomore||count:200",
  "grade:junior||count:300",
  "grade:senior||count:400",
  "location:baltimore||name:coppin state university",
  "grade:freshman||count:100",
  "grade:sophomore||count:200",
  "grade:junior||count:300",
  "grade:senior||count:400",
  
  "state:virginia",
  "location:williamsburg||name:college of william and mary",
  "grade:freshman||count:100",
  "grade:sophomore||count:200",
  "grade:junior||count:300",
  "grade:senior||count:400",
  "location:fairfax||name:george mason university",
  "grade:freshman||count:100",
  "grade:sophomore||count:200",
  "grade:junior||count:300",
  "grade:senior||count:400",
)

My desired end state is to have it resemble the output of the following

schools_tidy <- tribble(
  ~state, ~location, ~name, ~grade, ~count,
  "maryland", "bowie", "bowie state university", "freshman", 100,
  "maryland", "bowie", "bowie state university", "sophomore", 200,
  "maryland", "bowie", "bowie state university", "junior", 300,
  "maryland", "bowie", "bowie state university", "senior", 400,
  "maryland", "baltimore", "coppin state university", "freshman", 100,
  "maryland", "baltimore", "coppin state university", "sophomore", 200,
  "maryland", "baltimore", "coppin state university", "junior", 300,
  "maryland", "baltimore", "coppin state university", "senior", 400,
  "virginia", "williamsburg", "college of william and mary", "freshman", 100,
  "virginia", "williamsburg", "college of william and mary", "sophomore", 200,
  "virginia", "williamsburg", "college of william and mary", "junior", 300,
  "virginia", "williamsburg", "college of william and mary", "senior", 400,
  "virginia", "fairfax", "george mason university", "freshman", 100,
  "virginia", "fairfax", "george mason university", "sophomore", 200,
  "virginia", "fairfax", "george mason university", "junior", 300,
  "virginia", "fairfax", "george mason university", "senior", 400,
)

I'm afraid that I'm quite at a loss for how to go about cleaning up the data set. It almost resembles a mangled json data set. I tried to do some manipulation to turn it into a proper json data set, then convert it to the desired tibble end state from there, but I was unsuccessful.


Solution

  • In base R you can do the following:

    a <- sub("(count:\\d+$)", "\\1\n\n", schools_messy$data)
    b <- textConnection(sub("||", "\n", a, fixed = TRUE))
    d <- read.dcf(b, all = TRUE)
    d[] <- lapply(d, \(x)x[cummax(seq_along(x) * !is.na(x))])
    d
          state     location                        name     grade count
    1  maryland        bowie      bowie state university  freshman   100
    2  maryland        bowie      bowie state university sophomore   200
    3  maryland        bowie      bowie state university    junior   300
    4  maryland        bowie      bowie state university    senior   400
    5  maryland    baltimore     coppin state university  freshman   100
    6  maryland    baltimore     coppin state university sophomore   200
    7  maryland    baltimore     coppin state university    junior   300
    8  maryland    baltimore     coppin state university    senior   400
    9  virginia williamsburg college of william and mary  freshman   100
    10 virginia williamsburg college of william and mary sophomore   200
    11 virginia williamsburg college of william and mary    junior   300
    12 virginia williamsburg college of william and mary    senior   400
    13 virginia      fairfax     george mason university  freshman   100
    14 virginia      fairfax     george mason university sophomore   200
    15 virginia      fairfax     george mason university    junior   300
    16 virginia      fairfax     george mason university    senior   400