r string dataframe separator

Splitting an R string on multiple criteria (the data has no simple delimiter such as spaces or commas)


I have a dataset where each row is a character string like this (and this is all under one column):

Jacoby BrissettO 206.5-115U 206.5-115Joe BurrowO 243.5-115U 243.5-155.

I want to split this into rows and columns like so:

Name             O/U Line
Jacoby Brissett  206.5
Joe Burrow       243.5

I can post specific code if needed, but this example illustrates my exact problem, so you can paste the example string into R and try it yourself. I've looked everywhere for a solution, but most similar questions are solved by splitting on commas or spaces, which, as you can see, isn't possible here.

I tried the tidyverse separate() function, gsub(), str_remove(), and str_split_fixed(). I couldn't figure out the arguments to make any of them work, and I don't even know whether these functions are the right tools for this.


Solution

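  • We assume that a new player's record starts wherever a digit is immediately followed by an uppercase letter and then a lowercase letter, that "O " separates the name from the value we want, and that everything from the first "-" in a record onward can be discarded.

    Then we can use this code (the _ placeholder for the native pipe needs R 4.2 or later):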

    Lines <- "Jacoby BrissettO 206.5-115U 206.5-115Joe BurrowO 243.5-115U 243.5-155."
    
    Lines |>
      gsub("(\\d)([A-Z][a-z])", "\\1\n\\2", x = _) |>
      gsub("O ", ",", x = _) |>
      read.table(text = _, sep = ",", comment = "-", col.names = c("Name", "Value"))
    

    giving

                 Name Value
    1 Jacoby Brissett 206.5
    2      Joe Burrow 243.5
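
To see why this works, it can help to run the same three steps one at a time and look at the intermediate strings. The sketch below does only that. The names `step1`, `step2` and `result` are made up here for illustration and are not part of the original answer, and `comment.char` is simply the full name of the `comment` argument used above.

```r
Lines <- "Jacoby BrissettO 206.5-115U 206.5-115Joe BurrowO 243.5-115U 243.5-155."

# Step 1: a digit directly followed by an uppercase and then a lowercase
# letter marks the start of the next player, so insert a newline there.
step1 <- gsub("(\\d)([A-Z][a-z])", "\\1\n\\2", Lines)
writeLines(step1)

# Step 2: "O " sits between the name and the over line; turn it into a comma.
step2 <- gsub("O ", ",", step1)
writeLines(step2)

# Step 3: read the text as comma-separated data, treating "-" as a comment
# character so that everything from the first "-" on each line is dropped.
result <- read.table(text = step2, sep = ",", comment.char = "-",
                     col.names = c("Name", "Value"))
print(result)
```

After step 1 there is one player per line, after step 2 each line starts with `Name,Value`, and step 3 throws away the rest of each line.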
    

    Added

    Some additional data was added in a comment. Running the same code on that data seems to work as well.

    Lines <- c("Jacoby BrissettO 206.5-115U 206.5-115Joe BurrowO 243.5-115U 243.5-155.",
    "Gardner MinshewO 255.5-115U 255.5-115Justin HerbertO 150.5-115U 150.5-115",
    "Bo NixO 140.5-115U 140.5-115Geno SmithO 137.5-115U 137.5−115")
    
    Lines |>
      gsub("(\\d)([A-Z][a-z])", "\\1\n\\2", x = _) |>
      gsub("O ", ",", x = _) |>
      read.table(text = _, sep = ",", comment = "-", col.names = c("Name", "Value"))
    

    giving

                 Name Value
    1 Jacoby Brissett 206.5
    2      Joe Burrow 243.5
    3 Gardner Minshew 255.5
    4  Justin Herbert 150.5
    5          Bo Nix 140.5
    6      Geno Smith 137.5
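
Since the question says the strings all sit in one column of a dataset, the same steps could be wrapped in a small function and applied to that column. This is only a sketch under assumptions: the data frame `df`, its column `raw` and the helper `parse_lines` are hypothetical names, and `quote = ""` is an addition not in the code above (it stops `read.table` from treating an apostrophe in a name as a quote character). Plain function calls are used instead of the pipe, so it does not depend on the `_` placeholder.

```r
# Hypothetical helper wrapping the three steps from the answer.
parse_lines <- function(x) {
  x <- gsub("(\\d)([A-Z][a-z])", "\\1\n\\2", x)  # one player per line
  x <- gsub("O ", ",", x)                          # "Name,Value-..." per line
  # quote = "" is an assumption added here so apostrophes are read literally
  read.table(text = x, sep = ",", comment.char = "-", quote = "",
             col.names = c("Name", "Value"))
}

# Hypothetical input: one raw string per row in a column called `raw`.
df <- data.frame(raw = c(
  "Jacoby BrissettO 206.5-115U 206.5-115Joe BurrowO 243.5-115U 243.5-155.",
  "Gardner MinshewO 255.5-115U 255.5-115Justin HerbertO 150.5-115U 150.5-115"
))

out <- parse_lines(df$raw)
print(out)
```

Note that the splitting rule still rests on the same assumptions as the answer: a name that starts with two capitals right after a digit, for example, would not be split onto its own line.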