r string loops split

How to get R to create a new column named from the left part of a string in an old column, and put the right part of the string into that new column


Given an existing data frame containing a character column such as the one shown below (oldColumn1), I want R to automatically create a new column, in the same data frame, named from the left part of the string (e.g. COLOR).

Then, for each row, the right part of the string, i.e. whatever appears after the ": " (e.g. RED, BLUE, etc.), should go into the new column named "COLOR".

There are many old columns (oldColumn1, oldColumn2, etc.) that need to be split out like this, so doing it manually is impractical. Thanks in advance for any help you might provide.

# Example with 3 oldColumns that already exist in the data frame, plus the
# new columns (COLOR, SIZE, DESIGNSTYLE) that I want R to create automatically.
# The real data has thousands of these columns, so each new column has to be
# created programmatically. Ideally I would put the oldColumn names in a
# vector and pass it to a function that creates a new column for each one.

oldColumn1 <- c('COLOR: RED', 'COLOR: RED', 'COLOR: BLUE', 'COLOR: GREEN', 'COLOR: BLUE')
oldColumn2 <- c('SIZE: LARGE', 'SIZE: MEDIUM','SIZE: XLARGE','SIZE: MEDIUM','SIZE: SMALL')
oldColumn3 <- c('DESIGNSTYLE: STYLED', 'DESIGNSTYLE: ORIGINAL MAKER', 'DESIGNSTYLE: COUTURE','DESIGNSTYLE: COUTURE','DESIGNSTYLE: STYLED')
COLOR <- c('RED', 'RED', 'BLUE', 'GREEN', 'BLUE')
SIZE <- c('LARGE', 'MEDIUM', 'XLARGE', 'MEDIUM', 'SMALL')
DESIGNSTYLE <- c('STYLED', 'ORIGINAL MAKER', 'COUTURE', 'COUTURE', 'STYLED')
dat <- data.frame(oldColumn1, oldColumn2, oldColumn3, COLOR, SIZE, DESIGNSTYLE)
dat

Solution

  • Starting with

    quux <- structure(list(oldColumn1 = c("COLOR: RED", "COLOR: RED", "COLOR: BLUE", "COLOR: GREEN", "COLOR: BLUE")), class = "data.frame", row.names = c(NA, -5L))
    

    The naive approach would be

    data.frame(COLOR = trimws(sub("COLOR:", "", quux$oldColumn1)))
    #   COLOR
    # 1   RED
    # 2   RED
    # 3  BLUE
    # 4 GREEN
    # 5  BLUE
    

    But I'm assuming you have a more generic need. Suppose the column holds more than one kind of key to parse out, such as

    quux <- structure(list(oldColumn1 = c("COLOR: RED", "COLOR: RED", "COLOR: BLUE", "COLOR: GREEN", "COLOR: BLUE", "SIZE: 1", "SIZE: 3", "SIZE: 5")), class = "data.frame", row.names = c(NA, -8L))
    quux
    #     oldColumn1
    # 1   COLOR: RED
    # 2   COLOR: RED
    # 3  COLOR: BLUE
    # 4 COLOR: GREEN
    # 5  COLOR: BLUE
    # 6      SIZE: 1
    # 7      SIZE: 3
    # 8      SIZE: 5
    

    then we can generalize it with

    tmp <- strcapture("(.*)\\s*:\\s*(.*)", quux$oldColumn1, list(k="", v=""))
    tmp$ign <- ave(rep(1L, nrow(tmp)), tmp$k, FUN = seq_along)
    reshape2::dcast(tmp, ign ~ k, value.var = "v")[,-1,drop=FALSE]
    #   COLOR SIZE
    # 1   RED    1
    # 2   RED    3
    # 3  BLUE    5
    # 4 GREEN <NA>
    # 5  BLUE <NA>
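
    If you would rather not depend on reshape2, here is a rough base-R sketch of the same reshaping step. It assumes the same `quux` data as above; `cols`, `n`, and `wide` are names made up for this illustration. The idea is to group the values by key with `split()` and then pad the shorter groups with `NA` so they line up as columns.

    ```r
    # Same example data as above.
    quux <- data.frame(
      oldColumn1 = c("COLOR: RED", "COLOR: RED", "COLOR: BLUE", "COLOR: GREEN",
                     "COLOR: BLUE", "SIZE: 1", "SIZE: 3", "SIZE: 5"),
      stringsAsFactors = FALSE
    )

    # Split each string into a key (k) and a value (v).
    tmp <- strcapture("(.*)\\s*:\\s*(.*)", quux$oldColumn1, list(k = "", v = ""))

    # One character vector of values per key.
    cols <- split(tmp$v, tmp$k)

    # Pad the shorter vectors with NA so every column has the same length.
    n <- max(lengths(cols))
    wide <- as.data.frame(lapply(cols, `length<-`, n), stringsAsFactors = FALSE)
    wide
    ```

    This should give the same two columns as the `dcast` call, with `NA` filling the rows where a key has fewer values.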
    

    --

    Edit: alternative with updated data:

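    # Use only the "KEY: VALUE" columns; dat in the question also holds the desired output columns.
    old <- dat[grep("^oldColumn", names(dat))]
    do.call(cbind, lapply(old, function(X) {
      nm <- sub(":.*", "", X[1])                     # key, e.g. "COLOR"
      out <- data.frame(trimws(sub(".*:", "", X)))   # value, e.g. "RED"
      names(out) <- nm
      out
    }))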
    #   COLOR   SIZE    DESIGNSTYLE
    # 1   RED  LARGE         STYLED
    # 2   RED MEDIUM ORIGINAL MAKER
    # 3  BLUE XLARGE        COUTURE
    # 4 GREEN MEDIUM        COUTURE
    # 5  BLUE  SMALL         STYLED
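
    Since the question asks for the new columns to land in the same data frame, driven by a vector of old column names, here is a minimal sketch of that variant. The helper name `split_key_value_cols` and its arguments are made up for this example. It assumes every row of a given old column shares the same key, so the key is read from the first row only.

    ```r
    # Hypothetical helper: for each "KEY: VALUE" column named in old_cols,
    # append a new column named KEY that holds the VALUE part.
    split_key_value_cols <- function(df, old_cols) {
      for (col in old_cols) {
        x <- as.character(df[[col]])
        key <- trimws(sub(":.*", "", x[1]))         # text before the first colon
        df[[key]] <- trimws(sub("^[^:]*:", "", x))  # text after the first colon
      }
      df
    }

    dat <- data.frame(
      oldColumn1 = c("COLOR: RED", "COLOR: RED", "COLOR: BLUE", "COLOR: GREEN", "COLOR: BLUE"),
      oldColumn2 = c("SIZE: LARGE", "SIZE: MEDIUM", "SIZE: XLARGE", "SIZE: MEDIUM", "SIZE: SMALL"),
      stringsAsFactors = FALSE
    )

    # Collect the old column names in a vector and pass them to the helper.
    old_cols <- grep("^oldColumn", names(dat), value = TRUE)
    res <- split_key_value_cols(dat, old_cols)
    names(res)
    ```

    The old columns are kept, so `res` should hold oldColumn1 and oldColumn2 followed by COLOR and SIZE. If a column mixes several keys, the `strcapture` approach above is the better fit.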