rtidyr

Separate string column of varying lengths and multiple delimiters into new columns using tidyr


I would like to separate the prod_code_date column (a string of varying lengths, from one "pair" to multiple "pairs") into multiple columns based on the delimiter ":" and ",".

The ":" delimiter separates related information into product code and date of purchase - which can be thought of as paired information, while the "," delimiter separates the different pairs of information for the same product number (prod_no).

Expected intermediate result from separate_wider_delim The number of columns created should be based on the number of delimiters in the column. The column names should be code_1, date_1, code_2, date_2, code_x, date_x, etc.

Final expected result (sample data below): A long table with the following columns: prod_no, code, date, with repeated prod_no rows for as many pairs as there are in the prod_code_date column.

library(tidyverse)

# Data
df <- tibble(prod_no = 1:4, prod_code_date = c("' ZB10.90 : 2013-04-29'", "' XJ11.90 : 2016-10-20, ZB25.22 : 2013-10-16, ZB25.29 : 2011-12-06, XJ14.20 : 2022-03-23, ZB10.90 : 2022-12-16, ZB10.90 : 2011-12-06, QP50.19 : 2016-03-11, QP12.90 : 2012-01-20, MS44.9 : 2022-03-23'", "' MS34.3 : 2022-10-04, QP13.20 : 1998-05-26, QP50.13 : 2008-10-10, MS44.9 : 2017-05-16'", "' QP10.90 : 2008-08-11, QP11.90 : 2019-04-15'"))

# Attempt (failed) using separate_wider_delim() function. After which I would have pivoted the data to a long format.
intermediate_result <- df %>% separate_wider_delim(prod_code_date, delim = c(":", ","), names = c("code_1", "date_1", "code_2", "date_2", "code_x", "date_x"))

# Expected output: A long table with the following columns: prod_no, code, date, with repeated prod_no rows for as many pairs as there are in the prod_code_date column.
final_result <- tibble(prod_no = c(1, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4), code = c("QP10.90", "XJ11.90", "QP25.22", "QP25.29", "XJ14.20", "QP10.90", "QP10.90", "QP50.19", "QP12.90", "MS34.3", "QP13.20", "QP50.13", "MS44.9", "QP10.90", "QP11.90"), date = c("2013-04-29", "2016-10-20", "2013-10-16", "2011-12-06", "2022-03-23", "2022-12-16", "2011-12-06", "2016-03-11", "2012-01-20", "2022-10-04", "1998-05-26", "2008-10-10", "2017-05-16", "2008-08-11", "2019-04-15"))


Solution

  • An approach using separate_longer_delim and a consecutive separate_wider_delim

    library(dplyr)
    library(tidyr)
    
    df %>% 
      separate_longer_delim(prod_code_date, ",") %>% 
      mutate(prod_code_date = gsub(" +|'", "", prod_code_date)) %>% 
      separate_wider_delim(prod_code_date, delim=":", names=c("code", "date"))
    # A tibble: 16 × 3
       prod_no code    date
         <int> <chr>   <chr>
     1       1 ZB10.90 2013-04-29
     2       2 XJ11.90 2016-10-20
     3       2 ZB25.22 2013-10-16
     4       2 ZB25.29 2011-12-06
     5       2 XJ14.20 2022-03-23
     6       2 ZB10.90 2022-12-16
     7       2 ZB10.90 2011-12-06
     8       2 QP50.19 2016-03-11
     9       2 QP12.90 2012-01-20
    10       2 MS44.9  2022-03-23
    11       3 MS34.3  2022-10-04
    12       3 QP13.20 1998-05-26
    13       3 QP50.13 2008-10-10
    14       3 MS44.9  2017-05-16
    15       4 QP10.90 2008-08-11
    16       4 QP11.90 2019-04-15