rdata.tabletidyversebigcartelsplitstackshape

R split column in BigCartel csv file into long format in dataframe or data.table


Big Cartel has an option that exports orders into a csv file. However the structure is not very good for the analysis I need to do.

Here is a subset of the columns and rows from a Big cartel csv order download (there are other columns which are not significant to the problem at hand).

Number, Buyer name,Items,Item count,Item total,Total price,Total shipping,Total tax,Total discount
1,jim,product_name:Plate|product_option_name:Red|quantity:1|price:9.99|total:9.99,1,9.99,11.98,1.99,0,0
2,bill,product_name:Plate|product_option_name:Green|quantity:1|price:9.99|total:9.99;product_name:Plate|product_option_name:Blue|quantity:1|price:9.99|total:9.99,2,19.98,22.98,3,0,0
3,jane,product_name:Plate|product_option_name:Red|quantity:1|price:6.99|total:6.99;product_name:Thingy|product_option_name:|quantity:1|price:9.99|total:9.99;product_name:Mug|product_option_name:Grey|quantity:1|price:10.99|total:10.99;product_name:Cup|product_option_name:Grey|quantity:1|price:9.99|total:9.99;product_name:Saucer|product_option_name:Grey|quantity:1|price:9.99|total:9.99;product_name:Stopper|product_option_name:|quantity:1|price:9.99|total:9.99,6,57.94,64.94,7,0,0
4,dale,product_name:Plate|product_option_name:Green|quantity:1|price:10.99|total:10.99,1,10.99,13.99,4.99,0,1.99

The items column can have multiple "line-items" with a semicolon (;) as the separator. Each "line-item" has five attributes separated with a pipe (|), i.e. product_name,product_option_name, quantity, price and total (i.e. for the line). There is a column, "Item count", which gives the number of "line-items" plus columns for (order) Total Price, shipping, Tax and Discount. For analysis I'd like the data in the following long format where shipping, tax and discount are also treated as 'product items'.

Number Buyer name line-item    product_option_name quantity price total
1      jim        Plate        Red                 1        9.99  9.99
1      jim        shipping                         1        1.99  1.99
1      jim        tax                              0        0     0
1      jim        discount                         0        0     0
2      bill       Plate        Green               1        9.99  9.99
2      bill       Plate        Blue                1        9.99  9.99
2      bill       shipping                         1        3     3
2      bill       tax                              0        0     0
2      bill       discount                         0        0     0
3      jane       Plate        Red                 1        6.99  6.99
3      jane       Thingy                           1        9.99  9.99
3      jane       Mug          Grey                1        10.99 10.99
3      jane       Cup          Grey                1        9.99  9.99
3      jane       Saucer       Grey                1        9.99  9.99
3      jane       Stopper                          1        9.99  9.99
3      jane       shipping                         1        7     7
3      jane       tax                              0        0     0
3      jane       discount                         0        0     0
4      dale       Plate        Green               1        10.99 10.99
4      dale       shipping                         1        4.99  4.99
4      dale       tax                              0        0     
4      dale       discount                         0        -1.99 -1.99

Using tstrsplit() from r:data.table and cSplit() from r:splitstackshape seems to be the solution but I can't get the syntax right. I've also tried the tidyverse/dplyr functions separate/spread etc but I just cannot get the output I need whatever I do.

I've been googling and searching through all the SO questions - there are some solutions (this one R: Split Variable Column into multiple (unbalanced) columns by comma) that are close but none quite get me there as most assume a wide' format rather than 'long'.


Solution

  • Something like this may get you what you're looking for.

    library(dplyr)
    library(tidyr)
    library(stringr)
    
    filepath <- # Path to datafile here
    
    df <- read.csv(filepath, stringsAsFactors = FALSE)
    
    cols <- paste0("col", 1:(max(str_count(df$Items, ";")) + 1))
    
    df <- df %>%
          separate(col = Items, into = cols, sep = ";", fill = "right") %>%
          gather_("column", "details", cols, na.rm = TRUE) %>%
          select(-column) %>%
          separate(col = details, into = c("product_name", "product_option_name","quantity","price","total"), sep = "\\|", fill = "right") %>%
          mutate(product_name = sub("^.*\\:", "", product_name),
                 product_option_name = sub("^.*\\:", "", product_option_name),
                 quantity = sub("^.*\\:", "", quantity),
                 price = sub("^.*\\:", "", price),
                 total = sub("^.*\\:", "", total)) %>%
          gather("line", "item", c(Total.shipping, Total.discount, Total.tax, product_name)) %>%
          mutate(product_option_name = ifelse(line == "product_name" & product_option_name != "", product_option_name, NA),
                 line_item = ifelse(line == "product_name", item, sub("^.*\\.","", line)),
                 price = ifelse(line == "product_name", price, item),
                 price = ifelse(line_item == "discount", as.numeric(price) * (-1), price),
                 quantity = ifelse(line_item %in% c("shipping","discount","tax") & price == "0", 0, quantity),
                 total = as.numeric(price) * as.numeric(quantity)) %>%
          distinct() %>%
          select(Number, Buyer.name, line_item, product_option_name, quantity, price, total) %>%
          arrange(Number)