Big Cartel has an option that exports orders into a csv file. However the structure is not very good for the analysis I need to do.
Here is a subset of the columns and rows from a Big cartel csv order download (there are other columns which are not significant to the problem at hand).
Number, Buyer name,Items,Item count,Item total,Total price,Total shipping,Total tax,Total discount
1,jim,product_name:Plate|product_option_name:Red|quantity:1|price:9.99|total:9.99,1,9.99,11.98,1.99,0,0
2,bill,product_name:Plate|product_option_name:Green|quantity:1|price:9.99|total:9.99;product_name:Plate|product_option_name:Blue|quantity:1|price:9.99|total:9.99,2,19.98,22.98,3,0,0
3,jane,product_name:Plate|product_option_name:Red|quantity:1|price:6.99|total:6.99;product_name:Thingy|product_option_name:|quantity:1|price:9.99|total:9.99;product_name:Mug|product_option_name:Grey|quantity:1|price:10.99|total:10.99;product_name:Cup|product_option_name:Grey|quantity:1|price:9.99|total:9.99;product_name:Saucer|product_option_name:Grey|quantity:1|price:9.99|total:9.99;product_name:Stopper|product_option_name:|quantity:1|price:9.99|total:9.99,6,57.94,64.94,7,0,0
4,dale,product_name:Plate|product_option_name:Green|quantity:1|price:10.99|total:10.99,1,10.99,13.99,4.99,0,1.99
The items column can have multiple "line-items" with a semicolon (;) as the separator. Each "line-item" has five attributes separated with a pipe (|), i.e. product_name,product_option_name, quantity, price and total (i.e. for the line). There is a column, "Item count", which gives the number of "line-items" plus columns for (order) Total Price, shipping, Tax and Discount. For analysis I'd like the data in the following long format where shipping, tax and discount are also treated as 'product items'.
Number Buyer name line-item product_option_name quantity price total
1 jim Plate Red 1 9.99 9.99
1 jim shipping 1 1.99 1.99
1 jim tax 0 0 0
1 jim discount 0 0 0
2 bill Plate Green 1 9.99 9.99
2 bill Plate Blue 1 9.99 9.99
2 bill shipping 1 3 3
2 bill tax 0 0 0
2 bill discount 0 0 0
3 jane Plate Red 1 6.99 6.99
3 jane Thingy 1 9.99 9.99
3 jane Mug Grey 1 10.99 10.99
3 jane Cup Grey 1 9.99 9.99
3 jane Saucer Grey 1 9.99 9.99
3 jane Stopper 1 9.99 9.99
3 jane shipping 1 7 7
3 jane tax 0 0 0
3 jane discount 0 0 0
4 dale Plate Green 1 10.99 10.99
4 dale shipping 1 4.99 4.99
4 dale tax 0 0
4 dale discount 0 -1.99 -1.99
Using tstrsplit() from r:data.table and cSplit() from r:splitstackshape seems to be the solution but I can't get the syntax right. I've also tried the tidyverse/dplyr functions separate/spread etc but I just cannot get the output I need whatever I do.
I've been googling and searching through all the SO questions - there are some solutions (this one R: Split Variable Column into multiple (unbalanced) columns by comma) that are close but none quite get me there as most assume a wide' format rather than 'long'.
Something like this may get you what you're looking for.
library(dplyr)
library(tidyr)
library(stringr)
filepath <- # Path to datafile here
df <- read.csv(filepath, stringsAsFactors = FALSE)
cols <- paste0("col", 1:(max(str_count(df$Items, ";")) + 1))
df <- df %>%
separate(col = Items, into = cols, sep = ";", fill = "right") %>%
gather_("column", "details", cols, na.rm = TRUE) %>%
select(-column) %>%
separate(col = details, into = c("product_name", "product_option_name","quantity","price","total"), sep = "\\|", fill = "right") %>%
mutate(product_name = sub("^.*\\:", "", product_name),
product_option_name = sub("^.*\\:", "", product_option_name),
quantity = sub("^.*\\:", "", quantity),
price = sub("^.*\\:", "", price),
total = sub("^.*\\:", "", total)) %>%
gather("line", "item", c(Total.shipping, Total.discount, Total.tax, product_name)) %>%
mutate(product_option_name = ifelse(line == "product_name" & product_option_name != "", product_option_name, NA),
line_item = ifelse(line == "product_name", item, sub("^.*\\.","", line)),
price = ifelse(line == "product_name", price, item),
price = ifelse(line_item == "discount", as.numeric(price) * (-1), price),
quantity = ifelse(line_item %in% c("shipping","discount","tax") & price == "0", 0, quantity),
total = as.numeric(price) * as.numeric(quantity)) %>%
distinct() %>%
select(Number, Buyer.name, line_item, product_option_name, quantity, price, total) %>%
arrange(Number)