raprioristring-aggregationmarket-basket-analysis

Data Preparation for Market Basket Analysis using Apriori Algorith using R


I am planning to run apriori algorithm on a dataset to find association among product purchase. The dataset looks like this

order_id <- c('AG-2011-2040','IN-2011-47883', 'HU-2011-1220','IT-2011-3647632','IN-2011-47883','IN-2011-47883','IN-2011-30733','CA-2011-115161','AO-2011-1390','ID-2011-56493'  )
    
product_name <- c('Tenex Lockers, Blue','Acme Trimmer, High Speed', 'Tenex Box, Single Width', 'Enermax Note Cards, Premium','Eldon Light Bulb, Duo Pack','Eaton Computer Printout Paper, 8.5 x 11', 'Brother Personal Copier, Laser','Sauder Facets Collection Library, Sky Alder Finish', 'Fellowes Lockers, Wire Frame','Tenex Trays, Single Width')
        
df <- data.frame(order_id, product_name)
df 

In the image, you can see that on the left there is Order.ID and in the right column, there's product name. I want to concatenate all the products that have the same id in a single row. In other words, I want to aggregate all the products based on Order.ID. Since products are not numerical, it is hard to do so.

Any idea on how to approach this problem would be appreciated. Alternatively, if you have a different idea on how to prepare this data to run apriori algorithm, that would be great too!

Thanks in advance!


Solution

  • You can concatenate all the products using the same order ID using the dplyr package's summarize (with the .by function for order id) and base R paste. Here, I chose to separate by a comma, but could change to anything else you'd want:

    library(dplyr)
    df %>% 
      summarise(ProductName = 
                  paste(ProductName, collapse = ", "), 
                .by = OrderID)
    
    #  OrderID      ProductName
    #1    EX-1       B, E, J, W
    #2    EX-2             D, I
    #3    EX-3             S, Y
    #4    EX-4             K, V
    #5    EX-5                C
    #6    EX-6    F, G, H, N, O
    #7    EX-7             L, T
    #8    EX-8 A, M, P, Q, R, U
    #9    EX-9                X
    
    # or for older versions of dplyr:
    df %>% 
      group_by(OrderID) %>%
      summarise(ProductName = 
                  paste(ProductName, collapse=", "))
    

    Data:

    df <- data.frame(OrderID = paste0("EX-", sample(1:9, 25, replace = TRUE)),
                     ProductName = LETTERS[1:25])
    df <- df[order(df$OrderID),]