rreshaperiverplot

Reshape data frame for consecutive years


I have data about thousands of customers who visited stores in the 3 past years. For each customer, I have :

Customer_Id | Year_*_Store 
1            2010_A
1            2011_B
1            2012_C
2            2010_A
2            2011_B
2            2012_D

What I’d like to have is the following structure of data in order to visualize the evolution of the customers’behaviour with a riverplot( aka Sankey plot)

For instance the 2 customers, who firstly visited the store A in 2010, firstly visited the store B in 2011:

SOURCE |     TARGET |   NB_CUSTOMERS
2010_A      2011_B      2
2011_B      2012_C      1
2011_B      2012_D      1

I don't want links between two years which are not consecutive like 2010_A and 2012_D

How can I do that in R ?


Solution

  • Note that you can't have a * in column names (see ?make.names). Here is a basic approach:

    1. Split Year_store into two separate columns Year and Store in your data frame; at the moment it contains two completely different kinds of data and you actually need to process them separately.

    2. Make a NextYear column, defined as Year + 1

    3. Make a NextStore column, in which you assign the store code matching Customer_Id and for which Year is the same as this row's NextYear, assigning NA if there is no record of the customer visiting a store the next year, and throwing an error if the data do not meet the required specification (are ambiguous about which store was visited first the next year).

    4. Strip out any of the rows in which NextStore is NA, and combine the NextYear and NextStore columns into a NextYear_NextStore column.

    5. Summarize your data frame by the Year_store and NextYear_NextStore columns e.g. using ddply in the plyr package.

    Some sample data:

    # same example data as question
    customer.df <- data.frame(Customer_Id = c(1, 1, 1, 2, 2, 2),
        Year_Store = c("2010_A", "2011_B", "2012_C", "2010_A", "2011_B", "2012_D"),
        stringsAsFactors = FALSE)
    
    # alternative data should throw error, customer 2 is inconsistent in 2011
    badCustomer.df <- data.frame(Customer_Id = c(1, 1, 1, 2, 2, 2),
        Year_Store = c("2010_A", "2011_B", "2012_C", "2010_A", "2011_B", "2011_D"),
        stringsAsFactors = FALSE)
    

    And an implementation:

    require(plyr)
    
    splitYearStore <-  function(df) {
        df$Year <- as.numeric(substring(df$Year_Store, 1, 4))
        df$Store <- as.character(substring(df$Year_Store, 6))
        return(df) 
    }
    
    findNextStore <- function(df, matchCust, matchYear) {
        matchingStore <- with(df,
            df[Customer_Id == matchCust & Year == matchYear, "Store"])
        if (length(matchingStore) == 0) {
            return(NA)
        } else if (length(matchingStore) > 1) {
            errorString <- paste("Inconsistent store results for customer",
                matchCust, "in year", matchYear)
            stop(errorString)
        } else {
            return(matchingStore)
        }
    }
    
    tabulateTransitions <-  function(df) {
        df <- splitYearStore(df)
        df$NextYear <- df$Year + 1
        df$NextStore <- mapply(findNextStore, matchCust = df$Customer_Id,
            matchYear = df$NextYear, MoreArgs = list(df = df)) 
        df$NextYear_NextStore <- with(df, paste(NextYear, NextStore, sep = "_"))
        df <- df[!is.na(df$NextStore),]
        df <- ddply(df, .(Source = Year_Store, Target = NextYear_NextStore),
            summarise, No_Customers = length(Customer_Id))
        return(df) 
    }
    

    Results:

    > tabulateTransitions(customer.df)
      Source Target No_Customers
    1 2010_A 2011_B            2
    2 2011_B 2012_C            1
    3 2011_B 2012_D            1
    > tabulateTransitions(badCustomer.df)
    Error in function (df, matchCust, matchYear)  : 
      Inconsistent store results for customer 2 in year 2011
    

    No attempt has been made to optimise; if your data set is massive then perhaps you should investigate a data.table solution.