rcomparegrouped-table

Better Solution than For Loops when Comparing Values of Grouped Variables in R


I am fairly new to R, and have spent a long time looking for a better way to solve my issue below with no success. I was able to develop a solution using for loops that works; however, I am breaking the rules described on this GitHub tutorial discussing what to avoid when writing loops.

I am working with sales data. My particular data frame includes Product Categories ("CAT_NO"), Customer Deciles ("CUST_DECILE") (customers are placed into a Decile group from 1 to 10, with 1 being the "best" customers) and floor gross margins ("floorGM") for that Product Category, Customer Decile combination. It may be worth noting that not every Product Category will have all Customer Deciles represented (e.g., Sample Category "A" may only have Customer Deciles 4, 7, and 9. For simplicity purposes, the reproducible example below ensures that each Product Category has all 10 Customer Deciles). My data set can be represented by:

    df <- data.frame(CAT_NO = c(rep(c("A"), times = 10), rep(c("B"), times = 10),
                        rep(c("C"), times = 10), rep(c("D"), times = 10))
             , CUST_DECILE = rep(c(1:10), times = 4), floorGM = runif(40, 0.2, 0.8))

    df

My goal is to look at each product category and compare the floor gross margins of each customer decile; if a customer in a lower decile has a higher floorGM than a customer in a higher decile, the higher decile customer should take the lower decile's floorGM.

The logic I've used subsets the data by each CAT_NO, and then applies a loop to compare the floorGM of each CUST_DECILE within that CAT_NO. My code is:

    Product_Categories <- as.character(unique(df$CAT_NO))

    for(k in seq_along(Product_Categories)) {
      subdata <- subset(df, CAT_NO == Product_Categories[k])
      deciles <- sort(unique(subdata$CUST_DECILE))

      for(k in 2:length(deciles)) {
        if(subdata[subdata$CUST_DECILE == subdata$CUST_DECILE[k], "floorGM"< subdata[subdata$CUST_DECILE == subdata$CUST_DECILE[k-1], "floorGM"]) {
          subdata[subdata$CUST_DECILE == subdata$CUST_DECILE[k], "floorGM"] <- subdata[subdata$CUST_DECILE == subdata$CUST_DECILE[k-1], "floorGM"]
        }
       }
      if (!exists("temp")) {
         temp <- subdata
      } else {
         temp <- rbind(temp, subdata) 
      }
    }

While this works, I'm certain there is a faster way to perform this operation, particularly as growing my data set with rbind() during the loop is going to impede performance as I scale this solution across millions of transactions.

Thank you for any input and/or additional references!


Solution

  • Can't vouch for how fast this will be for millions of rows (it certainly took a while for 40,000 rows on my slow system), but here's a solution (using dplyr):

    df<-group_by(df,CAT_NO)
    df<-mutate(df, lag=lag(floorGM))
    while (any(df$floorGM<df$lag,na.rm=T)) {
      df<-mutate(df, floorGM=ifelse(!is.na(lag),ifelse(floorGM<lag,lag,floorGM),floorGM))
      df<-mutate(df, lag=lag(floorGM))
    }
    

    The while loop basically bubbles the floorGM numbers over the entire category.

    (actually, come to think of it, it shouldn't require a great many loops anyway - since each category can only have 10 deciles - so I think it should be ok).