rdataframedatesummarization

Identify a value changes' date and summarize the data with sum() and diff() in R


Sample Data:

 product_id <- c("1000","1000","1000","1000","1000","1000", "1002","1002","1002","1002","1002","1002")
    qty_ordered <- c(1,2,1,1,1,1,1,2,1,2,1,1)
    price <- c(2.49,2.49,2.49,1.743,2.49,2.49,  2.093,2.093,2.11,2.11,2.11, 2.97)
    date <- c("2/23/15","2/23/15",  '3/16/15','3/16/15','5/16/15',  "6/18/15",  "2/19/15","3/19/15","3/19/15","3/19/15","3/19/15","4/19/15")
    sampleData <- data.frame(product_id,    qty_ordered,    price,  date)

I would like to identify every time when a change in a price occurred. Also, I would like to sum() the total qty_ordered between those two price change dates. For example, For product_id == "1000", price changed occurred on 3/16/15 from $2.49 to $1.743. The total qty_ordered is 1+2+1=4; the difference between those two earliest date of price change is from 2/23/15 to 3/16/15 which is 21 days.

So the New Data Frame should be:

product_id sum_qty_ordered price    date_diff 
1000           4          2.490             21 
1000           1           1.743            61 
1000           2           2.490            33 

Here are what I have tried:

**NOTE: for this case, a simple "dplyr::group_by" will not work since it will ignore the date effect.

1) I found this code from Determine when columns of a data.frame change value and return indices of the change: This is to identify every time when the price changed, which identify the first date when the price changed for each product.

IndexedChanged <- c(1,which(rowSums(sapply(sampleData[,3],diff))!=0)+1)
sampleData[IndexedChanged,]

However, I am not sure how to calculate the sum(qty_ordered) and the date difference for each of those entries if I use that code.

2) I tried to write a WHILE loop to temporarily store each batch of product_id, price, range of dates (e.g. a subset of data frame with one product_id, one price, and all entries ranged from the earliest date of price change till the last date of price before it changed), and then, summarise that subset to get sum(sum_qty_ordered) and the date diff. However, I think I always am confused by WHILE and FOR, so my code has some problems in it. Here is my code:

create an empty data frame for later data storage

 NewData_Ready <- data.frame(
                     product_id = character(),
                     price = double(),
                     early_date = as.Date(character()),
                     last_date=as.Date(character()),
                     total_qty_demanded = double(),                          
                     stringsAsFactors=FALSE) 

create a temp table to store the batch price order entries

 temp_dataset <- data.frame(
                     product_id = character(),
                     qty_ordered = double(),
                     price = double(),
                     date=as.Date(character()),                                  
                     stringsAsFactors=FALSE) 

loop: This is messy...and probably not make sense, so I do really help on this.

for ( i in unique(sampleData$product_id)){
    #for each unique product_id in the dataset, we are gonna loop through it based on product_id
    #for first product_id which is "1000"
    temp_table <- sampleData[sampleData$product_id == "i", ] #subset dataset by ONE single product_id
    #this dataset only has product of "1000" entries

    #starting a new for loop to loop through the entire entries for this product
    for ( p in 1:length(temp_table$product_id)){

        current_price <- temp_table$price[p] #assign current_price to the first price value
        #assign $2.49 to current price. 
        min_date <- temp_table$date[p] #assign the first date when the first price change
        #assign 2015-2-23 to min_date which is the earliest date when price is $2.49

        while (current_price == temp_table$price[p+1]){
        #while the next price is the same as the first price 
        #that is, if the second price is $2.49 is the same as the first price of $2.49, which is TRUE
        #then execute the following statement

            temp_dataset <- rbind(temp_dataset, temp_table[p,])
            #if the WHILE loop is TRUE, means every 2 entries have the same price
            #then combine each entry when price is the same in temp_table with the temp_dataset

            #if the WHILE loop is FALSE, means one entry's price is different from the next one
            #then stop the statement at the above, but do the following
            current_price <- temp_table$price[p+1]
            #this will reassign the current_price to the next price, and restart the WHILE loop

            by_idPrice <- dplyr::group_by(temp_dataset, product_id, price)
            NewRow <- dplyr::summarise(
                                early_date = min(date),
                                last_date = max(date),
                                total_qty_demanded = sum(qty_ordered))
            NewData_Ready <- rbind(NewData_Ready, NewRow)



        }
    }

}

I have searched a lot on related questions but I have not found anything that are related to this problem yet. If you have some suggestions, please let me know. Also, please provide some suggestions on the solution to my questions. I would greatly appreciate your time and help!

Here is my R version:
platform       x86_64-apple-darwin13.4.0   
arch           x86_64                      
os             darwin13.4.0                
system         x86_64, darwin13.4.0        
status                                     
major          3                           
minor          3.1                         
year           2016                        
month          06                          
day            21                          
svn rev        70800                       
language       R                           
version.string R version 3.3.1 (2016-06-21)
nickname       Bug in Your Hair      

Solution

  • Using data.table:

    library(data.table)
    setDT(sampleData)
    

    Some Preprocessing:

    sampleData[, firstdate := as.Date(date, "%m/%d/%y")]
    

    Based on how you calculate date diff, we are better off creating a range of dates for each row:

    sampleData[, lastdate := shift(firstdate,type = "lead"), by = product_id]
    sampleData[is.na(lastdate), lastdate := firstdate]
    # Arun's one step: sampleData[, lastdate := shift(firstdate, type="lead", fill=firstdate[.N]), by = product_id]
    

    Then create a new ID for every change in price:

    sampleData[, price_id := cumsum(c(0,diff(price) != 0)), by = product_id]
    

    Then calculate your groupwise functions, by product and price run:

    sampleData[,
               .(
                 price = unique(price),
                 sum_qty = sum(qty_ordered),
                 date_diff = max(lastdate) − min(firstdate) 
               ),
               by = .(
                 product_id,
                 price_id
               )
               ]
    
       product_id price_id price sum_qty date_diff
    1:       1000        0 2.490       4   21 days
    2:       1000        1 1.743       1   61 days
    3:       1000        2 2.490       2   33 days
    4:       1002        0 2.093       3   28 days
    5:       1002        1 2.110       4   31 days
    6:       1002        2 2.970       1    0 days
    

    I think the last price change for 1000 is only 33 days, and the preceding one is 61 (not 60). If you include the first day it is 22, 62 and 34, and the line should read date_diff = max(lastdate) − min(firstdate) + 1