Sample Data:
product_id <- c("1000","1000","1000","1000","1000","1000", "1002","1002","1002","1002","1002","1002")
qty_ordered <- c(1,2,1,1,1,1,1,2,1,2,1,1)
price <- c(2.49,2.49,2.49,1.743,2.49,2.49, 2.093,2.093,2.11,2.11,2.11, 2.97)
date <- c("2/23/15","2/23/15", '3/16/15','3/16/15','5/16/15', "6/18/15", "2/19/15","3/19/15","3/19/15","3/19/15","3/19/15","4/19/15")
sampleData <- data.frame(product_id, qty_ordered, price, date)
I would like to identify every time when a change in a price occurred. Also, I would like to sum() the total qty_ordered between those two price change dates. For example,
For product_id == "1000"
, price changed occurred on 3/16/15 from $2.49 to $1.743. The total qty_ordered is 1+2+1=4;
the difference between those two earliest date of price change is from 2/23/15 to 3/16/15 which is 21 days.
So the New Data Frame should be:
product_id sum_qty_ordered price date_diff
1000 4 2.490 21
1000 1 1.743 61
1000 2 2.490 33
Here are what I have tried:
**NOTE: for this case, a simple "dplyr::group_by
" will not work since it will ignore the date effect.
1) I found this code from Determine when columns of a data.frame change value and return indices of the change: This is to identify every time when the price changed, which identify the first date when the price changed for each product.
IndexedChanged <- c(1,which(rowSums(sapply(sampleData[,3],diff))!=0)+1)
sampleData[IndexedChanged,]
However, I am not sure how to calculate the sum(qty_ordered)
and the date difference for each of those entries if I use that code.
2) I tried to write a WHILE loop to temporarily store each batch of product_id
, price, range of dates (e.g. a subset of data frame with one product_id
, one price, and all entries ranged from the earliest date of price change till the last date of price before it changed),
and then, summarise that subset to get sum(sum_qty_ordered
) and the date diff.
However, I think I always am confused by WHILE and FOR, so my code has some problems in it. Here is my code:
create an empty data frame for later data storage
NewData_Ready <- data.frame(
product_id = character(),
price = double(),
early_date = as.Date(character()),
last_date=as.Date(character()),
total_qty_demanded = double(),
stringsAsFactors=FALSE)
create a temp table to store the batch price order entries
temp_dataset <- data.frame(
product_id = character(),
qty_ordered = double(),
price = double(),
date=as.Date(character()),
stringsAsFactors=FALSE)
loop: This is messy...and probably not make sense, so I do really help on this.
for ( i in unique(sampleData$product_id)){
#for each unique product_id in the dataset, we are gonna loop through it based on product_id
#for first product_id which is "1000"
temp_table <- sampleData[sampleData$product_id == "i", ] #subset dataset by ONE single product_id
#this dataset only has product of "1000" entries
#starting a new for loop to loop through the entire entries for this product
for ( p in 1:length(temp_table$product_id)){
current_price <- temp_table$price[p] #assign current_price to the first price value
#assign $2.49 to current price.
min_date <- temp_table$date[p] #assign the first date when the first price change
#assign 2015-2-23 to min_date which is the earliest date when price is $2.49
while (current_price == temp_table$price[p+1]){
#while the next price is the same as the first price
#that is, if the second price is $2.49 is the same as the first price of $2.49, which is TRUE
#then execute the following statement
temp_dataset <- rbind(temp_dataset, temp_table[p,])
#if the WHILE loop is TRUE, means every 2 entries have the same price
#then combine each entry when price is the same in temp_table with the temp_dataset
#if the WHILE loop is FALSE, means one entry's price is different from the next one
#then stop the statement at the above, but do the following
current_price <- temp_table$price[p+1]
#this will reassign the current_price to the next price, and restart the WHILE loop
by_idPrice <- dplyr::group_by(temp_dataset, product_id, price)
NewRow <- dplyr::summarise(
early_date = min(date),
last_date = max(date),
total_qty_demanded = sum(qty_ordered))
NewData_Ready <- rbind(NewData_Ready, NewRow)
}
}
}
I have searched a lot on related questions but I have not found anything that are related to this problem yet. If you have some suggestions, please let me know. Also, please provide some suggestions on the solution to my questions. I would greatly appreciate your time and help!
Here is my R version:
platform x86_64-apple-darwin13.4.0
arch x86_64
os darwin13.4.0
system x86_64, darwin13.4.0
status
major 3
minor 3.1
year 2016
month 06
day 21
svn rev 70800
language R
version.string R version 3.3.1 (2016-06-21)
nickname Bug in Your Hair
Using data.table
:
library(data.table)
setDT(sampleData)
Some Preprocessing:
sampleData[, firstdate := as.Date(date, "%m/%d/%y")]
Based on how you calculate date diff, we are better off creating a range of dates for each row:
sampleData[, lastdate := shift(firstdate,type = "lead"), by = product_id]
sampleData[is.na(lastdate), lastdate := firstdate]
# Arun's one step: sampleData[, lastdate := shift(firstdate, type="lead", fill=firstdate[.N]), by = product_id]
Then create a new ID for every change in price:
sampleData[, price_id := cumsum(c(0,diff(price) != 0)), by = product_id]
Then calculate your groupwise functions, by product and price run:
sampleData[,
.(
price = unique(price),
sum_qty = sum(qty_ordered),
date_diff = max(lastdate) − min(firstdate)
),
by = .(
product_id,
price_id
)
]
product_id price_id price sum_qty date_diff
1: 1000 0 2.490 4 21 days
2: 1000 1 1.743 1 61 days
3: 1000 2 2.490 2 33 days
4: 1002 0 2.093 3 28 days
5: 1002 1 2.110 4 31 days
6: 1002 2 2.970 1 0 days
I think the last price change for 1000
is only 33 days, and the preceding one is 61 (not 60). If you include the first day it is 22, 62 and 34, and the line should read date_diff = max(lastdate) − min(firstdate) + 1