I'm working with a large dataset (400M observations) of cars. Each entry has a VIN (vehicle identification number) identifying it as a specific car. I group the dataset by VIN, so each group is one car (usually with multiple entries). I'm currently using the dplyr package (I recently learned R via the tidyverse).
I am trying to remove all observations where a vehicle increases in price as the car gets older (appreciates in value over time).
Let's look at one car as an example. I've removed all columns besides the relevant ones, data_year (year in which the data was recorded) and veh_price (vehicle price), below:
ex_car <- structure(list(data_year = c(2003, 2008, 2009, 2009, 2010, 2012,
2013, 2015), veh_price = c(2341, 50, 50, 100, 2600, 450, 3750,
23000)), row.names = c(NA, -8L), class = c("tbl_df", "tbl", "data.frame"
))
The rows are already sorted by data_year (ascending order). As mentioned above, I am attempting to filter out any row whose price is higher than a price seen earlier for that car.
data_year veh_price
2003      2341
2008        50
2009        50
2009       100
2010      2600
2012       450
2013      3750
2015     23000
The desired output:
data_year veh_price
2003 2341
2008 50
2009 50
I considered using mutate() and lead() to create a new column holding the next row's price, calculating the difference, and then filtering out all rows with a negative difference (see below):
filtered_sample <- ex_car |>
  mutate(
    next_price = lead(veh_price),
    diff_p = veh_price - next_price
  ) |>
  filter(diff_p >= 0)
However, I'm concerned that adding two columns is too inefficient to run on the large dataset. Is there a better, more efficient way of solving this problem? (I've heard data.table is optimized for larger operations; is there a way to use that package to achieve the same goal?)
When I was first approaching this issue, I thought of iterating once through the vector while storing a temporary lowest_price variable (initialized to the first value in the vector): if the next price is lower than or equal to the current lowest (<=), set it as the new lowest_price; if it is greater (>), remove the value. However, I couldn't figure out how to do this well in R.
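For concreteness, a literal (and surely far too slow at 400M rows) translation of that loop idea on the one-car example might look like this sketch:

keep <- logical(nrow(ex_car))
lowest_price <- ex_car$veh_price[1]
for (i in seq_len(nrow(ex_car))) {
  if (ex_car$veh_price[i] <= lowest_price) {
    lowest_price <- ex_car$veh_price[i]  # new running minimum, keep this row
    keep[i] <- TRUE
  }  # otherwise the price rose above the minimum, so the row is dropped
}
ex_car[keep, ]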
Thanks for your time, please let me know if I failed to comprehensively describe the problem or if you need any more information.
UPDATE: x == cummin(.) might be prone to R FAQ 7.31, a problem with tests of equality on high-precision floating-point numbers. While not likely with this data, if the price-like variable has high precision and/or you want to be a bit safer in this regard, use @GregorThomas's suggested alternative of diff(.) <= 0, though this needs to be made cumulative somehow:
# dplyr
ex_car %>%
  group_by(VIN) %>%
  filter(cumall(c(TRUE, diff(veh_price) <= 0)))
# data.table
as.data.table(ex_car)[, .SD[c(TRUE, cumsum(diff(veh_price) > 0) == 0),], by = .(VIN)]
# base
ind <- with(ex_car, !!ave(veh_price, VIN, FUN = function(z) c(TRUE, cumsum(diff(z) > 0) == 0)))
ex_car[ind,]
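A quick illustration of why exact equality can bite with floating point (the classic FAQ 7.31 example, not specific to this data):

0.1 + 0.2 == 0.3                 # FALSE, binary representation error
abs((0.1 + 0.2) - 0.3) < 1e-9    # TRUE, a tolerance-based comparison is safer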
We can use cummin here (relying on non-decreasing data_year). (That's short for cumulative minimum, not a mis-spelling of the spice cumin ;-)
library(dplyr)
ex_car %>%
  filter(veh_price == cummin(veh_price))
# # A tibble: 3 × 2
# data_year veh_price
# <dbl> <dbl>
# 1 2003 2341
# 2 2008 50
# 3 2009 50
Grouping by your VIN:
ex_car %>%
  group_by(VIN) %>%
  filter(veh_price == cummin(veh_price))

### or with dplyr_1.1.0 or newer
ex_car %>%
  filter(veh_price == cummin(veh_price), .by = VIN)
library(data.table)
as.data.table(ex_car)[ veh_price == cummin(veh_price),]
# data_year veh_price
# <num> <num>
# 1: 2003 2341
# 2: 2008 50
# 3: 2009 50
Grouping by your VIN (data.table does not evaluate i per group, so the filter moves into j via .SD):

as.data.table(ex_car)[, .SD[veh_price == cummin(veh_price)], by = .(VIN)]
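At 400M rows, two data.table habits may help (a sketch, assuming your full table carries the VIN column): setDT() converts in place instead of copying like as.data.table() does, and the .I idiom collects row numbers per group without materializing .SD:

library(data.table)
setDT(ex_car)  # in-place conversion, no copy of the 400M rows
keep <- ex_car[, .I[veh_price == cummin(veh_price)], by = VIN]$V1
ex_car[keep]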
ind <- with(ex_car, veh_price == cummin(veh_price))
ex_car[ind,]
# # A tibble: 3 × 2
# data_year veh_price
# <dbl> <dbl>
# 1 2003 2341
# 2 2008 50
# 3 2009 50
Grouping by your VIN requires the addition of ave (whose numeric 0/1 result is coerced back to logical with !!):

ind <- with(ex_car, !!ave(veh_price, VIN, FUN = function(z) z == cummin(z)))
ex_car[ind,]