This is my first stackoverflow post after years of passive browsing. I am stuck on this problem and it's driving me nuts! Thanks so much for helping.
I have a data frame of inventory supply and demand bucketed by age. I have collected this data across time for many products. Demand for inventory of a certain age can be met with supply of the same age or younger. I am trying to calculate how much demand can be met with supply within each age bucket, filled youngest to oldest.
The data frame will be large (10^7 rows) so I am trying to do this with dplyr
, mutate
, lag
, and cumsum
versus a loop, which I suspect will be slow.
Here is a sample group within my data set (product and date groupings omitted):
library(dplyr)
Inventory <- data.frame(
Age = c(90, 120, 270, 365, Inf),
Demand = c(0, 5000, 25, 5000, 10),
Supply = c(4000, 50, 4000, 300, 0))
View(Inventory)
The result I am expecting is:
Result <- Inventory
Result$Start = c(0, 4000, 0, 3975, 0)
Result$In = c(4000, 50, 4000, 300, 0)
Result$Out = c(0, 4050, 25, 4275, 0)
Result$End = c(4000, 0, 3975, 0, 0)
Result$Short = c(0, 950, 0, 725, 10)
View(Result)
I applied standard inventory calculations above:
I am having no luck using dplyr, but I think there is a solution using clever combinations of max, min, lag, and cumsum.
If speed becomes an issue over a large number of rows then potentially the fastest way to deal with iterating calculations is via Rcpp
.
You essentially need a cummulative-sum-but-floored-at-zero function, which adds the Supply - Demand outcome of each day to the last total and zeros it if it's negative. Here's a cumnominus
trial function which gives the right table and can be used in dplyr
:
library(dplyr)
cumnominus <- Rcpp::cppFunction("NumericVector cumnominus(NumericVector x) {
int n = x.size();
NumericVector sumout(n);
sumout[0] = (x[0] < 0) ? 0 : x[0];
for(int i = 1; i < n; i++) {
sumout[i] = (x[i] < 0) ? 0 : x[i] + sumout[i - 1];
}
return sumout;
}")
Inventory |>
mutate(In = Supply,
End = cumnominus(Supply - Demand),
Start = lag(End, default = 0),
Short = pmax(0, Demand - (Start + Supply)),
Out = pmin(Demand, Start + In)) |>
select(Age, Demand, Supply, Start, In, Out, End, Short)
#> Age Demand Supply Start In Out End Short
#> 1 90 0 4000 0 4000 0 4000 0
#> 2 120 5000 50 4000 50 4050 0 950
#> 3 270 25 4000 0 4000 25 3975 0
#> 4 365 5000 300 3975 300 4275 0 725
#> 5 Inf 10 0 0 0 0 0 10
Result
#> Age Demand Supply Start In Out End Short
#> 1 90 0 4000 0 4000 0 4000 0
#> 2 120 5000 50 4000 50 4050 0 950
#> 3 270 25 4000 0 4000 25 3975 0
#> 4 365 5000 300 3975 300 4275 0 725
#> 5 Inf 10 0 0 0 0 0 10
As a bit of a test against an R-only loop on a dataframe of 5m rows it takes around 0.05s compared to R-loop of 8.5s:
cumnominus_r <- function(x) {
out_sum <- integer(length(x))
out_sum[1] <- max(0, x[1])
for (i in 2:length(x)) {
out_sum[i] <- ifelse(x[i] < 0, 0, out_sum[i - 1] + x[i])
}
out_sum
}
big_df <- tibble(
Demand = sample(seq(1000, 6000, 500), 5000000, replace = TRUE),
Supply = sample(seq(1000, 6000, 500), 5000000, replace = TRUE)
)
bench::mark(
Rcpp_fun = big_df |>
mutate(End = cumnominus(Supply - Demand)),
R_only_fun = big_df |>
mutate(End = cumnominus_r(Supply - Demand))
)
#> # A tibble: 3 × 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 Rcpp_fun 43.15ms 52.95ms 16.1 77.7MB 8.94
#> 2 R_only_fun 8.59s 8.59s 0.116 95.4MB 20.8