I have a large table of cells (rows) that belong to a certain class. In a next step, the distribution of classes has to change and I want to identify which cells need to be moved between classes minimising the costs that occur by moving cells between classes.
library(tidyverse)
library(transport)
set.seed(19)
## classes: 1 - 5
## overall cost matrix for transport
costs <- matrix(expand_grid(a = 1:5, b = 1:5) %>% mutate(cost = abs(b - a)) %>%
pull(cost), ncol = 5, nrow =5)
## dataset with initial classes and individual transport costs into other classes
dat <- tibble(class = sample(rep(1:5, each = 1000), 200)) %>%
bind_cols(t(apply(., 1, function(x) abs(x - 1:5)*runif(5))) %>%
as_tibble() %>% setNames(paste0("cl", 1:5)))
The dataset with initial classes and costs to move the row into the other classes (cl1-5).
head(dat)
# A tibble: 6 × 6
class cl1 cl2 cl3 cl4 cl5
<int> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2 0.298 0 0.353 1.52 2.57
2 5 0.129 1.12 0.986 0.867 0
3 3 0.940 0.446 0 0.811 0.670
4 5 0.759 2.82 1.65 0.582 0
5 3 1.39 0.134 0 0.438 1.39
6 5 2.89 0.492 0.870 0.954 0
I first solved the mass transport problem, identifying the number of cells that need to be moved from a certain class into another:
## mass distribution of classes
prob0 <- dat %>% group_by(class) %>% summarise(p = n()/nrow(dat))
## new distribution of classes
prob1 <- tibble(class = sample(rep(1:5, each = 1000), 200)) %>%
group_by(class) %>% summarise(p = n()/nrow(dat))
## Solving transport problem of masses
trans <- transport(prob0$p, prob1$p, costs) %>% filter(from!=to) %>%
left_join(dat %>% group_by(class) %>%
summarise(rows = n()), by = join_by("from"=="class")) %>%
mutate(nrRows = round(rows*mass,0))
trans
from to mass rows nrRows
1 1 4 0.03 39 1
2 3 4 0.06 46 3
3 5 4 0.04 48 2
With this, I know how many cells should be moved from the initial class into a new class.
QUESTION: What method can I use to efficiently find the best rows to move according to the costs (minimisation costs for all changes of classes) that occur by moving the row?
I wonder of that problem can be solved directly without calculating the mass flow between classes, i.e. moving rows directly minimising the costs until the desired class distribution is achieved?
As already mentioned you can model it as binary transportation problem where sum of each row must be 1 and sum of each column #rows * wanted distribution.
m <- as.matrix(dat[,-1])
p0 <- prob1[["p"]]
res <- lpSolve::lp.transport(
cost.mat = m,
direction = "min",
row.signs = rep("==", nrow(m)),
row.rhs = rep(1, nrow(dat)),
col.signs = rep("==", ncol(m)),
col.rhs = nrow(dat) * p0
)
res[["objval"]] # cost of changes
new_groups <- apply(res[["solution"]], 1, function(x) which(x == 1)) # new group for each row
table(dat[["class"]], new_groups) # change matrix
change_rows <- which(new_groups != dat[["class"]])
if(length(change_rows) > 0)
changes <- data.frame(
row = change_rows,
old_group = dat[["class"]][change_rows],
new_group = new_groups[change_rows]
)
This obviously wont work if it is impossible to fit new distribution - nrow(dat) * p0
must be integer vector. In other cases you can use mco::nsga2
to minimize cost of group change and distribution error.