r optimization cluster-analysis hierarchical-clustering operations-research

How do I make clusters based on a fixed capacity of each cluster in R?

So I have a dataset with 600 points, their latitude, longitude, and demands. I have to make clusters such that for each cluster the points will be near each other and the total capacity of that cluster will not exceed a certain limit.

A sample dataset for the problem:

set.seed(123)
id<- seq(1:600)
lon <- rnorm(600, 88.5, 0.125)
lat <- rnorm(600, 22.4, 0.15)
demand <- round(rnorm(600, 40, 20))

df<- data.frame(id, lon, lat, demand)

What I want to have approximately:

What I am getting (cluster boundaries are approximated):

The code I've written:

library(tidyverse)
constrained_cluster <- function(df,capacity=170){
  lon_max <- max(df$lon)
  lat_max <- max(df$lat)

#Calculating the distance between an extreme point and all other points

  df$distance<-6377.83*acos(sin(lat_max*p)*sin(df$lat*p) + cos(lat_max*p)*cos(df$lat*p) * cos((lon_max-df$lon)*p))
  df<- df[order(df$distance, decreasing = FALSE),]
  d<-0
  cluster_number<-1
  cluster_list<- c()
  i<-1
#Writing a loop to form the cluster which will fill up the cluster_list accordingly
  while (i <= length(df$distance)){
    d <- d+ df$demand[i]
    if(d<=capacity){
      cluster_list[i] <- cluster_number
      i<- i+1
    }
    else{
      cluster_number <- cluster_number+1
      d <- 0
      i<-i
    }
  }
#Return a dataframe with the list of clusters
  return(cbind(df,as.data.frame(cluster_list)))
}

df_with_cluster<- constrained_cluster(df, capacity = 1000)

Solution

Here is one possible approach, in which I treat the problem directly as an optimisation problem.

Suppose you have a feasible partition of the rows into groups. Not necessarily a good one, but one that does not violate the constraints. For every group (cluster), you compute the centre. Then you compute the distances of all points in a group to the group's centre, and sum them. In this way, you have a measure of quality of your initial partition.

Now, randomly pick on row, and move it into another group. You get new solution. Complete the steps as before, and compare the new solution's quality with the previous one. If it's better, keep it. If it's worse, stay with the old solution. Now repeat this whole procedure for a fixed number of iterations.

This process is called a Local Search. Of course, it is not guaranteed it will take you to an optimum solution. But it will likely give you a good solution. (Note that k-means implementations are typically stochastic as well, and there is no guaranty for an "optimal" partition.)

The good thing about a Local Search is that it gives you much flexibility. For instance, I assumed you started with a feasible solution. Suppose you make a random move (i.e. move one row into another cluster), but now this new cluster is to big. You could now simply discard this new, infeasible solution, and draw a new one.

Here is a code example, really just an outline; but with luck it is useful for you.

set.seed(123)
id<- seq(1:600)
lon <- rnorm(600, 88.5, 0.125)
lat <- rnorm(600, 22.4, 0.15)
demand <- round(rnorm(600, 40, 20))

df<- data.frame(id, lon, lat, demand)

Fix a number of clusters, k.

k <- 5

Start with kmeans and plot the solution.

par(mfrow = c(1, 3))
km <- kmeans(cbind(df$lat, df$lon), centers = k)

cols <- hcl.colors(n = k, "Cold")
plot(df$lon,
     df$lat,
     type = "p", pch = 19, cex = 0.5,
     main = "kmeans")
for (i in seq_len(k)) {
    lines(df$lon[km$cluster == i],
          df$lat[km$cluster == i],
          type = "p", pch = 19,
          col = cols[i])
}

Now a Local Search. I use an implementation in package NMOF (which I maintain).

library("NMOF")

## a random initial solution
x0 <- sample(1:k, length(id), replace = TRUE)
X <- as.matrix(df[, 2:3])

The objective function: it takes a partition x and computes the sum of distances, for all clusters.

sum_diff <- function(x, X, k, ...) {
    groups <- seq_len(k)
    d_centre <- numeric(k)
    for (g in groups) {
        centre <- colMeans(X[x == g, ], )
        d <- t(X[x == g, ]) - centre
        d_centre[g] <- sum(sqrt(colSums(d * d)))
    }
    sum(d_centre)
}

The neighbourhood function: it takes a partition and moves one row into another cluster.

nb <- function(x, k, ...) {
    groups <- seq_len(k)

    x_new <- x
    p <- sample.int(length(x), 1)
    g_ <- groups[-x_new[p]]    
    x_new[p] <- g_[sample.int(length(g_), 1)]
    x_new
}

Run the Local Search. I actually use a method called Threshold Accepting, which is based on Local Search, but can move away from local minima. See ?NMOF::TAopt for references on that method.

sol <- TAopt(sum_diff,
             list(x0 = x0,
                  nI = 20000,
                  neighbour = nb),
             X = as.matrix(df[, 2:3]),
             k = k)

Plot the solution.

plot(df$lon,
     df$lat,
     type = "p", pch = 19, cex = 0.5,
     main = "Local search")
for (i in seq_len(k)) {
    lines(df$lon[sol$xbest == i],
          df$lat[sol$xbest == i],
          type = "p", pch = 19, 
          col = cols[i])
}

Now, one way to add a capacity constraint. We start with a feasible solution.

## CAPACITY-CONSTRAINED
max.demand <- 6600
all(tapply(df$demand, x0, sum) < max.demand)
## TRUE

The constraint is handled in the neighbourhood. If the new solution exceeds the capacity, it is discarded.

nb_constr <- function(x, k, demand, max.demand,...) {
    groups <- seq_len(k)
    x_new <- x
    p <- sample.int(length(x), 1)
    g_ <- groups[-x_new[p]]    
    x_new[p] <- g_[sample.int(length(g_), 1)]

    ## if capacity is exceeded, return
    ## original solution
    if (sum(demand[x_new == x_new[p]]) > max.demand)
        x
    else
        x_new
}

Run the method and compare the results.

sol <- TAopt(sum_diff,
             list(x0 = x0,
                  nI = 20000,
                  neighbour = nb_constr),
             X = as.matrix(df[, 2:3]),
             k = k,
             demand = df$demand,
             max.demand = max.demand)

plot(df$lon,
     df$lat,
     type = "p", pch = 19, cex = 0.5,
     main = "Local search w/ constraint")
for (i in seq_len(k)) {
    lines(df$lon[sol$xbest == i],
          df$lat[sol$xbest == i],
          type = "p", pch = 19,
          col = cols[i])
}

all(tapply(df$demand, sol$xbest, sum) < max.demand)
## TRUE

This is really just an example and could be improved. For instance, the objective function here recomputes the distance of all groups, when it would only need to look at the changed groups.