r cluster-analysis knn hierarchical-clustering matchit

Pair-cluster across many variables, respecting pre-existing grouping variable


I have a tibble with an id column, a G grouping variable, and 300 numeric variables.

I want a method that clusters the rows so that each row is paired with exactly one other row from the same group, i.e. a 1-nearest-neighbour match within each level of the grouping variable. Spare rows in groups with an odd number of rows can be left out of the clusters.

So, if a group has 4 rows, there will be 2 clusters of 2. If it has 5 rows, there will be 2 clusters of 2 and one spare row.

I think I like the Mahalanobis distance for clustering but I am open to an alternative proposal.

I think that a diagnostic variable holding the intra-cluster Mahalanobis distance could help, too.

Technically speaking, MatchIt does something very similar, but it imposes a binary classification on the rows. I don't want to need such a classification.

Example:

library(tidyverse)

obs <- tibble(
  id = 1:8,
  g = rep(c("A", "B"), 4),
  v1 = rnorm(8),
  v2 = rnorm(8),
  v3 = rnorm(8)
)

How the result would ideally look (the pairs are assigned at random here, only to illustrate the format):

obs %>%
  mutate(cluster = sample(1:2, replace = FALSE) %>% rep(2),
         .by = g) %>%
  mutate(pair = str_c(g,cluster)) %>%
  arrange(pair)

Solution

  • This is called non-bipartite matching. The nbpMatching package implements it.

    You can also do a greedy version of it yourself, which pairs the closest units first. Create an NxN distance matrix, find the closest pair, and record it. Then remove both units from the pool so they cannot appear in future pairs. Repeat until no valid pair remains; a spare unit in an odd-sized group stays unmatched, with pair left as NA. Here is some code that does this using the Mahalanobis distance.

    obs$pair <- NA
    
    #Create distance matrix of variables
    d <- MatchIt::mahalanobis_dist(~v1 + v2 + v3, data = obs)
    
    #Deny matches between units that belong to different groups
    d[!outer(obs$g, obs$g, "==")] <- Inf
    
    #Deny self-matching
    diag(d) <- Inf
    
    k <- 1
    repeat {
      min_pos <- arrayInd(which.min(d), dim(d))
      
      if (!is.finite(d[min_pos])) break
      
      obs$pair[drop(min_pos)] <- k
      d[drop(min_pos),] <- Inf
      d[,drop(min_pos)] <- Inf
      k <- k + 1
    }
    

    You can also do this more efficiently by splitting the problem into smaller pieces (i.e., within groups instead of denying cross-group matches).
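
As a rough sketch of that per-group variant, the base R code below builds one small distance matrix per group, pairs greedily inside it, and also records the within-pair distance as the diagnostic the question asks for. The helper names `greedy_pairs` and `maha_matrix` and the `pair_dist` column are made up for illustration. It assumes a single covariance matrix estimated from the full sample, and it takes the square root because `stats::mahalanobis()` returns squared distances.

```r
# Greedy pairing inside one group. `d` is that group's own distance matrix.
# Returns, per row, a pair number and the within-pair distance
# (both NA for a spare row).
greedy_pairs <- function(d) {
  n <- nrow(d)
  out <- data.frame(pair = rep(NA_integer_, n), pair_dist = rep(NA_real_, n))
  diag(d) <- Inf                      # no self-matching
  k <- 1L
  while (any(is.finite(d))) {
    pos <- drop(arrayInd(which.min(d), dim(d)))
    out$pair[pos] <- k
    out$pair_dist[pos] <- d[pos[1], pos[2]]
    d[pos, ] <- Inf                   # both units leave the pool
    d[, pos] <- Inf
    k <- k + 1L
  }
  out
}

# Toy data: group A has 5 rows (one spare), group B has 4
set.seed(1)
obs <- data.frame(id = 1:9, g = rep(c("A", "B"), length.out = 9),
                  v1 = rnorm(9), v2 = rnorm(9), v3 = rnorm(9))
vars <- c("v1", "v2", "v3")
S_inv <- solve(cov(obs[vars]))        # full-sample covariance (an assumption)

# Pairwise Mahalanobis distances among the rows of x
maha_matrix <- function(x) {
  x <- as.matrix(x)
  sq <- apply(x, 1, function(r) {
    mahalanobis(x, center = r, cov = S_inv, inverted = TRUE)
  })
  matrix(sqrt(sq), nrow = nrow(x))    # keeps a matrix even for a 1-row group
}

obs$pair <- NA_character_
obs$pair_dist <- NA_real_
for (grp in unique(obs$g)) {
  rows <- which(obs$g == grp)
  res <- greedy_pairs(maha_matrix(obs[rows, vars]))
  obs$pair[rows] <- ifelse(is.na(res$pair), NA, paste0(grp, res$pair))
  obs$pair_dist[rows] <- res$pair_dist
}

obs[order(obs$pair), ]
```

Each group only ever needs its own n_g x n_g matrix, so memory and search cost scale with the largest group rather than with the whole data set. One caveat: with 300 variables and fewer rows than variables, the covariance matrix is singular and `solve()` will fail; a generalized inverse such as `MASS::ginv()` is one possible workaround.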