I have a tibble with an id column, a g grouping variable, and 300 numeric variables.
I want a method that clusters the rows so that each row is paired with exactly one other row from the same group. In effect this is 1-nearest-neighbor (k = 1) matching. Spare rows in groups with an odd number of rows can be left out of the clusters.
So, if a group has 4 rows, there will be 2 clusters of 2. If it has 5 rows, there will be 2 clusters of 2 and one spare row.
I think I like the Mahalanobis distance for clustering but I am open to an alternative proposal.
A diagnostic variable holding the Mahalanobis distance within each cluster would help, too.
Technically speaking, MatchIt does something very similar, but it imposes a binary classification on the rows. I don't want to need such a classification.
Example:
library(tidyverse)

obs <- tibble(
  id = 1:8,
  g = rep(c("A", "B"), 4),
  v1 = rnorm(8),
  v2 = rnorm(8),
  v3 = rnorm(8)
)
Ideally, the output would look like this:
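# Illustration only: assign each row a random cluster label within its group
obs %>%
  mutate(cluster = sample(c(1:2), replace = FALSE) %>% rep(2),
         .by = g) %>%
  # Combine group and cluster into a single pair label, e.g. "A1"
  mutate(pair = str_c(g, cluster)) %>%
  arrange(pair)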
This is called non-bipartite matching. The nbpMatching package implements it.
You can also do a greedy version of it yourself, which pairs the closest units first. First, create an N x N distance matrix, then find the closest pair and record it. Exclude those two units from all future pairs. Repeat until no eligible pair remains (a leftover unit in an odd-sized group stays unmatched). Here is some code that does this using the Mahalanobis distance.
obs$pair <- NA

# Create distance matrix of variables
d <- MatchIt::mahalanobis_dist(~v1 + v2 + v3, data = obs)

# Deny matches between units that belong to different groups
d[!outer(obs$g, obs$g, "==")] <- Inf

# Deny self-matching
diag(d) <- Inf

k <- 1
repeat {
  # Row and column index of the smallest remaining distance
  min_pos <- arrayInd(which.min(d), dim(d))

  # Stop once no eligible pair is left
  if (!is.finite(d[min_pos])) break

  # Record the pair, then remove both units from further matching
  obs$pair[drop(min_pos)] <- k
  d[drop(min_pos), ] <- Inf
  d[, drop(min_pos)] <- Inf

  k <- k + 1
}
You can also do this more efficiently by splitting the problem into smaller pieces (i.e., within groups instead of denying cross-group matches).
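As a rough sketch of that per-group idea, the code below runs the same greedy loop on one small distance matrix per group. It also records the distance within each pair, which could serve as the diagnostic variable asked for. It uses base R only. The helper greedy_pairs() and the column pair_dist are names made up for this illustration. I am assuming the covariance is estimated from the full sample and is positive definite (more rows than variables), so the distances are not guaranteed to be identical to those returned by MatchIt::mahalanobis_dist().

```r
# Sketch only: greedy_pairs() and pair_dist are hypothetical names.
set.seed(1)
obs <- data.frame(
  id = 1:8,
  g  = rep(c("A", "B"), 4),
  v1 = rnorm(8), v2 = rnorm(8), v3 = rnorm(8)
)
vars <- c("v1", "v2", "v3")

# Whiten with the full-sample covariance so that Euclidean distance on Z
# equals Mahalanobis distance on the original variables.
# Assumption: cov(X) is positive definite, otherwise chol() fails.
X <- as.matrix(obs[vars])
Z <- X %*% solve(chol(cov(X)))

# Greedy pairing within a single group.
# Returns a pair index and the within-pair distance for each row.
greedy_pairs <- function(z) {
  n <- nrow(z)
  pair <- rep(NA_integer_, n)
  pair_dist <- rep(NA_real_, n)
  d <- as.matrix(dist(z))
  diag(d) <- Inf  # deny self-matching
  k <- 1L
  repeat {
    pos <- drop(arrayInd(which.min(d), dim(d)))
    if (!is.finite(d[pos[1], pos[2]])) break  # no eligible pair left
    pair[pos] <- k
    pair_dist[pos] <- d[pos[1], pos[2]]
    d[pos, ] <- Inf  # remove both units from further matching
    d[, pos] <- Inf
    k <- k + 1L
  }
  data.frame(pair = pair, pair_dist = pair_dist)
}

obs$pair <- NA_character_
obs$pair_dist <- NA_real_
for (grp in unique(obs$g)) {
  idx <- which(obs$g == grp)
  res <- greedy_pairs(Z[idx, , drop = FALSE])
  # Label pairs as group + index, e.g. "A1"; unmatched rows stay NA
  obs$pair[idx] <- ifelse(is.na(res$pair), NA_character_, paste0(grp, res$pair))
  obs$pair_dist[idx] <- res$pair_dist
}

obs[order(obs$pair), ]
```

Each group only needs its own n_g x n_g distance matrix, so the full N x N matrix is never built. A spare row in an odd-sized group keeps NA in both pair and pair_dist.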