r, statistics, spatial, hierarchical-clustering

How can I compute a spatial distance matrix based on control variables? (R)


I'm interested in computing a distance matrix with a custom distance function. This function should take into account spatial data and two control variables. This distance can be Euclidean. The details are as follows:

I have data on sellers and buyers. This spatial dataset contains cities, coordinates, purchased quantity, and two control variables. I want to apply hierarchical clustering to determine the "geographical markets", and for this I'd like to compute a distance matrix that takes into account the two control variables I've mentioned.

I've tried this, but I'm not sure whether the object W is correct.

library(geosphere) # distm()
library(dplyr)     # %>% and select()

# Sample data (the real data is private).
set.seed(123)
n <- 100
cities <- c("City1", "City2", "City3", "City4", "City5")
seller_city <- sample(cities, n, replace = TRUE)
buyer_city <- sample(cities, n, replace = TRUE)
seller_coords <- data.frame(lon = rnorm(n, -80, 1), lat = rnorm(n, 40, 1))
buyer_coords <- data.frame(lon = rnorm(n, -80, 1), lat = rnorm(n, 40, 1))
quantity <- rpois(n, 10)
var1 <- rnorm(n, 0, 1) #First control variable.
var2 <- rnorm(n, 0, 1) #Second control variable.
df <- data.frame(seller_city, buyer_city, seller_coords, buyer_coords, quantity, var1, var2)

# Compute distance matrix
city_dist <- distm(x = df[, c("lon", "lat")],
                   y = df[, c("lon.1", "lat.1")]) # Seller-to-buyer distances in metres.
city_dist <- (city_dist - mean(city_dist)) / sd(city_dist) # Normalising, because its units differ from the control variables.
var_dist <- as.matrix(dist(df %>% select(var1, var2)))
var_dist <- (var_dist - mean(var_dist)) / sd(var_dist) # Normalising, so both matrices are on the same scale.
W <- city_dist + var_dist # Sum up


# Perform hierarchical clustering
hc <- hclust(as.dist(W), 
             method = "ward.D2")

The idea is to compute the distance between cities i and j with the following formula: [formula image not preserved]

where x is the longitude, y is the latitude, v1 is the control variable 1, and v2 is the control variable 2.
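
Given the variable definitions above and the earlier remark that the distance can be Euclidean, one plausible reading of the intended distance is sketched below. This is an assumption, not the asker's original formula, and the index notation (i, j as subscripts on each variable) is chosen here for illustration only.

```latex
d_{ij} = \sqrt{(x_i - x_j)^2 + (y_i - y_j)^2 + (v_{1,i} - v_{1,j})^2 + (v_{2,i} - v_{2,j})^2}
```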


Solution

  • You could use the package usedist with its function dist_make to provide a custom distance function.

    In your example, you could use it like this:

    library(usedist)
    
    # ...
    
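    # v1 and v2 are two rows of df; each term compares the same column in both rows.
    # The "+" must end each line: a line that starts with "+" is parsed as a
    # separate expression, so only the last term would be returned.
    # This is the squared Euclidean distance; wrap the sum in sqrt() for the distance itself.
    distance_function <- function(v1, v2) {
      (v1[["lon"]] - v2[["lon"]])^2 +
        (v1[["lat"]] - v2[["lat"]])^2 +
        (v1[["var1"]] - v2[["var1"]])^2 +
        (v1[["var2"]] - v2[["var2"]])^2
    }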
    
    # Collect the data points in one dataframe
    df <- data.frame(seller_city, buyer_city, seller_coords, buyer_coords, quantity, var1, var2)
    
    # Calculate the distance matrix
    city_dist <- dist_make(df, distance_function)
    
    # Apply hierarchical clustering
    hc <- hclust(as.dist(city_dist), method = "ward.D2")
    

    With this approach you can use any distance function you like. However, the function you chose looks very similar to a standard Euclidean distance, so make sure to check that the indices are correct.
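
If the distance really is plain Euclidean over the coordinates and the two control variables, base R's dist() may be enough and no custom function is needed. The following is a minimal sketch under that assumption. The data frame `pts`, the choice to standardise every column with scale(), and `k = 3` are all hypothetical choices made for illustration, not part of the question or the answer above.

```r
# Hypothetical sample data with the same kinds of columns as in the question
set.seed(123)
n <- 20
pts <- data.frame(
  lon  = rnorm(n, -80, 1),
  lat  = rnorm(n, 40, 1),
  var1 = rnorm(n),
  var2 = rnorm(n)
)

# Standardise each column so coordinates and controls are on a comparable scale
# (an assumption; other weightings are possible)
pts_scaled <- scale(pts)

# Built-in Euclidean distance over all four columns
d_builtin <- dist(pts_scaled, method = "euclidean")

# Sanity check: recompute the distance between rows 1 and 2 by hand
d_manual_12 <- sqrt(sum((pts_scaled[1, ] - pts_scaled[2, ])^2))
stopifnot(isTRUE(all.equal(as.matrix(d_builtin)[1, 2], d_manual_12)))

# Hierarchical clustering on the resulting dist object
hc <- hclust(d_builtin, method = "ward.D2")
markets <- cutree(hc, k = 3) # k = 3 is arbitrary, for illustration only
```

Standardising the columns first plays the same role as the normalisation step in the question: without it, whichever variable has the largest spread dominates the distance. A custom function passed to dist_make is still the way to go when the terms need different weights or a non-Euclidean form.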