Tags: python, cluster-analysis, hdbscan

HDBSCAN handling of large datasets


I am trying to cluster a large dataset of 146,000 observations using the HDBSCAN algorithm. Clustering the entire dataset with the (default) Minkowski/Euclidean distance measure works fine and takes only 8 seconds. However, I want to use my own metric. That works on a subset of the data, although a lot slower, but on the full dataset it immediately fails with a memory error. This makes sense, given that for a dataset of this size the pairwise distance matrix would take up around 150 GB.

What I don't understand is why there is no such issue with the default metric: looking at the HDBSCAN source code shows that sklearn's pairwise_distances is called in that case as well, which returns the entire matrix. I am also wondering whether there is a workaround for my metric, or whether the only solution is to have access to roughly 150 GB of RAM.
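For reference, a quick back-of-the-envelope check of that figure (assuming a dense float64 matrix, i.e. 8 bytes per entry):

import numpy as np

n = 146_000
matrix_bytes = n * n * np.dtype(np.float64).itemsize  # 8 bytes per entry
print(f"{matrix_bytes / 2**30:.0f} GiB")  # ~159 GiB for the full pairwise matrix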

The code for my metric and some results:

import hdbscan
import pandas as pd
import time
import numpy as np
from numpy.linalg import norm


def spex_distance(a, b):
    # Euclidean distance over the first two components
    euclidean = norm(a[:2] - b[:2])
    # cosine similarity over the remaining components
    exp_vec_a, exp_vec_b = a[2:], b[2:]
    cos_sim = np.dot(exp_vec_a, exp_vec_b) / (norm(exp_vec_a) * norm(exp_vec_b))
    # scale the Euclidean part by the cosine similarity; dissimilar
    # directions (cos_sim <= 0) count as infinitely far apart
    if cos_sim > 0:
        return euclidean / cos_sim
    else:
        return np.inf


def main():
    data = pd.read_pickle(file_location)  # file_location: path to the pickled dataset
    small_data = data[:1000]  # 1,000-observation subset

    t0 = time.time()
    hdb_custom = hdbscan.HDBSCAN(metric=spex_distance)
    hdb_custom.fit(small_data)
    print(f"Time needed for clustering subset with custom metric: {time.time()-t0}") # 10 sec

    t0 = time.time()
    hdb_default = hdbscan.HDBSCAN()
    hdb_default.fit(small_data)
    print(f"Time needed for clustering subset with default metric: {time.time()-t0}") # 0.03 sec

    t0 = time.time()
    hdb_default.fit(data)
    print(f"Time needed for clustering full dataset with default metric: {time.time()-t0}") # 9 sec

    hdb_custom.fit(data)  # fails with a MemoryError


if __name__ == "__main__":
    main()

Solution

  • Metrics that are supported by sklearn's KDTree or BallTree can make use of those data structures to be both more memory efficient and much faster than computing the full distance matrix (assuming the data is low dimensional enough for KD trees or ball trees to be efficient). A custom metric requires an alternative code path, so that is much harder.
  • In principle it is possible to use Prim's algorithm for the MST construction phase so that distances are computed on an "as needed" basis, requiring far less memory than the full pairwise distance matrix -- this is, in fact, how the code works for standard distance functions when the data is too high dimensional for tree based approaches to be applicable. It is slower (scaling as O(N^2) instead of O(N log N)), but it will work; a minimal sketch of the idea follows after this list.
  • Unfortunately the current code requires the distance metric to be accessible via C/Cython for efficiency purposes, which rules out custom Python metrics. It would be possible to write your distance function in C or Cython, include it in the source files, and compile a custom version of hdbscan that includes your metric. That would allow the lower memory approach to work, but it may be more work than you are willing to do.
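To make the "as needed" approach concrete, here is a minimal pure-Python sketch of Prim's algorithm that builds an MST while holding only O(N) distances at any time, at the cost of roughly N^2/2 metric evaluations. It is illustrative only: the actual hdbscan implementation does this in Cython and operates on the mutual reachability distance (which folds in core distances), a detail omitted here, and a Python-level double loop like this would be far too slow at N = 146,000.

import numpy as np

def prim_mst_on_demand(data, metric):
    """Build an MST computing distances on demand (O(N) memory, O(N^2) time)."""
    n = len(data)
    in_tree = np.zeros(n, dtype=bool)
    dist_to_tree = np.full(n, np.inf)  # best known distance to the growing tree
    parent = np.zeros(n, dtype=int)    # tree node realising that distance
    edges = []
    current = 0
    in_tree[current] = True
    for _ in range(n - 1):
        # relax distances through the newly added node; distances are
        # computed one at a time and never stored as a full matrix
        for j in range(n):
            if not in_tree[j]:
                d = metric(data[current], data[j])
                if d < dist_to_tree[j]:
                    dist_to_tree[j] = d
                    parent[j] = current
        # attach the closest remaining point to the tree
        nxt = int(np.argmin(np.where(in_tree, np.inf, dist_to_tree)))
        edges.append((parent[nxt], nxt, dist_to_tree[nxt]))
        in_tree[nxt] = True
        current = nxt
    return edges

# e.g. prim_mst_on_demand(data.to_numpy(), spex_distance) never materialises
# the 146,000 x 146,000 matrix, trading memory for ~N^2/2 metric calls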