python, python-3.x, pattern-matching, cluster-analysis, affinity

How can I improve the time complexity of this Affinity Propagation problem?


I'm trying to cluster similar patterns in a list using the Affinity Propagation clustering method. self_pat is a list containing 80K patterns that need to be clustered. I'm using the following code:

import numpy as np
from sklearn.cluster import AffinityPropagation

self_pat = np.asarray(self_pat)  # so that indexing with an array will work
lev_similarity = -1 * np.array(
    [[calculate_levenshtein_distance(w1, w2) for w1 in self_pat] for w2 in self_pat]
)

affprop = AffinityPropagation(affinity="precomputed", damping=0.5)
affprop.fit(lev_similarity)

for cluster_id in np.unique(affprop.labels_):
    exemplar = self_pat[affprop.cluster_centers_indices_[cluster_id]]
    cluster = np.unique(self_pat[np.nonzero(affprop.labels_ == cluster_id)])
    cluster_str = ", ".join(cluster)
    print(" - *%s:* %s" % (exemplar, cluster_str))

The calculate_levenshtein_distance function is as follows:

from difflib import ndiff

def calculate_levenshtein_distance(str_1, str_2):
    """
    The Levenshtein distance is a string metric for measuring the difference between two sequences.
    It is calculated as the minimum number of single-character edits necessary to transform one string into another.
    """
    distance = 0
    buffer_removed = buffer_added = 0
    for x in ndiff(str_1, str_2):
        code = x[0]
        # Lines starting with '?' are ignored, as they do not translate to any modification
        if code == ' ':
            distance += max(buffer_removed, buffer_added)
            buffer_removed = buffer_added = 0
        elif code == '-':
            buffer_removed += 1
        elif code == '+':
            buffer_added += 1
    distance += max(buffer_removed, buffer_added)
    return distance
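
For reference, a quick sanity check of the helper on a classic pair; the expected value of 3 assumes ndiff aligns the strings in the usual way:

print(calculate_levenshtein_distance("kitten", "sitting"))  # expected output: 3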

The above program uses three loops (two to build the pairwise similarity matrix and one over the resulting clusters), so constructing the 80K × 80K matrix alone requires roughly 6.4 billion Levenshtein computations, and the clustering step then has to work on a matrix of that size. Is there any way I can reduce the complexity of the program?


Solution

  • For smaller datasets, the time to completion is usually fine; for very large datasets, the time it takes to complete the job becomes intolerable. Clustering doesn't scale well, as you are finding out: Affinity Propagation needs the full n × n similarity matrix, which for 80K patterns means about 6.4 billion entries. Maybe you can just take a random sample from your full dataset and cluster that instead.

    # Take a random fraction of the rows (assuming the patterns are held in a pandas DataFrame df)
    # here frac=0.25 keeps 25% of the rows
    df.sample(frac=0.25)
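
    Since self_pat in the question is a NumPy array rather than a DataFrame, here is a rough sketch of the same idea with NumPy; the sample_size of 20,000 is only an illustrative assumption, tune it to what your machine can handle:

    import numpy as np
    from sklearn.cluster import AffinityPropagation

    # sample_size is an illustrative assumption -- Affinity Propagation needs the
    # full n x n similarity matrix in memory, so keep the sample small enough
    sample_size = 20000

    rng = np.random.default_rng(42)  # fixed seed for reproducibility
    self_pat = np.asarray(self_pat)

    # draw a random subset of the patterns without replacement
    sample_idx = rng.choice(len(self_pat), size=sample_size, replace=False)
    self_pat_sample = self_pat[sample_idx]

    # build the (much smaller) similarity matrix and cluster only the sample
    lev_similarity = -1 * np.array(
        [[calculate_levenshtein_distance(w1, w2) for w1 in self_pat_sample]
         for w2 in self_pat_sample]
    )
    affprop = AffinityPropagation(affinity="precomputed", damping=0.5)
    affprop.fit(lev_similarity)

    Note that even a 20K sample still means a 20K × 20K matrix (400 million entries), so you may need to sample smaller still depending on the memory you have available.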