I'm trying to cluster similar patterns in a list using Affinity Propagation. self_pat is a list containing 80K patterns that need to be clustered. I'm using the following code:
import numpy as np
from sklearn.cluster import AffinityPropagation

self_pat = np.asarray(self_pat)  # so that indexing with a list will work
lev_similarity = -1 * np.array([[calculate_levenshtein_distance(w1, w2) for w1 in self_pat] for w2 in self_pat])
affprop = AffinityPropagation(affinity="precomputed", damping=0.5)
affprop.fit(lev_similarity)
for cluster_id in np.unique(affprop.labels_):
    exemplar = self_pat[affprop.cluster_centers_indices_[cluster_id]]
    cluster = np.unique(self_pat[np.nonzero(affprop.labels_ == cluster_id)])
    cluster_str = ", ".join(cluster)
    print(" - *%s:* %s" % (exemplar, cluster_str))
The calculate_levenshtein_distance function is as follows:
from difflib import ndiff

def calculate_levenshtein_distance(str_1, str_2):
    """
    The Levenshtein distance is a string metric for measuring the difference between two sequences.
    It is calculated as the minimum number of single-character edits necessary to transform one
    string into the other.
    """
    distance = 0
    buffer_removed = buffer_added = 0
    for x in ndiff(str_1, str_2):
        code = x[0]
        # Code '?' is ignored, as it does not translate to any modification
        if code == ' ':
            distance += max(buffer_removed, buffer_added)
            buffer_removed = buffer_added = 0
        elif code == '-':
            buffer_removed += 1
        elif code == '+':
            buffer_added += 1
    # Flush whatever is left in the buffers after the final diff line
    distance += max(buffer_removed, buffer_added)
    return distance
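As a quick sanity check (not part of the original code), the classic pair "kitten"/"sitting" should give a distance of 3, i.e. two substitutions (k to s, e to i) and one insertion (g):

print(calculate_levenshtein_distance("kitten", "sitting"))  # prints 3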
The above program uses three nested loops (two in the list comprehension over all pairs, plus the character-level diff inside calculate_levenshtein_distance), so building the similarity matrix dominates the runtime and the clustering takes a very long time. Is there any way I can reduce the complexity of the program?
For smaller datasets, the time to completion is usually fine; for very large datasets, the time it takes to complete the job is basically intolerable. Clustering doesn't scale well, as you are finding out: with 80K patterns you are computing 6.4 billion pairwise distances, and the dense 80K x 80K float64 similarity matrix alone needs roughly 51 GB of memory. Maybe you can just take a random sample from your full dataset.
# Fraction of rows: here you get 25% of the rows
# (assuming your patterns live in a pandas DataFrame df)
df.sample(frac=0.25)
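Since your self_pat is a plain list/NumPy array rather than a DataFrame, here is a minimal sketch of the same sampling idea with NumPy; the sample_size of 5000 and the seed are arbitrary choices, not values from your code:

import numpy as np
from sklearn.cluster import AffinityPropagation

rng = np.random.default_rng(42)  # fixed seed so the sample is reproducible
sample_size = 5000               # arbitrary; pick what your memory/time budget allows
sample = rng.choice(self_pat, size=sample_size, replace=False)

# Cluster the sample instead of the full 80K list; the similarity
# matrix is now sample_size x sample_size instead of 80K x 80K.
lev_similarity = -1 * np.array(
    [[calculate_levenshtein_distance(w1, w2) for w1 in sample] for w2 in sample]
)
affprop = AffinityPropagation(affinity="precomputed", damping=0.5)
affprop.fit(lev_similarity)

If you need labels for the whole dataset, you can afterwards assign each remaining pattern to its nearest exemplar, which costs only one distance computation per exemplar per pattern rather than a full 80K x 80K matrix.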