I am trying to reduce the time needed to compute several clusterings of the same data set, each with a different number of clusters, using sklearn's AgglomerativeClustering.
As indicated in "sklearn agglomerative clustering: dynamically updating the number of clusters", it is possible to cache the entire tree computed by AgglomerativeClustering. You can then change the n_clusters parameter of the clustering object and simply extract the result for the same data set with the new number of clusters.
I am sorry if this is a trivial question, but I have very little experience with managing memory in Python. My question is how to specify the cache directory used by AgglomerativeClustering. In the example in the link above, it is written as:
AgglomerativeClustering(n_clusters=10, memory='mycachedir', compute_full_tree=True)
What is 'mycachedir' exactly? Do I need to replace it with my own cache directory, or does Python create a new directory somewhere called 'mycachedir'? If so, is it removed when my program ends? I would like the cache to be removed once my program stops or ends. Again, I am sorry if this is obvious.
I tried to run it with the string "mycachedir" and Python does not raise an error. So where is this directory located? And how does it behave? E.g., is it removed once the program ends?
According to scikit-learn documentation, "if a string is given, it is the path to the caching directory."
As a matter of fact, caching is performed with the joblib.Memory class of the joblib package. The directory is created by os.makedirs(os.path.expanduser(memory)), where memory is the AgglomerativeClustering input argument. Although the cache can be deleted with joblib.Memory.clear, to the best of my knowledge AgglomerativeClustering.fit does not do this for you.
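To make the clearing step concrete, here is a minimal sketch, assuming you pass a joblib.Memory object (which the memory parameter also accepts) instead of a string. The cache location created with tempfile.mkdtemp and the names cache_dir and labels_by_k are illustrative choices, not part of the original example.

```python
import shutil
import tempfile

import numpy as np
from joblib import Memory
from sklearn.cluster import AgglomerativeClustering

X = np.array([[1, 2], [1, 4], [1, 0],
              [4, 2], [4, 4], [4, 0]])

# Illustrative cache location: a fresh temporary directory.
cache_dir = tempfile.mkdtemp()
memory = Memory(location=cache_dir, verbose=0)

labels_by_k = {}
for k in (2, 3):
    # compute_full_tree=True lets the cached tree be reused
    # when only n_clusters changes between fits.
    model = AgglomerativeClustering(
        n_clusters=k, memory=memory, compute_full_tree=True
    )
    labels_by_k[k] = model.fit_predict(X)

# fit() does not clear the cache, so do it explicitly ...
memory.clear(warn=False)
# ... and remove the (now empty) directory itself as well.
shutil.rmtree(cache_dir, ignore_errors=True)
```

Holding on to the Memory object is what makes the explicit clear() call possible; with a plain string you would have to delete the directory yourself.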
Using the sklearn.cluster.AgglomerativeClustering example:
import os
# EXTERNALS
from sklearn.cluster import AgglomerativeClustering
import numpy as np
X = np.array([[1, 2], [1, 4], [1, 0],
              [4, 2], [4, 4], [4, 0]])
memory_dir = "~/tmp/my_cached_memory_folder"
# "~" is expanded to your home directory; a relative path such as
# "mycachedir" is resolved against your working directory
# (cf. `os.getcwd()`)
clustering = AgglomerativeClustering(memory=memory_dir).fit(X)
full_path = os.path.abspath(os.path.expanduser(memory_dir))
print(f"Cached memory directory: {full_path}")
print(os.path.isdir(full_path))
# Cached memory directory: /home/remi_cuingnet/tmp/my_cached_memory_folder
# True
Note that the cache directory is not removed when the program ends; you have to clear it manually.
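If you want the cache to disappear automatically, one possible approach (a sketch, not something AgglomerativeClustering does for you) is to point memory at a directory created with the standard library's tempfile.TemporaryDirectory, which is deleted when the with block exits. The names cache_dir, existed_inside, and removed_after are only illustrative.

```python
import os
import tempfile

import numpy as np
from sklearn.cluster import AgglomerativeClustering

X = np.array([[1, 2], [1, 4], [1, 0],
              [4, 2], [4, 4], [4, 0]])

with tempfile.TemporaryDirectory() as cache_dir:
    for k in (2, 3):
        # The string path is handed to joblib, which caches the tree there.
        model = AgglomerativeClustering(
            n_clusters=k, memory=cache_dir, compute_full_tree=True
        )
        labels = model.fit_predict(X)
        print(k, labels)
    existed_inside = os.path.isdir(cache_dir)

# Leaving the with block removes the directory and everything cached in it.
removed_after = not os.path.isdir(cache_dir)
print(existed_inside, removed_after)
```

All fits that should share the cached tree need to happen inside the with block, since the cache is gone as soon as the block is left.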