python cluster-analysis silhouette

Python: how to use my own dataset with the k-means algorithm


I want to use my own data (sentences stored in a .txt file) with this example, which can be found at: https://scikit-learn.org/stable/auto_examples/cluster/plot_kmeans_silhouette_analysis.html

The problem is that the code uses a make_blobs dataset, and I have a hard time understanding how to replace it with data from a .txt file.

My guess is that I need to replace this piece of code:

X, y = make_blobs(n_samples=500,
          n_features=2,
          centers=4,
          cluster_std=1,
          center_box=(-10.0, 10.0),
          shuffle=True,
          random_state=1)  # For reproducibility

Also, I do not understand the variables X and y. I assume that X is an array of data, but what about y?

Should I just assign everything to X like this, and would the example code then work? And what about the make_blobs parameters such as centers, n_features, etc.? Do I need to specify them differently somehow?

# open the txt file and read one sentence per line
path = "C:/Users/user/Desktop/sentences.txt"
with open(path, 'r') as file:
    # assign the lines to X
    X = file.readlines()

Any help is appreciated!


Solution

  • First, you need to map your words to numbers that the k-means algorithm can use.

    For example:

    I ride a bike and I like it.
    1   2  3  4    5  1  6   7  # <- number ids
    

    After that you have a numeric encoding of your dataset and can apply k-means. If you want every sample to have the same length, convert the ids to a one-hot representation: for each word, create an array of length N, where N is the total number of unique words, with a 1 at the position given by the word's id and 0 everywhere else. A whole sentence can then be represented, for example, by summing the one-hot vectors of its words.

    For the example above, with N = 7, this gives:

    1 -> 1000000
    2 -> 0100000
    ...
    

    Now you have an X variable containing your data in a suitable format. You don't need y, which holds the true labels of the samples; k-means does not use labels.

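    from sklearn.cluster import KMeans
    from sklearn.metrics import silhouette_score

    # n_clusters is the number of clusters to try, as in the linked example
    clusterer = KMeans(n_clusters=n_clusters, random_state=10)
    cluster_labels = clusterer.fit_predict(X)
    silhouette_avg = silhouette_score(X, cluster_labels)
    ...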
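
The steps above can be sketched end to end as follows. This is only a minimal sketch under some assumptions: it uses scikit-learn's CountVectorizer to build the word-to-index mapping and the per-sentence count vectors instead of building them by hand, the sentences list is a hypothetical stand-in for the contents of sentences.txt, load_sentences is an illustrative helper (not part of scikit-learn), and n_clusters=2 is an arbitrary choice.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import silhouette_score


def load_sentences(path):
    """Illustrative helper: read one sentence per line, skipping blank lines."""
    with open(path, "r", encoding="utf-8") as f:
        return [line.strip() for line in f if line.strip()]


# Hypothetical stand-in for the contents of sentences.txt.
# With a real file: sentences = load_sentences("sentences.txt")
sentences = [
    "I ride a bike and I like it",
    "I ride a bike to work",
    "she likes to ride her bike",
    "the stock market fell today",
    "the market closed lower today",
    "stock prices fell on the market",
]

# Each unique word gets a column index; each row counts the words of one
# sentence. Note: the default tokenizer drops one-letter words ("I", "a").
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(sentences).toarray()  # shape: (n_sentences, n_unique_words)

n_clusters = 2  # assumed value; the linked example loops over several candidates
clusterer = KMeans(n_clusters=n_clusters, n_init=10, random_state=10)
cluster_labels = clusterer.fit_predict(X)
silhouette_avg = silhouette_score(X, cluster_labels)

print("word -> column:", vectorizer.vocabulary_)
print("labels:", cluster_labels)
print("average silhouette:", silhouette_avg)
```

The make_blobs arguments such as centers, n_features and cluster_std only describe how the synthetic data is generated, so nothing has to replace them when X comes from your own file. The scatter plot in the linked example draws the first two columns of X, which is unlikely to be meaningful for word-count vectors, so that part would probably need adapting.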