python pandas scikit-learn seaborn tsne

How do I color clusters after k-means and TSNE in either seaborn or matplotlib?


I have a dataframe that looks something like this:

transformed_centroids = model2.fit_transform(everything)
df = pd.DataFrame()
df["y"] = model.labels_
df["comp-1"] = transformed_centroids[-true_k:, 0]
df["comp-2"] = transformed_centroids[-true_k:, 1]

The 'y' are the k-means labels I want to color by, and "comp-1" and "comp-2" are the results from the TSNE model. When I try to plot like this:

sns.scatterplot(transformed_centroids[:-true_k, 0], transformed_centroids[:-true_k, 1], marker='x')
sns.scatterplot(df['comp-1'], df['comp-2'], marker='o', hue=df['y'])
plt.show()

It gives me this error:

ValueError: Length of values (2) does not match length of index (35104) (from this line: df["comp-1"] = transformed_centroids[-true_k:, 0])

This happens even if I do this:

sns.scatterplot(transformed_centroids[:-true_k, 0], transformed_centroids[:-true_k, 1], marker='x')
sns.scatterplot(df['comp-1'], df['comp-2'], marker='o', hue=df.y.astype('category').cat.codes)
plt.show()

I've tried several other pieces of code scattered around random tutorials and here, but I haven't found a solution. I feel silly having successfully completed the clustering but failing on the colors.

EDIT: I realized I was using the wrong plot points. The updated code and error are below:

df["y"] = model.labels_
df["comp-1"] = transformed_centroids[:, 0]
df["comp-2"] = transformed_centroids[:, 1]

ValueError: Length of values (35106) does not match length of index (35104)

I'm not sure where the two dropped data-points are being... dropped.

EDIT2: Here is the TSNE code:

centroids = model.cluster_centers_
tweets_df2['labels'] = model.labels_
everything = np.concatenate((X.todense(), centroids))

tsne_init = 'pca'  # could also be 'random'
tsne_perplexity = 20.0
tsne_early_exaggeration = 4.0
tsne_learning_rate = 1000
model2 = TSNE(n_components=2, random_state=0, init=tsne_init, perplexity=tsne_perplexity,
              early_exaggeration=tsne_early_exaggeration, learning_rate=tsne_learning_rate)
transformed_centroids = model2.fit_transform(everything)
df = pd.DataFrame()

I took this code from another Stack Overflow post and adapted it to my data, so I can't explain it 100%. I just know I needed TSNE to reduce my data points to two plottable dimensions, since the data is words vectorized using TF-IDF.
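The length mismatch in the edits above follows from how `everything` is built: the centroids are appended after the data rows, so the t-SNE output has `true_k` more rows than `model.labels_`. Here is a minimal sketch of that row arithmetic, using small random arrays as hypothetical stand-ins for `X.todense()`, `model.cluster_centers_`, and the t-SNE output (the names `X_dense`, `embedded`, `points_2d`, and `centroids_2d` are made up for illustration):

```python
import numpy as np

# Hypothetical small stand-ins for the real data (35104 documents, 2 clusters).
n_samples, n_features, true_k = 10, 5, 2
rng = np.random.default_rng(0)
X_dense = rng.random((n_samples, n_features))   # stands in for X.todense()
centroids = rng.random((true_k, n_features))    # stands in for model.cluster_centers_

# Same construction as in the question: data rows first, centroid rows last.
everything = np.concatenate((X_dense, centroids))
print(everything.shape)  # (12, 5): n_samples + true_k rows

# TSNE.fit_transform returns one output row per input row, so the embedding
# also has n_samples + true_k rows. A plain column slice is used here as a
# placeholder for the real 2-D t-SNE output.
embedded = everything[:, :2]

points_2d = embedded[:-true_k]     # all but the last true_k rows: the data points
centroids_2d = embedded[-true_k:]  # the last true_k rows: the centroids
```

With the real data this gives 35104 + 2 = 35106 rows, which matches the second error message. Nothing is dropped; the two extra rows are the centroids.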


Solution

  • With help from @tdy, I realized that one of the solutions I had tried a little while ago was the one I needed. My main problem was in my edit: I wasn't graphing the right set of data. I changed the df to this:

    df["y"] = model.labels_
    df["comp-1"] = transformed_centroids[:-2, 0]
    df["comp-2"] = transformed_centroids[:-2, 1]
    

    Of course, for my 2-cluster code this is the same as:

    df["y"] = model.labels_
    df["comp-1"] = transformed_centroids[:-true_k, 0]
    df["comp-2"] = transformed_centroids[:-true_k, 1]
    

    where true_k is the variable representing how many k-means clusters I have. I had this solution but changed it because I thought getting rid of the true_k would solve my 2-variable problem and I never reverted it. I just needed to do this with the right transformed_centroids[] slice and everything should run smoothly in 7 minutes when it's done melting my CPU... :)
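Putting the pieces together, the coloring step could look something like the sketch below. It uses plain matplotlib and synthetic data, so the arrays `labels` and `embedded` are assumptions standing in for `model.labels_` and `transformed_centroids`, and `split_embedding` is a hypothetical helper, not something from scikit-learn or seaborn.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove this line to open a window
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd


def split_embedding(embedded, labels, true_k):
    """Split a t-SNE output whose last true_k rows are the centroids."""
    points = embedded[:-true_k]    # one row per data point
    centers = embedded[-true_k:]   # one row per k-means centroid
    df = pd.DataFrame(
        {"y": labels, "comp-1": points[:, 0], "comp-2": points[:, 1]}
    )
    return df, centers


# Synthetic stand-ins (assumptions): 50 points, 2 clusters.
rng = np.random.default_rng(0)
true_k = 2
labels = rng.integers(0, true_k, size=50)      # stands in for model.labels_
embedded = rng.normal(size=(50 + true_k, 2))   # stands in for transformed_centroids

df, centers = split_embedding(embedded, labels, true_k)

fig, ax = plt.subplots()
# Color each point by its k-means label.
sc = ax.scatter(df["comp-1"], df["comp-2"], c=df["y"], cmap="tab10", marker="o", s=15)
# Draw the centroids on top as black crosses.
ax.scatter(centers[:, 0], centers[:, 1], c="black", marker="x", s=80)
ax.legend(*sc.legend_elements(), title="cluster")
# Call plt.show() or fig.savefig(...) here to view the result.
```

The seaborn equivalent of the colored scatter should be `sns.scatterplot(data=df, x="comp-1", y="comp-2", hue="y")`. Passing `x` and `y` as keywords avoids problems in seaborn versions that no longer accept them positionally.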