I have a dataframe that looks something like this:
transformed_centroids = model2.fit_transform(everything)
df = pd.DataFrame()
df["y"] = model.labels_
df["comp-1"] = transformed_centroids[-true_k:, 0]
df["comp-2"] = transformed_centroids[-true_k:, 1]
The 'y' column holds the k-means labels I want to color by, and "comp-1" and "comp-2" are the results from the TSNE model. When I try to plot like this:
sns.scatterplot(transformed_centroids[:-true_k, 0], transformed_centroids[:-true_k, 1], marker='x')
sns.scatterplot(df['comp-1'], df['comp-2'], marker='o', hue=df['y'])
plt.show()
It gives me this error:
ValueError: Length of values (2) does not match length of index (35104) (from this line: df["comp-1"] = transformed_centroids[-true_k:, 0])
This happens even if I do this:
sns.scatterplot(transformed_centroids[:-true_k, 0], transformed_centroids[:-true_k, 1], marker='x')
sns.scatterplot(df['comp-1'], df['comp-2'], marker='o', hue=df.y.astype('category').cat.codes)
plt.show()
I've tried several other pieces of code scattered around random tutorials and here, but I haven't found a solution. I feel silly having successfully completed the clustering but failing on the colors.
EDIT: I realized I was using the wrong plot points. The updated code and error are below:
df["y"] = model.labels_
df["comp-1"] = transformed_centroids[:, 0]
df["comp-2"] = transformed_centroids[:, 1]
ValueError: Length of values (35106) does not match length of index (35104)
I'm not sure where the two dropped data-points are being... dropped.
EDIT2: Here is the TSNE code:
centroids = model.cluster_centers_
tweets_df2['labels'] = model.labels_
everything = np.concatenate((X.todense(), centroids))
tsne_init = 'pca' # could also be 'random'
tsne_perplexity = 20.0
tsne_early_exaggeration = 4.0
tsne_learning_rate = 1000
model2 = TSNE(n_components=2, random_state=0, init=tsne_init, perplexity=tsne_perplexity,
early_exaggeration=tsne_early_exaggeration, learning_rate=tsne_learning_rate)
transformed_centroids = model2.fit_transform(everything)
df = pd.DataFrame()
I took this code from another Stack Overflow post and fit it to my data, so I can't explain it 100%. I just know I needed to use TSNE to make my data points plottable in 2D, since the data was words vectorized using TF-IDF.
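With help from @tdy, I realized one of the solutions I tried a little while ago was the one I needed. My main problem was in my edit 2: I wasn't graphing the right set of data. Because everything is X with the centroids concatenated onto the end, transformed_centroids has true_k more rows than model.labels_ (35106 vs. 35104), and those extra rows are the centroids. I changed the df to this: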
df["y"] = model.labels_
df["comp-1"] = transformed_centroids[:-2, 0]
df["comp-2"] = transformed_centroids[:-2, 1]
Of course, for my 2-cluster code this is the same as:
df["y"] = model.labels_
df["comp-1"] = transformed_centroids[:-true_k, 0]
df["comp-2"] = transformed_centroids[:-true_k, 1]
where true_k is the variable representing how many k-means clusters I have. I had this solution but changed it because I thought getting rid of the true_k would solve my 2-variable problem, and I never reverted it. I just needed to do this with the right transformed_centroids[] slice, and everything should run smoothly in 7 minutes when it's done melting my CPU... :)
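For anyone hitting the same length mismatch, here is a minimal sketch of the slicing idea using small stand-in arrays instead of the real TSNE output. The helper name split_embedding and the toy data are my own assumptions for illustration, not part of the original code; the only thing it relies on is that the centroid rows were appended last, as in np.concatenate((X.todense(), centroids)).

```python
import numpy as np

def split_embedding(embedded, n_centroids):
    """Split a stacked 2D embedding into (data points, centroids).

    Hypothetical helper: assumes the centroid rows were appended
    after the data rows before the embedding was computed.
    """
    if n_centroids <= 0:
        # embedded[:-0] would be empty, so handle this case explicitly.
        return embedded, embedded[:0]
    return embedded[:-n_centroids], embedded[-n_centroids:]

# Stand-in data: 6 "tweets" plus true_k = 2 centroid rows, already 2D.
true_k = 2
labels = np.array([0, 1, 0, 1, 1, 0])                 # plays the role of model.labels_
embedded = np.arange(16, dtype=float).reshape(8, 2)   # plays the role of transformed_centroids

points, centers = split_embedding(embedded, true_k)

# The data rows now line up one-to-one with the labels.
print(points.shape, centers.shape, len(labels))
```

With the real data, points[:, 0] and points[:, 1] would be the values to assign to df["comp-1"] and df["comp-2"], since they have the same length as model.labels_, while centers holds the rows to plot separately as the cluster centers.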