python  seaborn  cluster-analysis  dbscan  tsne

Calculating the cluster size in t-SNE


I've been running t-SNE on my data and clustering the embedding with DBSCAN. I then assign the t-SNE coordinates to the original dataframe and plot them with a seaborn scatterplot. This is the code:

import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.manifold import TSNE

# Embed the data with t-SNE
tsne_em = TSNE(n_components=3, perplexity=50.0, n_iter=1000, verbose=1).fit_transform(df_tsne)

from bioinfokit.visuz import cluster
cluster.tsneplot(score=tsne_em)

# Cluster the embedding with DBSCAN
from sklearn.cluster import DBSCAN
get_clusters = DBSCAN(eps=4, min_samples=10).fit_predict(tsne_em)

# Attach the first two t-SNE components to the original dataframe
filter_df['x'] = tsne_em[:, 0]
filter_df['y'] = tsne_em[:, 1]

g = sns.scatterplot(x='x', y='y', hue='Species', style='Gender', data=filter_df)
g.legend(loc='center left', bbox_to_anchor=(1, 0.5))
plt.savefig('Seaborn-MF-Species-TSNE-EPS4.png', dpi=600, bbox_inches='tight')

This is how the image appears:

[Image: seaborn scatterplot of the t-SNE coordinates, colored by Species and styled by Gender]

I have seen that people calculate the size of each cluster (number of cells, percentages, etc.) and do other post-analysis, but I haven't found any code for it. Does anybody know how I can, for example, circle the exact clusters and show the number of cells in them? I have several of these graphs, and it would really help make the results in them easier to understand.


Solution

  • If you want the cluster sizes, you only need to tabulate the labels returned by DBSCAN. For example, with this dataset:

    import pandas as pd
    import seaborn as sns
    from sklearn.cluster import DBSCAN
    from sklearn.datasets import make_blobs
    from sklearn.manifold import TSNE
    
    # Toy data: 200 points with 5 features, drawn from 3 blobs
    X, y = make_blobs(n_samples=200, centers=3, n_features=5, random_state=99)
    
    # 2-D embedding for plotting; DBSCAN is run on the original features X
    tsne_em = TSNE(n_components=2, init='pca', learning_rate=1).fit_transform(X)
    get_clusters = DBSCAN(eps=2, min_samples=5).fit_predict(X)
    
    df = pd.DataFrame(tsne_em, columns=['tsne1', 'tsne2'])
    df['dbscan'] = get_clusters
    df['actual'] = y
    

    We plot the DBSCAN clustering results on the t-SNE coordinates:

    sns.scatterplot(x = "tsne1", y = "tsne2",hue = "dbscan",data=df)
    

    [Image: scatterplot of tsne1 vs. tsne2, colored by DBSCAN label]
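
The question also asks about circling the clusters and showing their sizes on the plot. Below is a minimal sketch of one way to do that, assuming a 2-D embedding and the DBSCAN labels are already in a dataframe. The 2-D blobs only stand in for the t-SNE coordinates so that the example runs quickly, the output file name is a placeholder, and the circle (centroid plus the largest distance to a member point) is an illustrative choice rather than part of the original answer.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove this line in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from matplotlib.patches import Circle
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

# Assumption: 2-D blobs stand in for the t-SNE coordinates.
coords, _ = make_blobs(n_samples=200, centers=3, n_features=2, random_state=99)
df = pd.DataFrame(coords, columns=["tsne1", "tsne2"])
df["dbscan"] = DBSCAN(eps=2, min_samples=5).fit_predict(coords)

fig, ax = plt.subplots()
ax.scatter(df["tsne1"], df["tsne2"], c=df["dbscan"], cmap="tab10", s=10)

# For every cluster except noise (-1): circle it and write its size at the centroid.
circles = {}
for label, grp in df[df["dbscan"] != -1].groupby("dbscan"):
    pts = grp[["tsne1", "tsne2"]].to_numpy()
    center = pts.mean(axis=0)
    radius = np.linalg.norm(pts - center, axis=1).max()
    circles[label] = (center, radius, len(grp))
    ax.add_patch(Circle(center, radius, fill=False, linestyle="--"))
    ax.annotate(f"cluster {label}\nn={len(grp)}", xy=center,
                ha="center", va="center")

ax.set_aspect("equal")  # keep the circles round
ax.set_xlabel("tsne1")
ax.set_ylabel("tsne2")
fig.savefig("clusters_annotated.png", dpi=150, bbox_inches="tight")
```

For elongated clusters, a convex hull or an ellipse may describe the shape better than a circle.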

    The cluster sizes can be obtained with value_counts (the label -1 marks noise points that DBSCAN did not assign to any cluster):

    df['dbscan'].value_counts()
    
     1    63
     2    63
     0    59
    -1    15
    

    As fractions of all points (multiply by 100 for percentages):

    df['dbscan'].value_counts(normalize=True)
     1    0.315
     2    0.315
     0    0.295
    -1    0.075
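
If you want counts and percentages side by side, something like the following may help. It is only a sketch: the labels are rebuilt from the sizes printed above (63, 63, 59 and 15 noise points) so that it runs on its own, and the column names are made up for illustration. In practice you would pass the array returned by fit_predict.

```python
import numpy as np
import pandas as pd

# Labels with the same sizes as in the output above; -1 is DBSCAN's noise label.
# In practice, use the array returned by DBSCAN(...).fit_predict(...).
labels = pd.Series(np.repeat([1, 2, 0, -1], [63, 63, 59, 15]), name="dbscan")

counts = labels.value_counts().sort_index()
summary = pd.DataFrame({
    "n_points": counts,
    "pct_of_all": (100 * counts / counts.sum()).round(1),
})

# Share among clustered points only, i.e. with the noise points left out.
clustered = counts.drop(-1, errors="ignore")
summary["pct_of_clustered"] = (100 * clustered / clustered.sum()).round(1)

# Readable row names (hypothetical naming, chosen for this example).
summary.index = summary.index.map(lambda k: "noise" if k == -1 else f"cluster {k}")
print(summary)
```

The noise row has no value in the last column because noise points are excluded from that denominator.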
    

    To compare against other labels, cross-tabulate the two columns, for example with pd.crosstab(df['dbscan'], df['actual']). Here I used the actual label; you can use your other annotations:

    actual   0   1   2
    dbscan
    -1       4   8   3
     0       0  59   0
     1       0   0  63
     2      63   0   0
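
A table like the one above can be produced with pd.crosstab. The sketch below rebuilds the label pairs from the counts in that table so that it runs on its own (with your data you would pass the two dataframe columns directly), and it also shows the per-cluster composition as fractions.

```python
import pandas as pd

# (dbscan label, actual label, count) triples read off the table above.
cells = [(-1, 0, 4), (-1, 1, 8), (-1, 2, 3), (0, 1, 59), (1, 2, 63), (2, 0, 63)]
df = pd.DataFrame(
    [(d, a) for d, a, n in cells for _ in range(n)],
    columns=["dbscan", "actual"],
)

# Raw counts, with row and column totals added as "All".
table = pd.crosstab(df["dbscan"], df["actual"], margins=True)
print(table)

# Row-wise fractions: how each DBSCAN cluster splits across the actual labels.
composition = pd.crosstab(df["dbscan"], df["actual"], normalize="index")
print(composition.round(3))
```

For the question's data, the second column could be an annotation such as Species or Gender instead of the actual label.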