python pandas matplotlib pca biplot

Biplots in matrix format using PCA


This is a snippet of my dataframe:

   species    bill_length_mm  bill_depth_mm  flipper_length_mm  body_mass_g  predicted_species
0  Adelie                 18             18                181         3750  Chinstrap
1  Adelie                 17             17                186         3800  Adelie
2  Adelie                 18             18                195         3250  Gentoo
3  Adelie                  0              0                  0            0  Adelie
4  Chinstrap              19             19                193         3450  Chinstrap
5  Chinstrap              20             20                190         3650  Gentoo
6  Chinstrap              17             17                181         3625  Adelie
7  Gentoo                 19             19                195         4675  Chinstrap
8  Gentoo                 18             18                193         3475  Gentoo
9  Gentoo                 20             20                190         4250  Gentoo

I want to make a biplot for my data, something like this (image omitted):

But I want one biplot for every combination of species and predicted_species, i.e. a 3 × 3 matrix of 9 subplots, each like the one above. I am not sure how that can be achieved. One way would be to split the data into separate dataframes and make a biplot for each, but that isn't very efficient and makes comparison difficult.

Can anyone provide some suggestions on how this could be done?


Solution

  • Combining the answer by Qiyun Zhu on how to plot a biplot with my answer on how to split the plot into the true vs. predicted subsets, you could do it like this:

    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    import seaborn as sns
    from sklearn.preprocessing import StandardScaler
    from sklearn.decomposition import PCA
    
    # Load iris data.
    iris = sns.load_dataset('iris')
    X = iris.iloc[:, :4].values
    y = iris.iloc[:, 4].values
    features = iris.columns[:4]
    targets = ['setosa', 'versicolor', 'virginica']
    
    # Mock up some predictions.
    iris['species_pred'] = (40 * ['setosa'] + 5 * ['versicolor'] + 5 * ['virginica']
                            + 40 * ['versicolor'] + 5 * ['setosa'] + 5 * ['virginica']
                            + 40 * ['virginica'] + 5 * ['versicolor'] + 5 * ['setosa'])
    
    # Reduce features to two dimensions.
    X_scaled = StandardScaler().fit_transform(X)
    pca = PCA(n_components=2).fit(X_scaled)
    X_reduced = pca.transform(X_scaled)
    iris[['pc1', 'pc2']] = X_reduced
    
    
    def biplot(x, y, data=None, **kwargs):
        # Plot data points.
        sns.scatterplot(data=data, x=x, y=y, **kwargs)
        
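        # Calculate arrow parameters: scale the loadings (features x 2) by the
        # range of the scores on each component so the arrows span the data.
        loadings = pca.components_[:2].T
        pvars = pca.explained_variance_ratio_[:2] * 100  # % variance explained (not used below)
        arrows = loadings * np.ptp(X_reduced, axis=0)
        # Arrow width: 0.75% of the larger axis span (min - max is negative, hence the sign).
        width = -0.0075 * np.min([np.subtract(*plt.xlim()), np.subtract(*plt.ylim())])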
    
        # Plot arrows.
        horizontal_alignment = ['right', 'left', 'right', 'right']
        vertical_alignment = ['bottom', 'top', 'top', 'bottom']
        for (i, arrow), ha, va in zip(enumerate(arrows), 
                                      horizontal_alignment, vertical_alignment):
            plt.arrow(0, 0, *arrow, color='k', alpha=0.5, width=width, ec='none',
                      length_includes_head=True)
            plt.text(*(arrow * 1.05), features[i], ha=ha, va=va, 
                     fontsize='small', color='gray')
    
        
    # Plot small multiples, corresponding to confusion matrix.
    sns.set()
    g = sns.FacetGrid(iris, row='species', col='species_pred', 
                      hue='species', margin_titles=True)
    g.map(biplot, 'pc1', 'pc2')
    plt.show()
    

  (Image: biplot split into nine parts)
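
  Since the grid mirrors a confusion matrix, it can help to count how many points will land in each panel before plotting. Here is a minimal sketch using only the species and predicted_species columns from the snippet in the question; the names df and counts are my own, not from the original post:

```python
import pandas as pd

# True and predicted labels copied from the ten rows shown in the question.
df = pd.DataFrame({
    'species': ['Adelie'] * 4 + ['Chinstrap'] * 3 + ['Gentoo'] * 3,
    'predicted_species': ['Chinstrap', 'Adelie', 'Gentoo', 'Adelie',
                          'Chinstrap', 'Gentoo', 'Adelie',
                          'Chinstrap', 'Gentoo', 'Gentoo'],
})

# Rows are the true species, columns the predicted species:
# the same layout as the FacetGrid with row='species', col='species_pred'.
counts = pd.crosstab(df['species'], df['predicted_species'])
print(counts)
```

  For these ten rows the Gentoo/Adelie cell is 0, so the corresponding facet would contain no points. With the full dataset the counts will of course differ.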