python pandas pca dimensionality-reduction tsne

Reduced dimensions visualization for true vs predicted values


I have a dataframe which looks like this:

label    predicted     F1  F2   F3 .... F40
major     minor         2   1   4
major     major         1   0   10
minor     patch         4   3   23
major     patch         2   1   11
minor     minor         0   4   8
patch     major         7   3   30
patch     minor         8   0   1
patch     patch         1   7   11

The label column holds the true label for each id (not shown, as it is not relevant), predicted holds the predicted label, and the remaining columns of my df are a set of around 40 features.

The idea is to reduce these 40 features to 2 dimensions and visualize them by true vs. predicted label. With three labels (major, minor and patch) there are 9 combinations of true label and prediction.

With PCA, 2 components do not capture much of the variance, and I am not sure how to map the PCA values back to the labels and predictions in the original df as a whole. One way would be to split the data into 9 dataframes, one per case, and process each separately, but this isn't what I am looking for.

Is there any other way I can reduce and visualize the given data? Any suggestions would be highly appreciated.


Solution

  • You may want to consider a small multiples plot, with one scatterplot for each cell of the confusion matrix.

    If PCA does not work well, t-distributed stochastic neighbor embedding (t-SNE) is often a good alternative in my experience.

    For example, with the iris dataset, which also has three prediction classes, it could look like this:

    import pandas as pd
    import seaborn as sns
    from sklearn.manifold import TSNE
    
    iris = sns.load_dataset('iris')
    
    # Mock up some predictions.
    iris['species_pred'] = (40 * ['setosa'] + 5 * ['versicolor'] + 5 * ['virginica']
                            + 40 * ['versicolor'] + 5 * ['setosa'] + 5 * ['virginica']
                            + 40 * ['virginica'] + 5 * ['versicolor'] + 5 * ['setosa'])
    
    # Show confusion matrix (output shown as comments).
    print(pd.crosstab(iris.species, iris.species_pred))
    # species_pred  setosa  versicolor  virginica
    # species
    # setosa              40           5          5
    # versicolor           5          40          5
    # virginica            5           5         40
    
    # Reduce features to two dimensions.
    X = iris.iloc[:, :4].values
    X_embedded = TSNE(n_components=2, init='random', learning_rate='auto'
                     ).fit_transform(X)
    iris[['tsne_x', 'tsne_y']] = X_embedded
    
    # Plot small multiples, corresponding to confusion matrix.
    sns.set()
    g = sns.FacetGrid(iris, row='species', col='species_pred', margin_titles=True)
    g.map(sns.scatterplot, 'tsne_x', 'tsne_y');
    

    [Figure: small multiples plot]