python r seaborn heatmap rpy2

Plotting a fancy diagonal correlation matrix with coefficients in upper triangle


I have the following synthetic dataframe, which includes numerical and categorical columns as well as a label column. I want to plot a diagonal correlation matrix and display the correlation coefficients in the upper triangle, as in the following:

expected output:

img

Setting aside the fact that the categorical columns in the synthetic dataset/dataframe df need to be converted to numerical, so far I have used this seaborn example with the 'titanic' dataset, which is synthetic and fits my task, and added a label column as follows:

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

sns.set_theme(style="white")

# Generate a large random dataset with synthetic nature (categorical + numerical)
data = sns.load_dataset("titanic")
df = pd.DataFrame(data=data)

# Generate label column randomly '0' or '1'
df['label'] = np.random.randint(0,2, size=len(df))

# Compute the correlation matrix
# (pandas >= 2.0 raises on non-numeric columns unless numeric_only=True)
corr = df.corr(numeric_only=True)

# Generate a mask for the upper triangle
mask = np.triu(np.ones_like(corr, dtype=bool))

# Set up the matplotlib figure
f, ax = plt.subplots(figsize=(11, 9))

# Generate a custom diverging colormap
cmap = sns.diverging_palette(230, 20, as_cmap=True)

# Draw the heatmap with the mask and correct aspect ratio
sns.heatmap(corr, mask=mask, cmap=cmap, vmin=-1.0, vmax=1.0, center=0,
            square=True, linewidths=.5, cbar_kws={"shrink": .5})

img
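As noted above, the categorical columns have to become numeric before they can appear in the correlation matrix. Here is a minimal sketch of one way to do that, assuming plain integer codes are acceptable for the task. The `toy` frame and its values are made up for illustration and stand in for the titanic data.

```python
import pandas as pd

# Toy stand-in for a mixed-type frame (hypothetical values, not the titanic set)
toy = pd.DataFrame({
    "age": [22.0, 38.0, 26.0, 35.0, 54.0, 2.0],
    "fare": [7.25, 71.28, 7.92, 53.10, 51.86, 21.07],
    "sex": ["male", "female", "female", "female", "male", "male"],
    "embarked": pd.Categorical(["S", "C", "S", "S", "S", "Q"]),
    "label": [0, 1, 1, 1, 0, 0],
})

# Replace every non-numeric column with integer codes.
# pd.factorize numbers the distinct values arbitrarily, which is harmless
# for binary columns but imposes a made-up order on multi-level ones.
encoded = toy.copy()
for col in encoded.select_dtypes(exclude="number").columns:
    encoded[col] = pd.factorize(encoded[col])[0]

# All columns are numeric now, so every one of them shows up in the matrix
corr_toy = encoded.corr()
print(corr_toy.round(2))
```

For categorical columns with more than two levels, one-hot encoding (for example `pd.get_dummies`) may be a better fit than integer codes, since the codes carry no real ordering.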

I checked a related post but couldn't figure out how to do this task. The best I could find so far is this workaround, which relies on this package and gives me the following output:

#!pip install heatmapz
# Import the two methods from heatmap library
from heatmap import heatmap, corrplot
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

sns.set_theme(style="white")

# Generate a large random dataset
data = sns.load_dataset("titanic")
df = pd.DataFrame(data=data)

# Generate label column randomly '0' or '1'
df['label'] = np.random.randint(0,2, size=len(df))

# Compute the correlation matrix
# (pandas >= 2.0 raises on non-numeric columns unless numeric_only=True)
corr = df.corr(numeric_only=True)

# Generate a mask for the upper triangle
mask = np.triu(np.ones_like(corr, dtype=bool)) 
mask[np.diag_indices_from(mask)] = False
np.fill_diagonal(mask, True)

# Set up the matplotlib figure
plt.figure(figsize=(8, 8))

# Draw the heatmap using "Heatmapz" package
corrplot(corr[mask], size_scale=300)

img

Sadly, corr[mask] doesn't mask the upper triangle in this package.
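For reference, on the pandas side a triangle can be blanked out explicitly with `DataFrame.where`, which keeps the cells where the condition is True and sets the rest to NaN. The sketch below uses a small random frame (an assumption, for illustration only). How `corrplot` treats the resulting NaN cells is not something this sketch checks.

```python
import numpy as np
import pandas as pd

# Small random frame, used only to get a correlation matrix to work with
rng = np.random.default_rng(0)
toy = pd.DataFrame(rng.normal(size=(50, 4)), columns=list("abcd"))
corr_toy = toy.corr()

# True strictly below the diagonal, False on and above it
keep_lower = np.tril(np.ones(corr_toy.shape, dtype=bool), k=-1)

# Keep the lower triangle; the diagonal and upper triangle become NaN
lower_only = corr_toy.where(keep_lower)
print(lower_only.round(2))
```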

I also noticed that this kind of plot is much easier to produce in R, so I'm open to a more straightforward way of converting a Python pandas DataFrame to an R data frame. There appears to be a package called rpy2 that lets us use Python and R together, even in a Google Colab notebook: Ref.1

from rpy2.robjects import pandas2ri
pandas2ri.activate() 

If that is the case, I found post1 & post2, which use R to visualize a correlation matrix. In short, my first priority is Python and its packages (Matplotlib, seaborn, Plotly Express), and then R and its packages, to reach the expected output.

Note

I have provided executable code in a Google Colab notebook with R using the dataset, so that you can form/test your final answer if your solution uses rpy2; otherwise I'd be interested in a Pythonic solution.


Solution

  • I'd be interested in a Pythonic solution.

    Use a seaborn scatter plot with matplotlib text/line annotations:

    1. Plot the lower triangle via sns.scatterplot with square markers
    2. Annotate the upper triangle via plt.text
    3. Draw the heatmap grid via plt.vlines and plt.hlines
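The scatter plot in step 1 needs the correlation matrix in long form, with one row per cell. Here is a minimal sketch of that mask-and-melt step on a hand-written 3x3 matrix (the values are made up for illustration), using the same 0.1 cutoff for weak correlations:

```python
import numpy as np
import pandas as pd

# Hand-written toy correlation matrix (hypothetical values)
labels = ["a", "b", "c"]
corr_toy = pd.DataFrame(
    [[1.0, 0.8, -0.05],
     [0.8, 1.0, -0.4],
     [-0.05, -0.4, 1.0]],
    index=labels, columns=labels,
)

# Hide the diagonal, the lower triangle, and weak correlations (|r| <= 0.1)
hide = np.tril(np.ones(corr_toy.shape, dtype=bool)) | corr_toy.abs().le(0.1)

# Long form: one row per cell, NaN where hidden
melted = corr_toy.mask(hide).melt(ignore_index=False).reset_index()
melted["size"] = melted["value"].abs()

# Only the non-NaN rows end up as markers in the scatter plot
visible = melted.dropna(subset=["value"])
print(visible)
```

Each remaining row carries the row label (`index`), the column label (`variable`), the signed coefficient (`value`, used for hue) and its magnitude (`size`, used for marker size).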

    Full code using the titanic sample:

    import pandas as pd
    import numpy as np
    import matplotlib.pyplot as plt
    import seaborn as sns
    
    sns.set_theme(style="white")
    
    # generate sample correlation matrix
    df = sns.load_dataset("titanic")
    df["label"] = np.random.randint(0, 2, size=len(df))
    # (pandas >= 2.0 raises on non-numeric columns unless numeric_only=True)
    corr = df.corr(numeric_only=True)
    
    # mask and melt correlation matrix
    mask = np.tril(np.ones_like(corr, dtype=bool)) | corr.abs().le(0.1)
    melt = corr.mask(mask).melt(ignore_index=False).reset_index()
    melt["size"] = melt["value"].abs()
    
    fig, ax = plt.subplots(figsize=(8, 6))
    
    # normalize colorbar
    cmap = plt.cm.RdBu
    norm = plt.Normalize(-1, 1)
    sm = plt.cm.ScalarMappable(norm=norm, cmap=cmap)
    cbar = plt.colorbar(sm, ax=ax)
    cbar.ax.tick_params(labelsize="x-small")
    
    # plot lower triangle (scatter plot with normalized hue and square markers)
    sns.scatterplot(ax=ax, data=melt, x="index", y="variable", size="size",
                    hue="value", hue_norm=norm, palette=cmap,
                    style=0, markers=["s"], legend=False)
    
    # format grid
    xmin, xmax = (-0.5, corr.shape[0] - 0.5)
    ymin, ymax = (-0.5, corr.shape[1] - 0.5)
    ax.vlines(np.arange(xmin, xmax + 1), ymin, ymax, lw=1, color="silver")
    ax.hlines(np.arange(ymin, ymax + 1), xmin, xmax, lw=1, color="silver")
    ax.set(aspect=1, xlim=(xmin, xmax), ylim=(ymax, ymin), xlabel="", ylabel="")
    ax.tick_params(labelbottom=False, labeltop=True)
    plt.xticks(rotation=90)
    
    # annotate upper triangle (compute the masked values once, outside the loops)
    values = corr.mask(mask).to_numpy()
    for y in range(corr.shape[0]):
        for x in range(corr.shape[1]):
            value = values[y, x]
            if pd.notna(value):
                plt.text(x, y, f"{value:.2f}", size="x-small",
                         # color=sm.to_rgba(value), weight="bold",
                         ha="center", va="center")
    

    Note that since most of these titanic correlations are low, I disabled the text coloring for readability.

    If you want color-coded text, uncomment the color=sm.to_rgba(value) line at the end: