I have the following synthetic dataframe, which includes numerical and categorical columns as well as a label column.
I want to plot a diagonal correlation matrix and display the correlation coefficients in the upper triangle, as in the expected output.
Setting aside the fact that the categorical columns in the synthetic dataset/dataframe df need to be converted to numerical ones, so far I have used this seaborn example with the 'titanic' dataset, which is synthetic and fits my task, and I added a label column as follows:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
sns.set_theme(style="white")
# Generate a large random dataset with synthetic nature (categorical + numerical)
data = sns.load_dataset("titanic")
df = pd.DataFrame(data=data)
# Generate label column randomly '0' or '1'
df['label'] = np.random.randint(0,2, size=len(df))
# Compute the correlation matrix (numeric columns only; pandas >= 2.0 raises on string columns otherwise)
corr = df.corr(numeric_only=True)
# Generate a mask for the upper triangle
mask = np.triu(np.ones_like(corr, dtype=bool))
# Set up the matplotlib figure
f, ax = plt.subplots(figsize=(11, 9))
# Generate a custom diverging colormap
cmap = sns.diverging_palette(230, 20, as_cmap=True)
# Draw the heatmap with the mask and correct aspect ratio
sns.heatmap(corr, mask=mask, cmap=cmap, vmin=-1.0, vmax=1.0, center=0,
            square=True, linewidths=.5, cbar_kws={"shrink": .5})
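The categorical-to-numeric conversion mentioned above is not shown in the snippet. A minimal sketch of one way to do it, assuming integer codes are acceptable for the task: the toy frame, its column names, and the helper `encode_categoricals` are all hypothetical stand-ins, not part of the original code.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the titanic data (values are made up)
toy = pd.DataFrame({
    "age": [22.0, 38.0, 26.0, 35.0, 54.0, 2.0],
    "fare": [7.25, 71.28, 7.92, 53.10, 51.86, 21.07],
    "sex": ["male", "female", "female", "female", "male", "male"],
    "embarked": pd.Categorical(["S", "C", "S", "S", "S", "Q"]),
})

def encode_categoricals(frame):
    """Return a copy where each non-numeric column is replaced by integer codes."""
    out = frame.copy()
    for col in out.columns:
        if not pd.api.types.is_numeric_dtype(out[col]):
            codes, _ = pd.factorize(out[col])
            # factorize marks missing values with -1; keep them as NaN instead
            out[col] = np.where(codes >= 0, codes, np.nan)
    return out

encoded = encode_categoricals(toy)
corr_toy = encoded.corr()  # every column is numeric now
print(corr_toy.round(2))
```

Integer codes impose an arbitrary order on categories with more than two levels, so the resulting coefficients for such columns should be read with care; one-hot encoding via `pd.get_dummies` would be an alternative.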
I checked a related post but couldn't figure out how to do this task. The best I could find so far is this workaround, which requires installing this package and gives me the following output:
#!pip install heatmapz
# Import the two methods from heatmap library
from heatmap import heatmap, corrplot
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
sns.set_theme(style="white")
# Generate a large random dataset
data = sns.load_dataset("titanic")
df = pd.DataFrame(data=data)
# Generate label column randomly '0' or '1'
df['label'] = np.random.randint(0,2, size=len(df))
# Compute the correlation matrix (numeric columns only; pandas >= 2.0 raises on string columns otherwise)
corr = df.corr(numeric_only=True)
# Generate a mask for the upper triangle
mask = np.triu(np.ones_like(corr, dtype=bool))
mask[np.diag_indices_from(mask)] = False
np.fill_diagonal(mask, True)
# Set up the matplotlib figure
plt.figure(figsize=(8, 8))
# Draw the heatmap using "Heatmapz" package
corrplot(corr[mask], size_scale=300)
Sadly, corr[mask] doesn't mask the upper triangle in this package.
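For reference, one way to blank out the upper triangle of a correlation matrix in pandas itself is `DataFrame.mask`, which replaces entries with NaN wherever the condition is True. The sketch below uses a small random frame (`demo`, a hypothetical stand-in) and does not verify how `corrplot` from the heatmapz package treats NaN cells.

```python
import numpy as np
import pandas as pd

# Hypothetical random data, only used to get a small correlation matrix
rng = np.random.default_rng(0)
demo = pd.DataFrame(rng.normal(size=(50, 4)), columns=list("abcd"))
corr_demo = demo.corr()

# True strictly above the diagonal (k=1 keeps the diagonal itself visible)
upper = np.triu(np.ones(corr_demo.shape, dtype=bool), k=1)

# mask() sets entries to NaN where the condition is True
lower_only = corr_demo.mask(upper)
print(lower_only.round(2))
```

Using `k=0` instead would also hide the diagonal, which is what the seaborn example's mask does.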
I also noticed that producing this kind of plot is much easier in R, so I'm open to a more straightforward way of converting a Python pandas DataFrame to an R data frame. There seems to be a package called rpy2 that lets us use Python and R together, even in a Google Colab notebook: Ref.1
from rpy2.robjects import pandas2ri
pandas2ri.activate()
If that is the case, I found post1 and post2, which use R to visualize a correlation matrix.
So, in short, my first priority is using Python and its packages (Matplotlib, seaborn, Plotly Express), and then R and its packages, to reach the expected output.
I have provided executable code in a Google Colab notebook with R and the dataset, so that you can form/test your final answer if your solution uses rpy2; otherwise I'd be interested in a Pythonic solution.
I'd be interested in a Pythonic solution.
Use a seaborn scatter plot with matplotlib text/line annotations:
1. Plot the lower triangle via sns.scatterplot with square markers.
2. Annotate the upper triangle via plt.text.
3. Draw the grid via plt.vlines and plt.hlines.
Full code using the titanic sample:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_theme(style="white")
# generate sample correlation matrix
df = sns.load_dataset("titanic")
df["label"] = np.random.randint(0, 2, size=len(df))
corr = df.corr(numeric_only=True)  # numeric columns only; pandas >= 2.0 raises on string columns otherwise
# mask and melt correlation matrix
mask = np.tril(np.ones_like(corr, dtype=bool)) | corr.abs().le(0.1)
melt = corr.mask(mask).melt(ignore_index=False).reset_index()
melt["size"] = melt["value"].abs()
fig, ax = plt.subplots(figsize=(8, 6))
# normalize colorbar
cmap = plt.cm.RdBu
norm = plt.Normalize(-1, 1)
sm = plt.cm.ScalarMappable(norm=norm, cmap=cmap)
cbar = plt.colorbar(sm, ax=ax)
cbar.ax.tick_params(labelsize="x-small")
# plot lower triangle (scatter plot with normalized hue and square markers)
sns.scatterplot(ax=ax, data=melt, x="index", y="variable", size="size",
                hue="value", hue_norm=norm, palette=cmap,
                style=0, markers=["s"], legend=False)
# format grid
xmin, xmax = (-0.5, corr.shape[0] - 0.5)
ymin, ymax = (-0.5, corr.shape[1] - 0.5)
ax.vlines(np.arange(xmin, xmax + 1), ymin, ymax, lw=1, color="silver")
ax.hlines(np.arange(ymin, ymax + 1), xmin, xmax, lw=1, color="silver")
ax.set(aspect=1, xlim=(xmin, xmax), ylim=(ymax, ymin), xlabel="", ylabel="")
ax.tick_params(labelbottom=False, labeltop=True)
plt.xticks(rotation=90)
# annotate upper triangle
for y in range(corr.shape[0]):
    for x in range(corr.shape[1]):
        value = corr.mask(mask).to_numpy()[y, x]
        if pd.notna(value):
            plt.text(x, y, f"{value:.2f}", size="x-small",
                     # color=sm.to_rgba(value), weight="bold",
                     ha="center", va="center")
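To see what the mask-and-melt step in the code above produces, here is a small sketch on a hand-written 3x3 matrix. The frame `small` and its values are made up for illustration; only the `mask`/`melt` chain mirrors the answer's code.

```python
import numpy as np
import pandas as pd

# Hypothetical 3x3 "correlation" matrix with hand-picked values
small = pd.DataFrame(
    [[1.0, 0.5, -0.3],
     [0.5, 1.0, 0.05],
     [-0.3, 0.05, 1.0]],
    index=list("abc"), columns=list("abc"),
)

# Hide the lower triangle (including the diagonal) and weak correlations (|r| <= 0.1)
hide = np.tril(np.ones(small.shape, dtype=bool)) | small.abs().le(0.1)

# Long format: one row per cell, "index" = matrix row, "variable" = matrix column
long_form = small.mask(hide).melt(ignore_index=False).reset_index()
long_form["size"] = long_form["value"].abs()

# Only the unmasked cells carry a value; the rest are NaN
kept = long_form.dropna(subset=["value"])
print(kept)
```

Because the scatter plot puts "index" (the matrix row) on the x-axis and "variable" (the matrix column) on the y-axis, each kept upper-triangle cell is drawn at its transposed position, which is why the markers end up in the lower triangle while the text annotations stay in the upper one.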
Note that since most of these titanic correlations are low, I disabled the text coloring for readability.
If you want color-coded text, uncomment the color=sm.to_rgba(value) line at the end: