I need to create a violin plot based on two categories, but some combinations of the categories are not present in the data, which leaves white space in the plot. I remember that, a long time ago in R, I was able to adjust the size of the violins when categories were missing by using geom_violin(position = position_dodge(0.9))
(see the attached image).
Now I need to create a similar figure in Python, but when I make the violin plot with seaborn I get white space wherever a combination of the variables isn't available (see image).
Below is the code I am using in Python. I would appreciate any help with this.
Reproducible example
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
# Define categories for Depth and Hydraulic Conductivity
depth_categories = ["<0.64", "0.64-0.82", "0.82-0.90", ">0.9"]
hydraulic_conductivity_categories = ["<0.2", "0.2-2.2", "2.2-15.5", ">15.5"]
# Generate random HSI values
np.random.seed(42) # For reproducibility
hsi_values = np.random.uniform(low=0, high=35, size=30)
# Generate random categories for Depth and Hydraulic Conductivity
depth_values = np.random.choice(depth_categories, size=30)
hydraulic_conductivity_values = np.random.choice(hydraulic_conductivity_categories, size=30)
# Overwrite the first five rows with fixed combinations; with only 30 rows,
# some Depth/Hydraulic Conductivity combinations remain missing
for i in range(5):
    depth_values[i] = depth_categories[i % len(depth_categories)]
    hydraulic_conductivity_values[i] = hydraulic_conductivity_categories[(i + 1) % len(hydraulic_conductivity_categories)]
# Create the DataFrame
dummy_data = pd.DataFrame({
    'HSI': hsi_values,
    'Depth': depth_values,
    'Hydraulic_Conductivity': hydraulic_conductivity_values
})
# Violin plot for Soil Depth and Hydraulic Conductivity
plt.figure(figsize=(12, 6))
sns.violinplot(x='Depth', y='HSI', hue='Hydraulic_Conductivity', data=dummy_data, palette=color_palette,
               density_norm="count",
               cut=0,
               gap=0.1,
               linewidth=0.5,
               common_norm=False,
               dodge=True)
plt.xlabel("DDDD")
plt.ylabel("XXX")
plt.title("Violin plot of XXX by YYYY and DDDD")
plt.ylim(-5, 35)
plt.legend(title='DDDD', loc='upper right')
# sns.despine()  # would remove the top and right spines
plt.show()
I'm not aware of a way to do this automatically, but you can overlay several violin plots on the same axes, manually synchronizing the hue colors.
An efficient way is to use groupby to split the data by the number of distinct hues per X-axis category and loop over those groups, then create the legend manually:
# for reproducibility
color_palette = sns.color_palette('Set1')
# define the columns to use
hue_col = 'Hydraulic_Conductivity'
X_col = 'Depth'
Y_col = 'HSI'
# custom order for the hues
hue_order = sorted(dummy_data[hue_col].unique(),
                   key=lambda x: (not x.startswith('<'), float(x.strip('<>').partition('-')[0]))
                   )
# ['<0.2', '0.2-2.2', '2.2-15.5', '>15.5']
colors = dict(zip(hue_order, color_palette))
# custom X-order
# could use the same logic as above
X_order = ['<0.64', '0.64-0.82', '0.82-0.90', '>0.9']
# create groups with number of hues per X-axis group
group = dummy_data.groupby(X_col)[hue_col].transform('nunique')
f, ax = plt.subplots(figsize=(12, 6))
for _, g in dummy_data.groupby(group):
    # get unique hues for this group to ensure consistent order
    hues = set(g[hue_col])
    hues = [h for h in hue_order if h in hues]
    sns.violinplot(
        x=X_col, y=Y_col, hue=hue_col, data=g,
        order=X_order,
        hue_order=hues,  # ensure consistent order across groups
        palette=colors,
        density_norm='count',
        cut=0,
        gap=0.1,
        linewidth=0.5,
        common_norm=False,
        dodge=True,
        ax=ax,  # reuse the same axes
        legend=False,  # do not plot the legend
    )
# create a custom legend manually from the colors dictionary
import matplotlib.patches as mpatches
plt.legend(handles=[mpatches.Patch(color=c, label=l) for l, c in colors.items()],
           title='DDDD', loc='upper right')
plt.xlabel('DDDD')
plt.ylabel('XXX')
plt.title('Violin plot of XXX by YYYY and DDDD')
plt.ylim(-5, 35)
Output:
NB. your example has a few categories with a single data point, hence the single lines in the output. This makes those categories ambiguous, since the color is not visible, but it shouldn't be an issue if you have enough data.
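If you want to check beforehand which combinations are missing or only have a single data point, something along these lines might help. This is only a rough sketch on a small made-up DataFrame (`toy` and its values are hypothetical, not the data from the question); it counts rows per combination and the number of distinct hues per X-axis category, which is the same quantity the grouping above relies on.

```python
import pandas as pd

# Hypothetical toy data, NOT the data from the question
toy = pd.DataFrame({
    'Depth': ['<0.64', '<0.64', '<0.64', '0.64-0.82', '0.64-0.82', '>0.9'],
    'Hydraulic_Conductivity': ['<0.2', '<0.2', '0.2-2.2', '<0.2', '<0.2', '>15.5'],
    'HSI': [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
})

# Rows per (Depth, Hydraulic_Conductivity) combination;
# combinations that never occur show up as 0
counts = pd.crosstab(toy['Depth'], toy['Hydraulic_Conductivity'])

# Number of distinct hues per X-axis category
hues_per_x = toy.groupby('Depth')['Hydraulic_Conductivity'].nunique()

# Observed combinations with exactly one data point
# (these would be drawn as a single line rather than a violin)
sizes = toy.groupby(['Depth', 'Hydraulic_Conductivity']).size()
singles = sizes[sizes == 1]

print(counts)
print(hues_per_x)
print(singles)
```

On your real data you would replace `toy` with `dummy_data`; any combination listed in `singles` is one where the color will not be visible in the plot.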