I am trying to create overlapping, transparent violin plots split by one variable using seaborn in Python. My dataset looks like this:
The "name" variable takes the values "one" to "nine", "distance" ranges from 0 to 1, "condition" is either "healthy" or "disease", and "sample_id" runs from 1 to 16. Each "condition" has 8 sample_ids.
Please see my current result below:
As you can see, the problem is that the two halves of the violin plot have the wrong orientation for each "name" value, and the legend repeats the disease/healthy "condition" entries for each of the 16 sample_ids.
The code that I am using for this is:
import matplotlib.pyplot as plt
import seaborn as sns

# my_dataset is the DataFrame described above
my_ids = my_dataset.sample_id.unique()
my_condition_palette = {"disease": "darkorange", "healthy": "steelblue"}
fig, ax = plt.subplots()
# draw one set of violins per sample on the same axes
for sample_id in my_ids:
    sns.violinplot(data=my_dataset[my_dataset.sample_id == sample_id], x="name", y="distance", hue="condition", hue_order=["disease", "healthy"], palette=my_condition_palette, cut=0, linewidth=0, inner=None, split=True, density_norm="count", common_norm=False, gap=0.1)
# make the overlapping violins transparent
for violin in ax.collections:
    violin.set_alpha(1/8)
Does anyone know what I am doing wrong here? Or perhaps there is a better way of plotting this? Thank you!
With density_norm="count", the width of the violin for the x-value with the highest count (for the given sample_id) is maximized. The width of the other violins is shrunk relative to their count.
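As a rough illustration of that count-based scaling (a sketch of the idea only, not seaborn's internal code), the count factor for each violin can be approximated by dividing its number of observations by the largest count among the violins. The toy data and the names `one_sample` and `relative_width` are made up for this example, and the shape of each density still affects how wide a violin actually gets.

```python
import pandas as pd

# Hypothetical observations for a single sample_id: one row per measurement
one_sample = pd.DataFrame({
    "name": ["one"] * 40 + ["two"] * 20 + ["three"] * 10,
})

# Number of observations per x value
counts = one_sample["name"].value_counts()

# Approximate count factor: the most populated x value gets the full factor,
# the others shrink in proportion to their counts
relative_width = counts / counts.max()
print(relative_width)
```

Because each sample_id is drawn with a separate call, this scaling would be recomputed for every sample, so the widest violin can sit at a different x value from one sample to the next.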
In the given dataset, it seems that each sample_id is either fully 'healthy' or fully 'disease'. When drawing one sample_id, seaborn thinks there is only one hue value active, which will occupy the full width for each of the x-values. You can use dodge=True to force the violin to be reduced and put on the correct side.
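One way to check whether this applies to a dataset is to count the distinct condition values per sample_id. The snippet below is a minimal sketch on made-up data; the `toy` frame only mimics the column names from the question, and its contents are an assumption.

```python
import pandas as pd

# Toy data: four samples, each belonging entirely to one condition (assumed layout)
toy = pd.DataFrame({
    "sample_id": [1, 1, 2, 2, 3, 3, 4, 4],
    "condition": ["disease", "disease", "healthy", "healthy",
                  "disease", "disease", "healthy", "healthy"],
})

# Number of distinct conditions seen within each sample_id
conditions_per_sample = toy.groupby("sample_id")["condition"].nunique()
print(conditions_per_sample)

# True when every per-sample subset contains a single hue level
single_hue_per_sample = bool((conditions_per_sample == 1).all())
print(single_hue_per_sample)
```

If this is true for the real data, every per-sample subset passed to violinplot holds only one of the two hue levels, which is the situation described above.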
For the legend, you can set legend=False for all except one of the sample_ids.
The following code creates reproducible test data and shows how everything could work. The order= parameter sets the order of the x values.
from matplotlib import pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
# first, create some dummy test data
np.random.seed(20250120)
df = pd.DataFrame({'sample_id': np.repeat(np.arange(1, 17), 100)})
names = ['one', 'two', 'three', 'four', 'five', 'six']
prob = np.random.rand(len(names)) ** 2 + 0.1 # use different probabilities for each 'name'
prob /= prob.sum() # the probabilities need to sum to 1
df['name'] = np.random.choice(names, len(df), p=prob)
df['distance'] = np.random.rand(len(df))
df['condition'] = np.where(df['sample_id'] % 2 == 1, 'disease', 'healthy')
my_ids = df.sample_id.unique()
my_condition_palette = {"disease": "darkorange", "healthy": "steelblue"}
fig, ax = plt.subplots()
for sample_id in my_ids:
    sns.violinplot(data=df[df['sample_id'] == sample_id], x="name", y="distance", order=names,
                   hue="condition", hue_order=["disease", "healthy"], palette=my_condition_palette,
                   cut=0, linewidth=0, inner=None, split=True, density_norm="count", common_norm=False, gap=0.1,
                   dodge=True,
                   legend=sample_id == my_ids[0])
for violin in ax.collections:
    violin.set_alpha(1 / 8)
sns.despine()
sns.move_legend(ax, loc="upper left", bbox_to_anchor=(1, 1))
ax.set_xlabel('') # remove superfluous x label
plt.tight_layout()
plt.show()
PS:
This is how the plot looks without dodge=True, when plotting only the first sample. The "half" violins are rescaled to occupy the full width (default 0.8 wide) for each x value.