I'm trying to make a stripplot / point plot / scatter plot in which the points within each category are sorted by their y value (see the example in this forum post, which has two categories: placebo and full). I would like the points within each gene category to be sorted in the same way.
How can this be done in seaborn / pandas?
A simple example input would be:
pd.DataFrame({
"Gene": ["Gene1", "Gene1", "Gene1", "Gene1", "Gene1", "Gene1",
"Gene2", "Gene2", "Gene2", "Gene2", "Gene2", "Gene2"],
"Value": [80, 1205, 5, 150, 50, 80,
12, 5235, 235, 1245, 126, 10]})
And more elaborate:
pd.DataFrame({
"Gene": ["Gene1", "Gene1", "Gene1", "Gene1", "Gene1", "Gene1",
"Gene2", "Gene2", "Gene2", "Gene2", "Gene2", "Gene2"],
"Value": [80, 1205, 5, 150, 50, 80,
12, 5235, 235, 1245, 126, 10],
"State": ["active", "inactive", "active", "inactive", "active", "active",
"active", "active", "active", "inactive", "inactive", "inactive"]})
So the genes are the y-ticks, the values are the points and the activity is the hue.
Building on @Fourier's answer, I propose the following solution.
I don't think you can use stripplot to achieve the desired result, but that's ok, that's not what stripplot is made for anyway.
The situation is fairly straightforward if you don't have several hues. Then the boxplots are simply located at x-values 0, 1, 2, ... and have a width that can be defined in the call to boxplot (0.8 by default). Knowing these pieces of information, it is fairly simple to calculate what the x-values of our points should be so they are centered over the boxplot:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

df = pd.DataFrame({
    "Gene": ["Gene1", "Gene1", "Gene1", "Gene1", "Gene1", "Gene1",
             "Gene2", "Gene2", "Gene2", "Gene2", "Gene2", "Gene2"],
    "Value": [80, 1205, 5, 150, 50, 80,
              12, 5235, 235, 1245, 126, 10]})
order = ['Gene1', 'Gene2']
width = 0.8
fig, ax = plt.subplots()
sns.boxplot(x='Gene', y='Value', data=df, orient='v', color='w', fliersize=0, order=order, width=width, ax=ax)
for x0, o in enumerate(order):
    temp_df = df[df['Gene'] == o]
    # rank the values within this gene; ties are broken by order of appearance
    x_vals = temp_df['Value'].rank(method='first')
    # map the ranks linearly onto the horizontal extent of the box centered at x0
    x_vals = np.interp(x_vals, [x_vals.min(), x_vals.max()], [x0 - width/2, x0 + width/2])
    ax.plot(x_vals, temp_df['Value'], 'o')
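To see what the rank-then-interpolate step in the loop does on its own, here is a minimal sketch without any plotting. The helper name `sorted_x_positions` is made up for this illustration and is not part of seaborn or pandas. It also centers a group that contains a single point, a case the loop above does not handle explicitly; that is an addition of mine, not something from the original code.

```python
import numpy as np
import pandas as pd


def sorted_x_positions(values, center, width):
    """Hypothetical helper: spread points over [center - width/2, center + width/2]
    according to the rank of their value."""
    values = pd.Series(values)
    # rank 1 is the smallest value; ties are broken by order of appearance
    ranks = values.rank(method="first")
    if len(ranks) < 2:
        # assumption: with fewer than two points there is no range to
        # interpolate over, so place the point at the center of the box
        return np.full(len(ranks), float(center))
    return np.interp(ranks, [ranks.min(), ranks.max()],
                     [center - width / 2, center + width / 2])


# the Gene1 values from the question, for the box located at x = 0
gene1 = [80, 1205, 5, 150, 50, 80]
x_positions = sorted_x_positions(gene1, center=0, width=0.8)
print(np.round(x_positions, 2))
```

The smallest value (5) ends up at the left edge of the box, x = -0.4, and the largest (1205) at the right edge, x = 0.4. The two tied values of 80 get neighbouring ranks, so they sit next to each other instead of overlapping.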
Solution if using hue-nesting
In fact, if you are using hue-nesting, the situation is not really more complicated. It's just a matter of knowing the x-coordinates of the different box plots and their width. As it happens, I've recently answered another question that had pretty much the same requirements, so both solutions are pretty close.
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

df = pd.DataFrame({
    "Gene": ["Gene1", "Gene1", "Gene1", "Gene1", "Gene1", "Gene1",
             "Gene2", "Gene2", "Gene2", "Gene2", "Gene2", "Gene2"],
    "Value": [80, 1205, 5, 150, 50, 80,
              12, 5235, 235, 1245, 126, 10],
    "State": ["active", "inactive", "active", "inactive", "active", "active",
              "active", "active", "active", "inactive", "inactive", "inactive"]
})
order = ['Gene1', 'Gene2']
hue_order = ['active', 'inactive']
width = 0.8
# get the offsets used by boxplot when hue-nesting is used
# https://github.com/mwaskom/seaborn/blob/c73055b2a9d9830c6fbbace07127c370389d04dd/seaborn/categorical.py#L367
n_levels = len(hue_order)
each_width = width / n_levels
offsets = np.linspace(0, width - each_width, n_levels)
offsets -= offsets.mean()
fig, ax = plt.subplots()
sns.boxplot(x='Gene', y='Value', hue='State', data=df, orient='v', color='w', fliersize=0, order=order, hue_order=hue_order, width=width, ax=ax)
for x0, o in enumerate(order):
    for h, off in zip(hue_order, offsets):
        # points for one gene / state combination, i.e. one nested box
        temp_df = df[(df['Gene'] == o) & (df['State'] == h)]
        x_vals = temp_df['Value'].rank(method='first')
        # map the ranks onto the extent of the nested box centered at x0 + off
        x_vals = np.interp(x_vals, [x_vals.min(), x_vals.max()], [(x0 + off) - each_width/2, (x0 + off) + each_width/2])
        ax.plot(x_vals, temp_df['Value'], 'o')
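As a quick check of the offset arithmetic, the calculation can be run on its own. The function name `hue_offsets` is just for this sketch and is not a seaborn function. It repeats the calculation used above, which follows the seaborn source at the linked commit, so other seaborn versions may position the nested boxes differently.

```python
import numpy as np


def hue_offsets(width, n_levels):
    """Hypothetical helper: center offsets and width of the nested boxes,
    following the calculation used in the snippet above."""
    each_width = width / n_levels
    # left-aligned centers first, then shift so they are symmetric around 0
    offsets = np.linspace(0, width - each_width, n_levels)
    return offsets - offsets.mean(), each_width


offsets, each_width = hue_offsets(width=0.8, n_levels=2)
for x0 in (0, 1):  # x positions of Gene1 and Gene2
    for off in offsets:
        left = (x0 + off) - each_width / 2
        right = (x0 + off) + each_width / 2
        print(f"box centered at {x0 + off:.1f} spans {left:.1f} to {right:.1f}")
```

With two hue levels and `width = 0.8`, the offsets are -0.2 and 0.2 and each nested box is 0.4 wide, so the 'active' box of Gene2 covers x from 0.6 to 1.0 and the 'inactive' box covers 1.0 to 1.4. Those intervals are the ones the ranks get interpolated onto.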