python pandas seaborn

Order points in seaborn


I'm trying to make a stripplot / point plot / scatter plot in which the points within each category are sorted by their y value (see the example in this forum post). I would like the points in each gene category to be sorted, as in the linked example with its two categories, placebo and full.

How can this be done in seaborn / pandas?

A simple example input would be:

pd.DataFrame({
    "Gene": ["Gene1", "Gene1", "Gene1", "Gene1", "Gene1", "Gene1",
             "Gene2", "Gene2", "Gene2", "Gene2", "Gene2", "Gene2"],
    "Value": [80, 1205, 5, 150, 50, 80,
              12, 5235, 235, 1245, 126, 10]})

And more elaborate:

pd.DataFrame({
    "Gene": ["Gene1", "Gene1", "Gene1", "Gene1", "Gene1", "Gene1",
             "Gene2", "Gene2", "Gene2", "Gene2", "Gene2", "Gene2"],
    "Value": [80, 1205, 5, 150, 50, 80,
              12, 5235, 235, 1245, 126, 10],
    "State": ["active", "inactive", "active", "inactive", "active", "active",
              "active", "active", "active", "inactive", "inactive", "inactive"]})

So the genes are the y-ticks, the values are the points and the activity is the hue.

Example of a stripplot, to illustrate the desired result: [image: example of a stripplot]


Solution

  • Building on @Fourier's answer, I propose the following solution.

    I don't think you can use stripplot to achieve the desired result, but that's fine, since that is not what stripplot is designed for. The situation is fairly straightforward if you don't use hue-nesting: the boxplots are located at x-values 0, 1, 2, ... and have a width that can be set in the call to boxplot (0.8 by default). Knowing this, it is simple to calculate the x-values the points need so that, within each category, they are ordered by value and spread evenly across the width of the box:

    import matplotlib.pyplot as plt
    import numpy as np
    import pandas as pd
    import seaborn as sns
    
    df = pd.DataFrame({
        "Gene": ["Gene1", "Gene1", "Gene1", "Gene1", "Gene1", "Gene1",
                 "Gene2", "Gene2", "Gene2", "Gene2", "Gene2", "Gene2"],
        "Value": [80, 1205, 5, 150, 50, 80,
                  12, 5235, 235, 1245, 126, 10]})
    
    order = ['Gene1', 'Gene2']
    width = 0.8  # same as the boxplot default
    fig, ax = plt.subplots()
    sns.boxplot(x='Gene', y='Value', data=df, orient='v', color='w', fliersize=0, order=order, width=width, ax=ax)
    for x0, o in enumerate(order):
        temp_df = df[df['Gene'] == o]
        # rank the values within the category (ties broken by order of appearance)
        x_vals = temp_df['Value'].rank(method='first')
        # map the ranks linearly onto the horizontal extent of the box centered at x0
        x_vals = np.interp(x_vals, [x_vals.min(), x_vals.max()], [x0 - width/2, x0 + width/2])
        ax.plot(x_vals, temp_df['Value'], 'o')
    

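To make the rank-then-interpolate step concrete, here is a minimal sketch that isolates it from the plotting code. The helper name `sorted_x_positions` is hypothetical (it is not part of seaborn or pandas), and the guard for groups with fewer than two points is an added assumption, since the interpolation range collapses when the smallest and largest rank coincide.

```python
import numpy as np
import pandas as pd


def sorted_x_positions(values, center, width=0.8):
    """Return x-positions that spread `values` across a box in ascending order.

    Hypothetical helper: `center` is the x-coordinate of the box and
    `width` is its full width (0.8 mirrors the value used above).
    """
    values = pd.Series(values)
    # Assumption: with fewer than two points there is nothing to spread,
    # so place the point(s) on the box center.
    if len(values) < 2:
        return np.full(len(values), float(center))
    # Rank 1 is the smallest value; ties are broken by order of appearance.
    ranks = values.rank(method="first")
    # Map rank 1 to the left edge of the box and the highest rank to the right edge.
    return np.interp(ranks, [ranks.min(), ranks.max()],
                     [center - width / 2, center + width / 2])


gene1 = [80, 1205, 5, 150, 50, 80]
x = sorted_x_positions(gene1, center=0)
print(np.round(x, 2))
```

For the Gene1 values and a box centered at 0, the positions come out at approximately -0.08, 0.40, -0.40, 0.24, -0.24 and 0.08: the smallest value (5) sits on the left edge, the largest (1205) on the right edge, and the two tied values of 80 land next to each other.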

    Solution if using hue-nesting

    With hue-nesting, the situation is not really more complicated: you only need to know the x-coordinates of the individual boxes and their width. As it happens, I recently answered another question with much the same requirements, so the two solutions are very similar.

    import matplotlib.pyplot as plt
    import numpy as np
    import pandas as pd
    import seaborn as sns
    
    df = pd.DataFrame({
        "Gene": ["Gene1", "Gene1", "Gene1", "Gene1", "Gene1", "Gene1",
                 "Gene2", "Gene2", "Gene2", "Gene2", "Gene2", "Gene2"],
        "Value": [80, 1205, 5, 150, 50, 80,
                  12, 5235, 235, 1245, 126, 10],
        "State": ["active", "inactive", "active", "inactive", "active", "active",
                  "active", "active", "active", "inactive", "inactive", "inactive"]
    })
    
    
    order = ['Gene1','Gene2']
    hue_order = ['active','inactive']
    width = 0.8
    # get the offsets used by boxplot when hue-nesting is used
    # https://github.com/mwaskom/seaborn/blob/c73055b2a9d9830c6fbbace07127c370389d04dd/seaborn/categorical.py#L367
    n_levels = len(hue_order)
    each_width = width / n_levels
    offsets = np.linspace(0, width - each_width, n_levels)
    offsets -= offsets.mean()
    
    fig, ax = plt.subplots()
    sns.boxplot(x='Gene',y='Value',hue='State', data=df, orient='v', color='w', fliersize=0, order=order, hue_order=hue_order, width=width, ax=ax)
    
    for x0, o in enumerate(order):
        for h, off in zip(hue_order, offsets):
            # select the points belonging to one box (one gene, one state)
            temp_df = df[(df['Gene'] == o) & (df['State'] == h)]
            x_vals = temp_df['Value'].rank(method='first')
            # the box is centered at x0+off and is each_width wide
            x_vals = np.interp(x_vals, [x_vals.min(), x_vals.max()], [(x0 + off) - each_width/2, (x0 + off) + each_width/2])
            ax.plot(x_vals, temp_df['Value'], 'o')
    

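The offset arithmetic used for hue-nesting can be checked on its own. The sketch below wraps the same four lines in a hypothetical helper, `hue_offsets`, and evaluates it for two hue levels. It mirrors the calculation from the seaborn commit linked in the code comment; other seaborn versions may position nested boxes differently, so treat the numbers as an illustration of this approach rather than a guarantee about the library.

```python
import numpy as np


def hue_offsets(n_levels, width=0.8):
    """Return (offsets, each_width) for `n_levels` boxes nested in one category.

    Hypothetical helper reproducing the calculation shown above:
    the category width is split evenly and the boxes are centered on the tick.
    """
    each_width = width / n_levels
    # Box centers measured from the left-most box ...
    offsets = np.linspace(0, width - each_width, n_levels)
    # ... then shifted so that they are symmetric around the category tick.
    offsets = offsets - offsets.mean()
    return offsets, each_width


offsets, each_width = hue_offsets(2)
# Centers of the 'active' and 'inactive' boxes for the second category (x0 = 1)
centers = 1 + offsets
print(offsets, each_width, centers)
```

With two hue levels and a total width of 0.8, each box is 0.4 wide and the offsets are -0.2 and 0.2, so the boxes of the category at x0 = 1 are centered at 0.8 and 1.2.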