python ggplot2 plotnine

Custom colors for median segments in plotnine geom_boxplot


I want to draw the median segments in geom_boxplot() with custom colors. I found a solution for R-ggplot2 that uses ggplot_build(), which provides the x, xend, y, yend inputs for geom_segment() to overlay the median segments on the boxplot.

I couldn't find functionality equivalent to ggplot_build() in plotnine, so my approach is to construct a new dataframe that holds these 4 values for each group, as needed by geom_segment.

To that end, I know how to get y and yend for each group: these values are the group medians. However, I'm not clear on how to calculate x and xend, since I need an x-value for each group (in my use case, the group names are of type str). Additionally, I need the box width used by geom_boxplot().

Any suggestions on how to extract/calculate those?

Thanks!


Solution

  • Instead of extracting the median values from the dataset that geom_boxplot creates under the hood, you can build an aggregated dataframe of the group medians and use it to draw the median lines with geom_segment. (As an R user, I would guess this is how most R users would approach the problem.) The tricky part is getting the x and xend coordinates for the groups. A discrete x scale places the groups at positions 1, 2, 3, ..., so I use pd.factorize to convert the group column to a sequence of integer codes (0-based, hence the + 1) and then subtract and add half of the default box plot width of .75 to get x and xend.

    Using a minimal reproducible example based on the mtcars dataset:

    from plotnine import ggplot, geom_boxplot, aes, geom_segment
    from plotnine.data import mtcars
    import pandas as pd

    # One row per group, holding the group median (used for both y and yend)
    df_median = mtcars.groupby("cyl")["mpg"].median().reset_index()

    # Discrete x positions are 1, 2, 3, ...; pd.factorize returns 0-based codes
    center = pd.factorize(df_median["cyl"])[0] + 1
    box_width = 0.75  # default box plot width, as stated above
    df_median["x"] = center - box_width / 2
    df_median["xend"] = center + box_width / 2

    (ggplot(mtcars, aes("factor(cyl)", "mpg"))
     + geom_boxplot()
     + geom_segment(
         mapping=aes(x="x", xend="xend", y="mpg", yend="mpg", color="factor(cyl)"),
         data=df_median, size=1))
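
Since the question mentions groups with str names, the same calculation can be wrapped in a small helper. This is only a sketch: `median_segments` and its parameter names are hypothetical (not part of plotnine or pandas), and it assumes the discrete x scale orders the groups the same way a sorted groupby does, which you should verify against your own plot.

```python
import pandas as pd


def median_segments(df, group, value, width=0.75):
    """Return one row per group with its median and segment endpoints.

    Assumption: groups sit at x = 1, 2, 3, ... in sorted order,
    and `width` matches the box width used by geom_boxplot.
    """
    # groupby sorts the group keys, so row i belongs to position i + 1
    med = df.groupby(group, sort=True)[value].median().reset_index()
    centers = pd.Series(range(1, len(med) + 1), index=med.index, dtype=float)
    med["x"] = centers - width / 2
    med["xend"] = centers + width / 2
    return med


# Small made-up example with string group names
example = pd.DataFrame({"g": ["b", "a", "b", "a", "c"],
                        "v": [1.0, 2.0, 3.0, 4.0, 5.0]})
segments = median_segments(example, "g", "v")
print(segments)
```

The result can be passed as `data` to geom_segment with `aes(x="x", xend="xend", y="v", yend="v", color="g")`. Adding `scale_color_manual(values=[...])` with one color per group should then give the custom median colors asked for in the question.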
    
