Tags: python, pandas, seaborn, boxplot, catplot

How to create boxplots from a pandas column of strings


I'm trying to plot arrays from a dataframe as boxplots, as in the second picture here.

An extract of my data (I have six years of data, 150 entries per year):

columns : idx | id | mods | Mean(Moyennes) | Median | Values_array | date2021

idx1 | 2021012 | Day | 273.7765808105 | 273.5100097656 | 272.3800048828,272.3800048828,272.3999938965,272.3999938965,276.5199890137,274.3800048828,274.3800048828 |2021-12-01T00:00:00.000Z

idx2 | 2021055 | Night| 287.5215759277 | 287.6099853516 | 286.0400085449,286.0400085449,286.0400085449,286.0400085449,284.8599853516,285.0400085449,285.0400085449,286.7200012207,286.799987793,286.799987793,287,288.2399902344,288.2399902344 |2021-02-24T00:00:00.000Z

Here is my data plotted with sns.relplot

To plot it, I tried :

sns.boxplot(data=df2018, x="Moyennes", y="date2018", hue = "mods")

It turns out, it looks like this

I don't understand why the dates come out like this rather than as they do with sns.relplot. Also, I want to draw one boxplot per array as a whole because, as I understand it, the function needs the full array to compute the mean, median, etc.

I also tried :

for i, j in sorted(df2017.iterrows()):
    values = j[4]
    date = j[6]
    id=j[0]
    fig, ax1 = plt.subplots(figsize=(10, 6))
    fig.canvas.manager.set_window_title('Température 2020')
    fig.subplots_adjust(left=0.075, right=0.95, top=0.9, bottom=0.25)
    bp = ax1.boxplot(values, notch=False, sym='+', vert=True, whis=1.5)
    plt.setp(bp['boxes'], color='black')
    plt.setp(bp['whiskers'], color='black')
    plt.setp(bp['fliers'], color='red', marker='+')

The output looks like this, which is nice, but I want every boxplot for one year to be in the same plot, like this.

I'm working in VS Code on a Linux VM.

My question is, how can I boxplot several arrays with seaborn?


Solution

    import pandas as pd
    import seaborn as sns
    
    # sample data
    data = {'idx': ['idx1 ', 'idx2 '],
            'id': [2021012, 2021055],
            'mods': ['Day', 'Night'],
            'Mean(Moyennes)': [273.7765808105, 287.5215759277],
            'Median': [273.5100097656, 287.6099853516],
            'Values_array': ['272.3800048828,272.3800048828,272.3999938965,272.3999938965,276.5199890137,274.3800048828,274.3800048828', '286.0400085449,286.0400085449,286.0400085449,286.0400085449,284.8599853516,285.0400085449,285.0400085449,286.7200012207,286.799987793,286.799987793,287,288.2399902344,288.2399902344'],
            'date2021': ['2021-12-01T00:00:00.000Z', '2021-02-24T00:00:00.000Z']}
    df = pd.DataFrame(data)
    
    # convert the column to a datetime.date type since there's no time component
    df.date2021 = pd.to_datetime(df.date2021).dt.date
    
    # split the strings in the Values_array column
    df.Values_array = df.Values_array.str.split(',')
    
    # explode the list of strings to individual rows
    df = df.explode(column='Values_array', ignore_index=True)
    
    # set the type of the Values_array column to float
    df.Values_array = df.Values_array.astype(float)
    
    # plot the data in a single facet
    g = sns.catplot(data=df, x='date2021', y='Values_array', kind='box')
    


    # same plot with sns.boxplot (axes-level, returns a matplotlib Axes) instead of sns.catplot (figure-level)
    ax = sns.boxplot(data=df, x='date2021', y='Values_array')
    

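The question's own attempt colored the boxes with hue="mods". A minimal sketch of how that could be combined with the cleaning steps above follows. It assumes a shortened, rounded version of the sample values, and the name date_order is only illustrative. Passing order sorts the boxes chronologically, and dodge=False keeps each box centered when a date has only one mods value.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# shortened, rounded sample in the same layout as the question's data (illustrative only)
df = pd.DataFrame({
    'mods': ['Day', 'Night'],
    'Values_array': ['272.38,272.40,276.52,274.38', '286.04,284.86,286.72,288.24'],
    'date2021': ['2021-12-01T00:00:00.000Z', '2021-02-24T00:00:00.000Z'],
})

# same cleaning steps as above: parse dates, split, explode, cast to float
df['date2021'] = pd.to_datetime(df['date2021']).dt.date
df['Values_array'] = df['Values_array'].str.split(',')
df = df.explode('Values_array', ignore_index=True)
df['Values_array'] = df['Values_array'].astype(float)

# chronological order for the x-axis categories
date_order = sorted(df['date2021'].unique())

fig, ax = plt.subplots(figsize=(10, 6))
sns.boxplot(data=df, x='date2021', y='Values_array', hue='mods',
            order=date_order, dodge=False, ax=ax)

# rotate the date labels so they stay readable with many dates per year
ax.tick_params(axis='x', rotation=45)
fig.tight_layout()
```

With around 150 dates per year the x-axis gets crowded, so a wider figsize or showing only every n-th tick label may be needed.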

    df before cleaning

         idx       id   mods  Mean(Moyennes)      Median                                                                                                                                                                           Values_array                  date2021
    0  idx1   2021012    Day      273.776581  273.510010                                                                               272.3800048828,272.3800048828,272.3999938965,272.3999938965,276.5199890137,274.3800048828,274.3800048828  2021-12-01T00:00:00.000Z
    1  idx2   2021055  Night      287.521576  287.609985  286.0400085449,286.0400085449,286.0400085449,286.0400085449,284.8599853516,285.0400085449,285.0400085449,286.7200012207,286.799987793,286.799987793,287,288.2399902344,288.2399902344  2021-02-24T00:00:00.000Z
    

    df after cleaning

          idx       id   mods  Mean(Moyennes)      Median  Values_array    date2021
    0   idx1   2021012    Day      273.776581  273.510010    272.380005  2021-12-01
    1   idx1   2021012    Day      273.776581  273.510010    272.380005  2021-12-01
    2   idx1   2021012    Day      273.776581  273.510010    272.399994  2021-12-01
    3   idx1   2021012    Day      273.776581  273.510010    272.399994  2021-12-01
    4   idx1   2021012    Day      273.776581  273.510010    276.519989  2021-12-01
    5   idx1   2021012    Day      273.776581  273.510010    274.380005  2021-12-01
    6   idx1   2021012    Day      273.776581  273.510010    274.380005  2021-12-01
    7   idx2   2021055  Night      287.521576  287.609985    286.040009  2021-02-24
    8   idx2   2021055  Night      287.521576  287.609985    286.040009  2021-02-24
    9   idx2   2021055  Night      287.521576  287.609985    286.040009  2021-02-24
    10  idx2   2021055  Night      287.521576  287.609985    286.040009  2021-02-24
    11  idx2   2021055  Night      287.521576  287.609985    284.859985  2021-02-24
    12  idx2   2021055  Night      287.521576  287.609985    285.040009  2021-02-24
    13  idx2   2021055  Night      287.521576  287.609985    285.040009  2021-02-24
    14  idx2   2021055  Night      287.521576  287.609985    286.720001  2021-02-24
    15  idx2   2021055  Night      287.521576  287.609985    286.799988  2021-02-24
    16  idx2   2021055  Night      287.521576  287.609985    286.799988  2021-02-24
    17  idx2   2021055  Night      287.521576  287.609985    287.000000  2021-02-24
    18  idx2   2021055  Night      287.521576  287.609985    288.239990  2021-02-24
    19  idx2   2021055  Night      287.521576  287.609985    288.239990  2021-02-24
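The question mentions that the boxplot needs the whole array to compute the median and related statistics. Once the data is in the long form shown above, those per-date statistics can be checked directly with a groupby. This is a sketch using the two sample rows from the question; the names long_df and summary are illustrative.

```python
import pandas as pd

# the two sample rows from the question (dates kept as plain strings here)
df = pd.DataFrame({
    'date2021': ['2021-12-01', '2021-02-24'],
    'Values_array': [
        '272.3800048828,272.3800048828,272.3999938965,272.3999938965,276.5199890137,274.3800048828,274.3800048828',
        '286.0400085449,286.0400085449,286.0400085449,286.0400085449,284.8599853516,285.0400085449,285.0400085449,286.7200012207,286.799987793,286.799987793,287,288.2399902344,288.2399902344',
    ],
})

# split, explode and cast, as in the cleaning steps
long_df = df.assign(Values_array=df['Values_array'].str.split(','))
long_df = long_df.explode('Values_array', ignore_index=True)
long_df['Values_array'] = long_df['Values_array'].astype(float)

# count, mean, std, min, quartiles and max for each date
summary = long_df.groupby('date2021')['Values_array'].describe()
print(summary)
```

The boxes are drawn from the exploded values, so the 25%, 50% and 75% columns of summary are what each box represents; the precomputed Mean(Moyennes) and Median columns are not used by the plot.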