python seaborn sframe

Draw a boxplot using an sframe as data source


I'm practicing my ML Classification skills on The Billionaire Characteristics Database dataset.

I'm using sframe for loading and manipulating the data and seaborn for visualization.

In the process of data analysis, I wanted to draw a box plot grouped by a categorical variable, like this one from the seaborn tutorial: [image: box plot grouped by categorical value]

In the dataset, there's a networthusbillion numerical variable and a selfmade categorical variable that states whether a billionaire is self-made or (s)he has inherited the bucks.

When I try to draw a similar box plot using sns.boxplot(x='selfmade', y='networthusbillion', data=billionaires), it throws the following error:

---------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-17-f4bd651c2ae7> in <module>()
----> 1 sns.boxplot(x='selfmade', y='networthusbillion', data=billionaires)

/home/iulian/.virtualenvs/data-science-python2/lib/python2.7/site-packages/seaborn/categorical.pyc in boxplot(x, y, hue, data, order, hue_order, orient, color, palette, saturation, width, fliersize, linewidth, whis, notch, ax, **kwargs)
   2127     plotter = _BoxPlotter(x, y, hue, data, order, hue_order,
   2128                           orient, color, palette, saturation,
-> 2129                           width, fliersize, linewidth)
   2130 
   2131     if ax is None:

/home/iulian/.virtualenvs/data-science-python2/lib/python2.7/site-packages/seaborn/categorical.pyc in __init__(self, x, y, hue, data, order, hue_order, orient, color, palette, saturation, width, fliersize, linewidth)
    420                  width, fliersize, linewidth):
    421 
--> 422         self.establish_variables(x, y, hue, data, orient, order, hue_order)
    423         self.establish_colors(color, palette, saturation)
    424 

/home/iulian/.virtualenvs/data-science-python2/lib/python2.7/site-packages/seaborn/categorical.pyc in establish_variables(self, x, y, hue, data, orient, order, hue_order, units)
    136             # See if we need to get variables from `data`
    137             if data is not None:
--> 138                 x = data.get(x, x)
    139                 y = data.get(y, y)
    140                 hue = data.get(hue, hue)

AttributeError: 'SFrame' object has no attribute 'get'

I also tried the following forms to draw the box plot, but neither of them worked:

sns.boxplot(x=billionaires['selfmade'], y=billionaires['networthusbillion'])
sns.boxplot(x='selfmade', y='networthusbillion', data=billionaires['selfmade', 'networthusbillion'])

However, I could draw a box plot using sframe, but without grouping by selfmade:

sns.boxplot(x=billionaires['networthusbillion'])

So, my question is: Is there a way to draw a box plot grouped by a categorical variable using an sframe? Maybe I'm doing something wrong?

By the way, I managed to draw it with a pandas.DataFrame and the same syntax (sns.boxplot(x='selfmade', y='networthusbillion', data=data), where data is the DataFrame), so maybe grouping using an sframe with seaborn is just not implemented yet.


Solution

  • TL;DR

    Grouping using an sframe with seaborn is just not implemented yet.


    After digging into seaborn's source code, I found out that it is designed specifically to work with pandas.DataFrame. Following absolutelyNoWarranty's suggestion in their answer, I got the following error:

    TypeError: __getitem__() takes exactly 2 arguments (3 given)
    

    Inspecting the arguments passed to the get function at call time shows this:

    ('gender', 'gender')
    

    This happens because of the following code in seaborn's establish_variables (categorical.py), which the box plotter calls:

    # See if we need to get variables from `data`
    if data is not None:
        x = data.get(x, x)
        y = data.get(y, y)
        hue = data.get(hue, hue)
        units = data.get(units, units)
    

    Each call looks the name up in data and falls back to the name itself if no such column exists. This causes an error in __getitem__(), because it gets called with the arguments (self, 'gender', 'gender').

    I tried to rewrite the get() function as follows:

    def get(self, *args):
        # `units` in the seaborn source is None for boxplots,
        # so return None when the key is falsy.
        return self.__getitem__(args[0]) if args[0] else None
    

    That led to the error that ended my attempts:

    TypeError: 'SArray' object is not callable
    

    The seaborn source code checks whether the y data is a pd.Series and, if it is not, converts it into one:

    if not isinstance(vals, pd.Series):
        vals = pd.Series(vals)
    
    # Group the val data
    grouped_vals = vals.groupby(grouper)
    

    When vals.groupby(grouper) executes, grouper is still an SArray instance. Inside pandas' groupby internals the SArray gets called, and that is where the error is thrown. End of story.
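
The failure described above, and one way to sidestep it, can be sketched with the standard library alone. This is an illustration under stated assumptions, not seaborn's or SFrame's actual code: ColumnFrame is a hypothetical stand-in for an SFrame, as_plot_data is a hypothetical helper name, the column values are made up, and a plain dict stands in for the pandas.DataFrame that the real to_dataframe() would return.

```python
class ColumnFrame(object):
    """Hypothetical stand-in for an SFrame: columns are read with one key."""

    def __init__(self, columns):
        self._columns = dict(columns)

    def __getitem__(self, key):
        return self._columns[key]

    def to_dataframe(self):
        # Assumption: the real SFrame.to_dataframe() returns a
        # pandas.DataFrame. A plain dict is used here only because it
        # also offers the get(key, default) call that seaborn makes.
        return dict(self._columns)


def as_plot_data(frame):
    """Convert SFrame-like objects via to_dataframe(); pass others through."""
    convert = getattr(frame, "to_dataframe", None)
    return convert() if callable(convert) else frame


# Made-up sample values, for illustration only.
frame = ColumnFrame({"gender": ["male", "female"],
                     "networthusbillion": [1.2, 3.4]})

# 1) Aliasing get to __getitem__ fails on the two-argument call
#    data.get(x, x), because __getitem__ accepts a single key.
ColumnFrame.get = ColumnFrame.__getitem__
try:
    frame.get("gender", "gender")
    alias_error = None
except TypeError as exc:
    alias_error = exc

# 2) After conversion, get(key, default) behaves as seaborn expects:
#    a known name returns the column, an unknown one returns the default.
plot_data = as_plot_data(frame)
gender = plot_data.get("gender", "gender")
units = plot_data.get(None, None)
```

With a real SFrame, the same idea would be to pass the converted frame to seaborn, for example sns.boxplot(x='selfmade', y='networthusbillion', data=billionaires.to_dataframe()), assuming the installed sframe version provides to_dataframe(). This matches the observation in the question that the call works with a pandas.DataFrame.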