I'm practicing my ML Classification skills on The Billionaire Characteristics Database dataset.
I'm using sframe for loading and manipulating the data and seaborn for visualization.
In the process of data analysis, I wanted to draw a box plot grouped by a categorical variable, like this one from the seaborn tutorial:
In the dataset, there's a networthusbillion numerical variable and a selfmade categorical variable that states whether a billionaire is self-made or (s)he has inherited the bucks.
When I try to draw a similar box plot using sns.boxplot(x='selfmade', y='networthusbillion', data=billionaires), it throws the following error:
---------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-17-f4bd651c2ae7> in <module>()
----> 1 sns.boxplot(x='selfmade', y='networthusbillion', data=billionaires)
/home/iulian/.virtualenvs/data-science-python2/lib/python2.7/site-packages/seaborn/categorical.pyc in boxplot(x, y, hue, data, order, hue_order, orient, color, palette, saturation, width, fliersize, linewidth, whis, notch, ax, **kwargs)
2127 plotter = _BoxPlotter(x, y, hue, data, order, hue_order,
2128 orient, color, palette, saturation,
-> 2129 width, fliersize, linewidth)
2130
2131 if ax is None:
/home/iulian/.virtualenvs/data-science-python2/lib/python2.7/site-packages/seaborn/categorical.pyc in __init__(self, x, y, hue, data, order, hue_order, orient, color, palette, saturation, width, fliersize, linewidth)
420 width, fliersize, linewidth):
421
--> 422 self.establish_variables(x, y, hue, data, orient, order, hue_order)
423 self.establish_colors(color, palette, saturation)
424
/home/iulian/.virtualenvs/data-science-python2/lib/python2.7/site-packages/seaborn/categorical.pyc in establish_variables(self, x, y, hue, data, orient, order, hue_order, units)
136 # See if we need to get variables from `data`
137 if data is not None:
--> 138 x = data.get(x, x)
139 y = data.get(y, y)
140 hue = data.get(hue, hue)
AttributeError: 'SFrame' object has no attribute 'get'
I also tried the following forms to draw the box plot, but neither of them worked:
sns.boxplot(x=billionaires['selfmade'], y=billionaires['networthusbillion'])
sns.boxplot(x='selfmade', y='networthusbillion', data=billionaires['selfmade', 'networthusbillion'])
However, I could draw a box plot using an sframe, but without grouping by selfmade:
sns.boxplot(x=billionaires['networthusbillion'])
So, my question is: Is there a way to draw a box plot grouped by a categorical variable using an sframe? Maybe I'm doing something wrong?
By the way, I managed to draw it with a pandas.DataFrame using the same syntax (sns.boxplot(x='selfmade', y='networthusbillion', data=data)), so maybe grouping using an sframe with seaborn is just not implemented yet.
Grouping using an sframe with seaborn is just not implemented yet.
After digging into seaborn's source code, I found out that it is designed specifically to work with pandas.DataFrame. Following absolutelyNoWarranty's suggestion in their answer, I got the following error:
TypeError: __getitem__() takes exactly 2 arguments (3 given)
Looking at the args passed to the get function on that call, there's this data:
('gender', 'gender')
This happens because of this code in the source code for BoxPlot:
# See if we need to get variables from `data`
if data is not None:
    x = data.get(x, x)
    y = data.get(y, y)
    hue = data.get(hue, hue)
    units = data.get(units, units)
It tries to get the value and uses the same value as a fallback in case it doesn't exist. This causes an error in __getitem__(), because it gets called with the arguments (self, 'gender', 'gender').
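The mismatch can be reproduced without sframe at all. The sketch below is only an illustration: ColumnStore is a hypothetical stand-in for a frame that supports indexing but has no dict-style get(), and aliasing get to __getitem__ is an assumption about what the suggested patch amounted to. get_with_default is likewise a hypothetical helper showing the dict.get-style signature that the seaborn code above expects.

```python
class ColumnStore(object):
    """Hypothetical stand-in for a frame with indexing but no get()."""

    def __init__(self, columns):
        self._columns = dict(columns)

    def __getitem__(self, key):
        return self._columns[key]


# Assumed patch: alias get to __getitem__, which only accepts one key.
ColumnStore.get = ColumnStore.__getitem__

store = ColumnStore({'gender': ['male', 'female']})

try:
    # The two-argument form used by seaborn: get(key, fallback).
    store.get('gender', 'gender')
    two_arg_call_failed = False
except TypeError:
    two_arg_call_failed = True


def get_with_default(frame, key, default=None):
    """Look up key in frame, returning default when the lookup fails."""
    try:
        return frame[key]
    except (KeyError, TypeError):
        return default
```

A get() with this shape gets past the argument-count error, but as described next, it does not make the rest of the plotting code accept non-pandas columns.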
I tried to rewrite the get() function as follows:
def get(self, *args):
    # The `None` is here because the `units` in the source code is `None` for boxplots.
    return self.__getitem__(args[0]) if args[0] else None
And here I got the error that ended my tries:
TypeError: 'SArray' object is not callable
Taking a look at the source code, it checks whether the y data is a pd.Series and, if not, converts the y value into one:
if not isinstance(vals, pd.Series):
    vals = pd.Series(vals)

# Group the val data
grouped_vals = vals.groupby(grouper)
When executing vals.groupby(grouper) (with grouper still being an SArray instance), it goes into the pandas core internals, where the SArray gets called and the error is thrown. End of story.
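Since the same call works with a pandas.DataFrame, the practical way around this is to convert the data before handing it to seaborn. A minimal sketch, assuming the object exposes a to_dataframe() method (SFrame does); as_pandas is a hypothetical helper name, not part of sframe or seaborn:

```python
def as_pandas(data):
    """Return data.to_dataframe() when that method exists, else data unchanged."""
    convert = getattr(data, 'to_dataframe', None)
    return convert() if callable(convert) else data


# Intended usage (assumes seaborn is imported as sns and billionaires is an SFrame):
# sns.boxplot(x='selfmade', y='networthusbillion', data=as_pandas(billionaires))
```

The conversion copies the data into memory as a pandas.DataFrame, so for a large sframe it may be worth selecting only the columns needed for the plot first.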