python pandas upsetplot

set analysis: create pandas series with intersections as index and values as counts


I've been trying all day to make this work and it's starting to make me angry! All I want to do is create the pandas Series that upsetplot needs as input, as detailed here:

https://pypi.org/project/upsetplot/

I don't understand how the generate_data function is manipulating its sets to make a series. I would have assumed that there was a simple way to do this by calling set(), but I can't seem to find it.

So I instead began manipulating my dataframes directly but suspected the attempts were misguided.

Thus I resort to providing a simple dataframe below and pray that some kind soul can enlighten me.

import pandas as pd
from matplotlib import pyplot as plt
from upsetplot import generate_data, plot

df = pd.DataFrame({'john':[1,2,3,5,7,8],
              'jerry':[1,2,5,7,9,2],
              'josie':[2,2,3,2,5,6],
              'jean':[6,5,7,6,2,4]})

df = pd.DataFrame({'john':[True,False,True,False,True,False],
              'jerry':[True,True,False,True,False,True],
              'josie':[True,False,False,True,False,False],
              'jean':[True,False,False,True,False,False],
              'food':['apple','carrot','choc','bread','ham','nut']})

The example from the package home page:

from upsetplot import generate_data
example = generate_data(aggregated=True)
example  # doctest: +NORMALIZE_WHITESPACE
set0   set1   set2
False  False  False      56
              True      283
       True   False    1279
              True     5882
True   False  False      24
              True       90
       True   False     429
              True     1957
Name: value, dtype: int64

Solution

  • Count the rows for each combination of True/False values with GroupBy.size, grouping by every column except food:

    df = pd.DataFrame({'john':[True,False,True,False,True,False],
                  'jerry':[True,True,False,True,False,True],
                  'josie':[True,False,False,True,False,False],
                  'jean':[True,False,False,True,False,False],
                  'food':['apple','carrot','choc','bread','ham','nut']})
    
    cols = df.columns.difference(['food']).tolist()
    s = df.groupby(cols).size()
    print(s)
    jean   jerry  john   josie
    False  False  True   False    2
           True   False  False    2
    True   True   False  True     1
                  True   True     1
    dtype: int64
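
Since the question mentions starting from sets, here is a minimal pandas-only sketch of the same idea when the data begins as plain Python sets rather than a boolean DataFrame. The `contents` dictionary and the names `items`, `membership` and `counts` are made up for illustration; the sets mirror the True/False columns of the sample frame above.

```python
import pandas as pd

# Hypothetical input: one Python set of foods per person
# (mirrors the boolean columns of the sample DataFrame).
contents = {
    'john': {'apple', 'choc', 'ham'},
    'jerry': {'apple', 'carrot', 'bread', 'nut'},
    'josie': {'apple', 'bread'},
    'jean': {'apple', 'bread'},
}

# Every item that appears in at least one set, in a stable order.
items = sorted(set().union(*contents.values()))

# One boolean column per set: True where the item belongs to that set.
membership = pd.DataFrame(
    {name: [item in members for item in items]
     for name, members in contents.items()},
    index=items,
)

# Count the items falling into each combination of memberships.
# The result is a Series indexed by a boolean MultiIndex.
counts = membership.groupby(list(membership.columns)).size()
print(counts)
```

The resulting Series has the same shape as the generate_data(aggregated=True) example, with one boolean index level per set and the counts as values, so it should be usable wherever upsetplot expects that kind of input. Combinations that never occur are simply absent from the index. upsetplot may also ship its own helpers for this conversion, so it is worth checking its documentation before rolling your own.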