
In Pandas, how do I bin floats and count string categories from another column, where the bins are rolling, overlapping numeric ranges of constant size?


I have a table of float values and strings such as the following:

FltVal  Category
0.01    A
0.02    A
0.05    B
0.31    B
0.36    A
0.37    C
0.41    B
I would like to produce a new dataframe that bins the float values into bins of size 0.3 (so each bin is (x, x + 0.3]) that overlap in 0.01 increments. Additionally, I'd like a column tallying the number of records falling into each bin, as well as a column tallying each 'Category' total for the bin.

I first tried to get a basic count going with groupby and pd.cut (before trying to find counts for each Category):

import numpy as np
import pandas as pd

floats = pd.Series([0.01, 0.02, 0.05, 0.31, 0.36, 0.37, 0.41])
categories = pd.Series(['A', 'A', 'B', 'B', 'A', 'C', 'B'])
data = {"FltVal": floats, "Category": categories}
df = pd.concat(data, axis=1)

grouped_vals = df.groupby(pd.cut(df['FltVal'], np.arange(df['FltVal'].min(), df['FltVal'].max(), 0.3))).count()

Output:

              FltVal  Category
FltVal
(0.01, 0.31]       3         3

I think the problem there was that np.arange stops before max, so it only produced a single pair of edges, i.e. one bin — and pd.cut has no way to consider overlapping bins regardless.
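
Sure enough, checking the edges directly confirms only one bin was produced:

np.arange(df['FltVal'].min(), df['FltVal'].max(), 0.3)
# array([0.01, 0.31])  two edges, i.e. the single (0.01, 0.31] interval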

Then, figuring I could just throw out nonsensical ranges (like negative ones) afterward, I tried the following:

FltVal_range = np.arange(df['FltVal'].min(), df['FltVal'].max(), 0.01)
FltVal_range_from = FltVal_range - 0.3
FltVal_range_to = FltVal_range
FltVal_intervals = pd.IntervalIndex.from_arrays(FltVal_range_from, FltVal_range_to)
binned_df = df.groupby(pd.cut(df['FltVal'], FltVal_intervals))

But got the ValueError message:

Overlapping IntervalIndex is not accepted.

If pd.cut doesn't accept overlapping bins, I guess that approach is a non-starter.
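
For what it's worth, a quick check suggests the IntervalIndex itself tolerates overlap; it's pd.cut specifically that refuses it:

ivx = pd.IntervalIndex.from_arrays([0.0, 0.1], [0.3, 0.4])
ivx.is_overlapping  # True: the index itself is legal
pd.cut([0.2], ivx)  # ValueError: 0.2 falls in both (0.0, 0.3] and (0.1, 0.4]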

Regardless, the desired output is the following:

FloatBin    Count  Cat_A  Cat_B  Cat_C
0.00, 0.30      3      2      1      0
0.01, 0.31      3      1      2      0
0.02, 0.32      2      0      2      0
0.03, 0.33      2      0      2      0
0.04, 0.34      2      0      2      0
0.05, 0.35      1      0      1      0
0.06, 0.36      2      1      1      0
0.07, 0.37      3      1      1      1
0.08, 0.38      3      1      1      1
0.09, 0.39      3      1      1      1
0.10, 0.40      3      1      1      1
0.11, 0.41      4      1      2      1

An efficient approach is necessary, as the actual dataframe will have hundreds of thousands to millions of rows.

I'm thinking of something similar to this answer but suitable for finding counts for all my categories.

All help is much appreciated!


Solution

  • One option using janitor's conditional_join:

    import janitor
    import numpy as np
    import pandas as pd

    # One lower edge per 0.01 step across the data range.
    FltVal_range = np.arange(df['FltVal'].min(), df['FltVal'].max(), 0.01)

    # Per-Category counts for each unique FltVal; margins=True adds a
    # 'Count' total column, and .drop('Count') removes the margin row.
    tmp = (pd.crosstab(df['FltVal'], df['Category'],
                       margins=True, margins_name='Count')
             .drop('Count')
           )

    # Pair every value with every bin satisfying low < FltVal <= high,
    # then aggregate the counts per (low, high) bin.
    out = (pd.DataFrame({'low': FltVal_range, 'high': FltVal_range + 0.3})
             .conditional_join(tmp.reset_index(),
                               ('low', 'FltVal', '<'),
                               ('high', 'FltVal', '>='))
             .groupby(['low', 'high'])[list(tmp)].sum()
          )

    Output:

               A  B  C  Count
    low  high                
    0.01 0.31  1  2  0      3
    0.02 0.32  0  2  0      2
    0.03 0.33  0  2  0      2
    0.04 0.34  0  2  0      2
    0.05 0.35  0  1  0      1
    0.06 0.36  1  1  0      2
    0.07 0.37  1  1  1      3
    0.08 0.38  1  1  1      3
    0.09 0.39  1  1  1      3
    0.10 0.40  1  1  1      3
    0.11 0.41  1  2  1      4
    0.12 0.42  1  2  1      4
    0.13 0.43  1  2  1      4
    0.14 0.44  1  2  1      4
    0.15 0.45  1  2  1      4
    ...
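
  • Another option, avoiding the janitor dependency: a minimal numpy sketch of the same (low, high] counting, using two binary searches per category instead of a join. It assumes df from the question; the Cat_ column names simply mirror the desired output.

    import numpy as np
    import pandas as pd

    # Sliding lower/upper edges; rounding guards against float drift
    # from np.arange (e.g. 0.060000000000000005).
    low = np.arange(df['FltVal'].min(), df['FltVal'].max(), 0.01).round(2)
    high = (low + 0.3).round(2)

    out = pd.DataFrame({'low': low, 'high': high})
    for cat, grp in df.groupby('Category'):
        vals = np.sort(grp['FltVal'].to_numpy())
        # values in (low, high] = (# vals <= high) - (# vals <= low)
        out[f'Cat_{cat}'] = (np.searchsorted(vals, high, side='right')
                             - np.searchsorted(vals, low, side='right'))
    out['Count'] = out.filter(like='Cat_').sum(axis=1)

    Sorting is O(n log n) once per category and each bin then costs two binary searches, so this should stay fast at millions of rows. Note that both approaches start the lower edges at the data minimum, so the desired (0.00, 0.30] row is absent; start the range at min - 0.01 (or at 0) to include it.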