I have a table of float values and strings such as the following:
| FltVal | Category |
|---|---|
| 0.01 | A |
| 0.02 | A |
| 0.05 | B |
| 0.31 | B |
| 0.36 | A |
| 0.37 | C |
| 0.41 | B |
I would like to produce a new dataframe that bins the float values into bins of size 0.3 (so each bin is (x, x + 0.3]) that overlap at 0.01 increments. Additionally, I'd like a column tallying the number of records falling into each bin, as well as a column per 'Category' tallying that category's count for the bin.
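To make the overlap concrete, here is a quick sanity check (just an illustration): a value v should land in every bin (x, x + 0.3] with v - 0.3 <= x < v, i.e. up to 30 bins at 0.01 increments.

import numpy as np

v = 0.31
# bins (x, x + 0.3] that contain v; rounding guards against float drift from np.arange
member_bins = [(round(x, 2), round(x + 0.3, 2))
               for x in np.arange(0.00, 0.12, 0.01)
               if round(x, 2) < v <= round(x + 0.3, 2)]
# -> [(0.01, 0.31), (0.02, 0.32), ..., (0.11, 0.41)]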
I first tried to get a basic count going with groupby and pd.cut (before trying to find counts for each Category):
import pandas as pd
import numpy as np

floats = pd.Series([0.01, 0.02, 0.05, 0.31, 0.36, 0.37, 0.41])
categories = pd.Series(['A', 'A', 'B', 'B', 'A', 'C', 'B'])
data = {"FltVal": floats, "Category": categories}
df = pd.concat(data, axis=1)
grouped_vals = df.groupby(pd.cut(df['FltVal'], np.arange(df['FltVal'].min(), df['FltVal'].max(), 0.3))).count()
Output:
FltVal Category
FltVal
(0.01, 0.31] 3 3
I think the problem there was that np.arange(0.01, 0.41, 0.3) only yields the two edges 0.01 and 0.31, so pd.cut had just a single bin to consider, never mind overlapping ones.
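A quick look at the edges seems to confirm this:

np.arange(df['FltVal'].min(), df['FltVal'].max(), 0.3)
# array([0.01, 0.31]) -> pd.cut builds only the single interval (0.01, 0.31]
# 0.01 (left-open), 0.36, 0.37 and 0.41 all fall outside and are dropped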
Then, with the idea that I could just throw out non-sensible ranges (like negative ones) afterward, I tried the following:
FltVal_range = np.arange(df['FltVal'].min(), df['FltVal'].max(), 0.01)
FltVal_range_from = FltVal_range - 0.3
FltVal_range_to = FltVal_range
FltVal_intervals = pd.IntervalIndex.from_arrays(FltVal_range_from, FltVal_range_to)
binned_df = df.groupby(pd.cut(df['FltVal'], FltVal_intervals))
But this raised a ValueError:
Overlapping IntervalIndex is not accepted.
If pd.cut doesn't accept overlapping bins, I guess this approach is a non-starter.
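For what it's worth, the IntervalIndex itself confirms why:

FltVal_intervals.is_overlapping
# True -> hence pd.cut's "Overlapping IntervalIndex is not accepted"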
Regardless, the desired output is the following:
| FloatBin | Count | Cat_A | Cat_B | Cat_C |
|---|---|---|---|---|
| 0.00, 0.30 | 3 | 2 | 1 | 0 |
| 0.01, 0.31 | 3 | 1 | 2 | 0 |
| 0.02, 0.32 | 2 | 0 | 2 | 0 |
| 0.03, 0.33 | 2 | 0 | 2 | 0 |
| 0.04, 0.34 | 2 | 0 | 2 | 0 |
| 0.05, 0.35 | 1 | 0 | 1 | 0 |
| 0.06, 0.36 | 2 | 1 | 1 | 0 |
| 0.07, 0.37 | 3 | 1 | 1 | 1 |
| 0.08, 0.38 | 3 | 1 | 1 | 1 |
| 0.09, 0.39 | 3 | 1 | 1 | 1 |
| 0.10, 0.40 | 3 | 1 | 1 | 1 |
| 0.11, 0.41 | 4 | 1 | 2 | 1 |
An efficient approach is necessary, as the actual dataframe will have hundreds of thousands to millions of rows.
I'm thinking of something similar to this answer, but suitable for finding counts for all my categories.
All help is much appreciated!
One option using janitor's conditional_join:
import janitor
import numpy as np

FltVal_range = np.arange(df['FltVal'].min(), df['FltVal'].max(), 0.01)

# per-FltVal counts of each Category, plus an overall 'Count' margin column
tmp = (pd.crosstab(df['FltVal'], df['Category'],
                   margins=True, margins_name='Count')
       .drop('Count')  # drop the margin row; keep the margin column
       )

# pair each FltVal with every (low, high] window containing it, then aggregate
out = (pd.DataFrame({'low': FltVal_range, 'high': FltVal_range + 0.3})
       .conditional_join(tmp.reset_index(),
                         ('low', 'FltVal', '<'),
                         ('high', 'FltVal', '>='))
       .groupby(['low', 'high'])[list(tmp)].sum()
       )
Output:
A B C Count
low high
0.01 0.31 1 2 0 3
0.02 0.32 0 2 0 2
0.03 0.33 0 2 0 2
0.04 0.34 0 2 0 2
0.05 0.35 0 1 0 1
0.06 0.36 1 1 0 2
0.07 0.37 1 1 1 3
0.08 0.38 1 1 1 3
0.09 0.39 1 1 1 3
0.10 0.40 1 1 1 3
0.11 0.41 1 2 1 4
0.12 0.42 1 2 1 4
0.13 0.43 1 2 1 4
0.14 0.44 1 2 1 4
0.15 0.45 1 2 1 4
...
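For the record, the mechanics: pd.crosstab produces one row per distinct FltVal with its per-Category counts plus the 'Count' margin column, conditional_join then pairs each value with every (low, high] window containing it, and the final groupby sums those rows per window. If pulling in janitor isn't an option, a plain numpy/pandas sketch of the same aggregation via searchsorted on sorted values should also scale well (the bin_counts helper and the 2-decimal rounding are my own choices here, not part of the answer above):

import numpy as np
import pandas as pd

lows = np.round(np.arange(df['FltVal'].min(), df['FltVal'].max(), 0.01), 2)
highs = np.round(lows + 0.3, 2)

def bin_counts(values, lows, highs):
    # count of values in each half-open window (low, high]:
    # (# of values <= high) minus (# of values <= low)
    v = np.sort(np.asarray(values))
    return np.searchsorted(v, highs, side='right') - np.searchsorted(v, lows, side='right')

out2 = pd.DataFrame({'low': lows, 'high': highs})
out2['Count'] = bin_counts(df['FltVal'], lows, highs)
for cat, grp in df.groupby('Category'):
    out2[f'Cat_{cat}'] = bin_counts(grp['FltVal'], lows, highs)

Only the initial sorts touch the full data (O(n log n)); after that each window costs a couple of binary searches per category, so this should stay fast even at millions of rows.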