pythonpandasdataframerolling-computation

Rolling count/tally between a list of values in a creative way


I have 2 DataFrames. The first is has a List of Dicts of values. The second has no data, but has a list of columns that are integers.

data1 = [{'Start': 51, 'End': 55},{'Start':24, 'End':37},{'Start':89,'End':122},{'Start':44, 'End':31}, {'Start':77, 'End':50}, {'Start':10, 'End':9}]
dfm1 = pd.DataFrame.from_dict(data1)

data2 = [-40, -30, -20, -10, 0, 10, 20, 30, 40]
dfm2 = pd.DataFrame([], columns=data2)

Let's assume that data1 has 500 data points. My goal is that I want a tally of ranges of the differences in dfm1 based upon a variable sliding window size and I want that tally to exist in dfm2.

I want to create a sliding window of calculation of the difference between data1[index + window] - data1[index]. Then, based upon that difference between the values at the 2 indexes, I want to add a tally to dfm2 if it is less than or equal to the dfm1 column value, but not less than the column-1 value. We would assume that, in my example, column -40 would never ever have a tally greater than 0.

My desired date output, for the dfm1 values I provided, and we are tallying Start values, and a window of size 2 would be (for dfm2):

[0, 1, 1, 0, 0, 0, 1, 0, 1]

This would be performing 51-89 = -38, 24-44 = -20, 89-77=12, 44-10=34 for a window size of 2. A window size of 3 would be 51-44, 24-77, and 89-10...

The cheap and easy way is obviously for me to iterate and create tallies. But I know that DataFrame has some methods like rolling which may work really well for this.

What if I wanted to do this same rolling tally, but rather than subtracting Start from Start, what if I wanted to subtract an index's Start from its same End, and then perform that rolling tally based upon the difference from window_size away?

What if I didn't preset the column names in dfm2, and I let them be auto added as new tallies are discovered, say in ranges of 10?


Solution

  • With rolling and the first part of the question:

    def tally_differences(dfm1, dfm2, window_size):
        diff_df = dfm1['Start'].diff(window_size)
    
        bins = dfm2.columns.tolist() + [float('inf')]
        labels = dfm2.columns.tolist()
        tallies = pd.cut(diff_df, bins=bins, labels=labels, right=False).value_counts()
    
        dfm2_updated = dfm2.copy()
        dfm2_updated.loc[0] = tallies
    
        return dfm2_updated
    
    
    window_size = 2
    dfm2_updated = tally_differences(dfm1, dfm2, window_size)
    print(dfm2_updated)
    

    Output:

       -40  -30  -20  -10   0    10   20   30   40
    0    1    0    1    0    0    0    1    1    0
    

    For your super boss mode, I don't know if I understood it well or not ( I hope someone edits your question text in a proper way ), but:

    def tally_differences_easter_egg(dfm1, window_size):
        diff_df = dfm1['End'] - dfm1['Start']
    
        bins = pd.interval_range(start=diff_df.min()//10*10, end=diff_df.max()//10*10+10, freq=10, closed='left')
        tallies = pd.cut(diff_df.rolling(window_size).sum(), bins=bins).dropna().astype(str).value_counts()
    
        dfm2 = pd.DataFrame(columns=tallies.index)
        dfm2.loc[0] = tallies
    
        return dfm2
    
    window_size = 2
    dfm2_easter_egg = tally_differences_easter_egg(dfm1, window_size)
    print(dfm2_easter_egg)
    

    This will gave us:

       [10, 20)  [20, 30)  [-30, -20)
    0         1         1           1