I have 2 DataFrames. The first is has a List of Dicts of values. The second has no data, but has a list of columns that are integers.
data1 = [{'Start': 51, 'End': 55},{'Start':24, 'End':37},{'Start':89,'End':122},{'Start':44, 'End':31}, {'Start':77, 'End':50}, {'Start':10, 'End':9}]
dfm1 = pd.DataFrame.from_dict(data1)
data2 = [-40, -30, -20, -10, 0, 10, 20, 30, 40]
dfm2 = pd.DataFrame([], columns=data2)
Let's assume that data1 has 500 data points. My goal is that I want a tally of ranges of the differences in dfm1
based upon a variable sliding window size and I want that tally to exist in dfm2
.
I want to create a sliding window of calculation of the difference between data1[index + window] - data1[index]
. Then, based upon that difference between the values at the 2 indexes, I want to add a tally to dfm2 if it is less than or equal to the dfm1
column value, but not less than the column-1 value. We would assume that, in my example, column -40
would never ever have a tally greater than 0.
My desired date output, for the dfm1
values I provided, and we are tallying Start values, and a window of size 2 would be (for dfm2
):
[0, 1, 1, 0, 0, 0, 1, 0, 1]
This would be performing 51-89 = -38, 24-44 = -20, 89-77=12, 44-10=34 for a window size of 2. A window size of 3 would be 51-44, 24-77, and 89-10...
The cheap and easy way is obviously for me to iterate and create tallies. But I know that DataFrame
has some methods like rolling
which may work really well for this.
What if I wanted to do this same rolling tally, but rather than subtracting Start from Start, what if I wanted to subtract an index's Start from its same End, and then perform that rolling tally based upon the difference from window_size
away?
What if I didn't preset the column names in dfm2
, and I let them be auto added as new tallies are discovered, say in ranges of 10?
With rolling
and the first part of the question:
def tally_differences(dfm1, dfm2, window_size):
diff_df = dfm1['Start'].diff(window_size)
bins = dfm2.columns.tolist() + [float('inf')]
labels = dfm2.columns.tolist()
tallies = pd.cut(diff_df, bins=bins, labels=labels, right=False).value_counts()
dfm2_updated = dfm2.copy()
dfm2_updated.loc[0] = tallies
return dfm2_updated
window_size = 2
dfm2_updated = tally_differences(dfm1, dfm2, window_size)
print(dfm2_updated)
Output:
-40 -30 -20 -10 0 10 20 30 40
0 1 0 1 0 0 0 1 1 0
For your super boss mode, I don't know if I understood it well or not ( I hope someone edits your question text in a proper way ), but:
def tally_differences_easter_egg(dfm1, window_size):
diff_df = dfm1['End'] - dfm1['Start']
bins = pd.interval_range(start=diff_df.min()//10*10, end=diff_df.max()//10*10+10, freq=10, closed='left')
tallies = pd.cut(diff_df.rolling(window_size).sum(), bins=bins).dropna().astype(str).value_counts()
dfm2 = pd.DataFrame(columns=tallies.index)
dfm2.loc[0] = tallies
return dfm2
window_size = 2
dfm2_easter_egg = tally_differences_easter_egg(dfm1, window_size)
print(dfm2_easter_egg)
This will gave us:
[10, 20) [20, 30) [-30, -20)
0 1 1 1