pythonpandasdataframerowwise

Compare rows pairwise and calculate difference


I have this dataframe. I want to compare the date then calculate time differences in that date. For example, date[0] == date[1] is True, then time[0] - time[1]. If there are 4 date entries like 2020-05-19 then do (time[2] - time[3]) + (time[4] - time[5]). But if there is an odd number of date entries, for example, there are only 3 entries for 2022-05-12, then do nothing.

        Date    Time
0   2022-05-20  17:07:00
1   2022-05-20  09:14:00
2   2022-05-19  18:56:00
3   2022-05-19  13:53:00
4   2022-05-19  13:52:00
5   2022-05-19  09:34:00
6   2022-05-18  18:25:00
7   2022-05-18  12:53:00
8   2022-05-18  12:02:00
9   2022-05-18  10:01:00
10  2022-05-17  18:06:00
11  2022-05-17  12:23:00
12  2022-05-17  12:11:00
13  2022-05-17  09:57:00
14  2022-05-16  18:44:00
15  2022-05-16  09:57:00
16  2022-05-13  18:21:00
17  2022-05-13  12:42:00
18  2022-05-13  12:05:00
19  2022-05-13  10:02:00
20  2022-05-12  18:13:00
21  2022-05-12  13:06:00
22  2022-05-12  09:45:00
23  2022-05-11  18:04:00
24  2022-05-11  12:23:00
25  2022-05-11  11:59:00
26  2022-05-11  10:01:00
27  2022-05-10  17:33:00
28  2022-05-10  12:29:00

I'm trying on a nested for loop but could not figure out the indexing changes when odd entries happen.

for i in range(len(df.Date)-1):
    for j in range(1,len(df.Date),2):
        if df.Date[i] ==  df.Date[j]:
            print(df.Date[i], df.Date[j],df.Time[i],df.Time[j])  
            i += 2  
        else:
            print(i,j)
            print(df.Date[i], df.Date[j],df.Time[i],df.Time[j])
            i = j
            j = j+1
            print(i,j)
    break

The result which is incorrect and I do not know how to fix.

Full expected output:

Output

Followed the comment and did df.to_dict() and got:

{'Date': {0: Timestamp('2022-05-20 00:00:00'), 1: Timestamp('2022-05-20 00:00:00'), 2: Timestamp('2022-05-19 00:00:00'), 3: Timestamp('2022-05-19 00:00:00'), 4: Timestamp('2022-05-19 00:00:00'), 5: Timestamp('2022-05-19 00:00:00'), 6: Timestamp('2022-05-18 00:00:00'), 7: Timestamp('2022-05-18 00:00:00'), 8: Timestamp('2022-05-18 00:00:00'), 9: Timestamp('2022-05-18 00:00:00'), 10: Timestamp('2022-05-17 00:00:00'), 11: Timestamp('2022-05-17 00:00:00'), 12: Timestamp('2022-05-17 00:00:00'), 13: Timestamp('2022-05-17 00:00:00'), 14: Timestamp('2022-05-16 00:00:00'), 15: Timestamp('2022-05-16 00:00:00'), 16: Timestamp('2022-05-13 00:00:00'), 17: Timestamp('2022-05-13 00:00:00'), 18: Timestamp('2022-05-13 00:00:00'), 19: Timestamp('2022-05-13 00:00:00'), 20: Timestamp('2022-05-12 00:00:00'), 21: Timestamp('2022-05-12 00:00:00'), 22: Timestamp('2022-05-12 00:00:00'), 23: Timestamp('2022-05-11 00:00:00'), 24: Timestamp('2022-05-11 00:00:00'), 25: Timestamp('2022-05-11 00:00:00'), 26: Timestamp('2022-05-11 00:00:00'), 27: Timestamp('2022-05-10 00:00:00'), 28: Timestamp('2022-05-10 00:00:00')}, 'Time': {0: datetime.time(17, 7), 1: datetime.time(9, 14), 2: datetime.time(18, 56), 3: datetime.time(13, 53), 4: datetime.time(13, 52), 5: datetime.time(9, 34), 6: datetime.time(18, 25), 7: datetime.time(12, 53), 8: datetime.time(12, 2), 9: datetime.time(10, 1), 10: datetime.time(18, 6), 11: datetime.time(12, 23), 12: datetime.time(12, 11), 13: datetime.time(9, 57), 14: datetime.time(18, 44), 15: datetime.time(9, 57), 16: datetime.time(18, 21), 17: datetime.time(12, 42), 18: datetime.time(12, 5), 19: datetime.time(10, 2), 20: datetime.time(18, 13), 21: datetime.time(13, 6), 22: datetime.time(9, 45), 23: datetime.time(18, 4), 24: datetime.time(12, 23), 25: datetime.time(11, 59), 26: datetime.time(10, 1), 27: datetime.time(17, 33), 28: datetime.time(12, 29)}}

Solution

  • Try this.

    # data
    df = pd.DataFrame({'Date': [Timestamp('2022-05-20 00:00:00'), Timestamp('2022-05-20 00:00:00'), Timestamp('2022-05-19 00:00:00'), Timestamp('2022-05-19 00:00:00'), Timestamp('2022-05-19 00:00:00'), Timestamp('2022-05-19 00:00:00'), Timestamp('2022-05-18 00:00:00'), Timestamp('2022-05-18 00:00:00'), Timestamp('2022-05-18 00:00:00'), Timestamp('2022-05-18 00:00:00'), Timestamp('2022-05-17 00:00:00'), Timestamp('2022-05-17 00:00:00'), Timestamp('2022-05-17 00:00:00'), Timestamp('2022-05-17 00:00:00'), Timestamp('2022-05-16 00:00:00'), Timestamp('2022-05-16 00:00:00'), Timestamp('2022-05-13 00:00:00'), Timestamp('2022-05-13 00:00:00'), Timestamp('2022-05-13 00:00:00'), Timestamp('2022-05-13 00:00:00'), Timestamp('2022-05-12 00:00:00'), Timestamp('2022-05-12 00:00:00'), Timestamp('2022-05-12 00:00:00'), Timestamp('2022-05-11 00:00:00'), Timestamp('2022-05-11 00:00:00'), Timestamp('2022-05-11 00:00:00'), Timestamp('2022-05-11 00:00:00'), Timestamp('2022-05-10 00:00:00'), Timestamp('2022-05-10 00:00:00'), Timestamp('2022-05-10 00:00:00'), Timestamp('2022-05-10 00:00:00'), Timestamp('2022-05-09 00:00:00'), Timestamp('2022-05-09 00:00:00'), Timestamp('2022-05-09 00:00:00'), Timestamp('2022-05-09 00:00:00'), Timestamp('2022-05-09 00:00:00'), Timestamp('2022-05-09 00:00:00'), Timestamp('2022-05-06 00:00:00'), Timestamp('2022-05-06 00:00:00'), Timestamp('2022-05-06 00:00:00'), Timestamp('2022-05-06 00:00:00'), Timestamp('2022-05-05 00:00:00'), Timestamp('2022-05-05 00:00:00'), Timestamp('2022-05-05 00:00:00'), Timestamp('2022-05-05 00:00:00'), Timestamp('2022-05-04 00:00:00'), Timestamp('2022-05-04 00:00:00'), Timestamp('2022-05-04 00:00:00'), Timestamp('2022-05-04 00:00:00'), Timestamp('2022-05-03 00:00:00'), Timestamp('2022-05-03 00:00:00'), Timestamp('2022-05-03 00:00:00'), Timestamp('2022-05-03 00:00:00'), Timestamp('2022-05-02 00:00:00'), Timestamp('2022-05-02 00:00:00'), Timestamp('2022-05-02 00:00:00'), Timestamp('2022-05-02 00:00:00'), Timestamp('2022-04-29 00:00:00'), Timestamp('2022-04-29 00:00:00'), Timestamp('2022-04-29 00:00:00'), Timestamp('2022-04-29 00:00:00'), Timestamp('2022-04-28 00:00:00'), Timestamp('2022-04-28 00:00:00'), Timestamp('2022-04-28 00:00:00'), Timestamp('2022-04-28 00:00:00'), Timestamp('2022-04-27 00:00:00'), Timestamp('2022-04-27 00:00:00'), Timestamp('2022-04-27 00:00:00'), Timestamp('2022-04-27 00:00:00'), Timestamp('2022-04-26 00:00:00'), Timestamp('2022-04-26 00:00:00'), Timestamp('2022-04-26 00:00:00'), Timestamp('2022-04-26 00:00:00'), Timestamp('2022-04-26 00:00:00'), Timestamp('2022-04-26 00:00:00'), Timestamp('2022-04-25 00:00:00'), Timestamp('2022-04-25 00:00:00'), Timestamp('2022-04-25 00:00:00'), Timestamp('2022-04-22 00:00:00'), Timestamp('2022-04-22 00:00:00'), Timestamp('2022-04-22 00:00:00'), Timestamp('2022-04-22 00:00:00'), Timestamp('2022-04-20 00:00:00'), Timestamp('2022-04-20 00:00:00'), Timestamp('2022-04-20 00:00:00'), Timestamp('2022-04-20 00:00:00')], 'Time': [datetime.time(17, 7), datetime.time(9, 14), datetime.time(18, 56), datetime.time(13, 53), datetime.time(13, 52), datetime.time(9, 34), datetime.time(18, 25), datetime.time(12, 53), datetime.time(12, 2), datetime.time(10, 1), datetime.time(18, 6), datetime.time(12, 23), datetime.time(12, 11), datetime.time(9, 57), datetime.time(18, 44), datetime.time(9, 57), datetime.time(18, 21), datetime.time(12, 42), datetime.time(12, 5), datetime.time(10, 2), datetime.time(18, 13), datetime.time(13, 6), datetime.time(9, 45), datetime.time(18, 4), datetime.time(12, 23), datetime.time(11, 59), datetime.time(10, 1), datetime.time(17, 33), datetime.time(12, 29), datetime.time(12, 2), datetime.time(9, 59), datetime.time(18, 14), datetime.time(15, 3), datetime.time(14, 33), datetime.time(12, 12), datetime.time(11, 58), datetime.time(9, 46), datetime.time(17, 51), datetime.time(11, 37), datetime.time(11, 14), datetime.time(9, 34), datetime.time(18, 4), datetime.time(12, 33), datetime.time(12, 13), datetime.time(9, 42), datetime.time(18, 27), datetime.time(12, 27), datetime.time(12, 14), datetime.time(9, 58), datetime.time(18, 2), datetime.time(13, 0), datetime.time(11, 40), datetime.time(9, 5), datetime.time(18, 39), datetime.time(12, 52), datetime.time(12, 27), datetime.time(9, 59), datetime.time(17, 26), datetime.time(12, 9), datetime.time(12, 8), datetime.time(9, 49), datetime.time(18, 24), datetime.time(13, 18), datetime.time(12, 4), datetime.time(9, 57), datetime.time(18, 16), datetime.time(12, 37), datetime.time(12, 8), datetime.time(9, 42), datetime.time(18, 25), datetime.time(14, 13), datetime.time(13, 54), datetime.time(12, 55), datetime.time(12, 54), datetime.time(9, 52), datetime.time(17, 54), datetime.time(9, 53), datetime.time(9, 53), datetime.time(18, 12), datetime.time(14, 7), datetime.time(12, 15), datetime.time(9, 53), datetime.time(18, 20), datetime.time(12, 53), datetime.time(12, 12), datetime.time(9, 50)]})
    
    def f(g):
        # remove the last value if there are odd number of Times
        x = g if len(g)%2==0 else g[1:]
        # find difference between every 2 time values and sum the result
        return np.diff(x.values.reshape(-1,2), 1).reshape(-1).sum()
    
    # groupby Date and apply f to Time
    # then strip days from result
    df['new_col'] = pd.to_timedelta(df.Time[::-1].astype(str)).groupby(df.Date).transform(f).astype(str).str.split().str[-1]
    # remove duplicated values
    df['new_col'] = df.new_col.mask(df.new_col.duplicated(), '')
    df
    

    enter image description here

    The key idea is to reshape Time for each Date into a 2 column array and use np.diff horizontally to find the difference (this is inside function f). To reshape an array into a 2 column array, the number of rows must be divisible by 2, so if it's not, I cut away the last Time (even though it's cutting away the first element in the code, it's actually the last because the array is reversed.) I used Time[::-1] to reverse Time. That's because np.diff works like time[1]-time[0] etc. Since the aim is the reverse, I reversed Time.