python-3.xpandastime-series

Flag repeating entries in pandas time series


I have a data frame that takes this form (but is several millions of rows long):

import pandas as pd     
dict = {'id':["A", "A", "A", "B", "B", "B", "C", "C", "C", "D", "D", "D"], 
    'year': ["2000", "2001", "2002", "2000", "2001", "2003", "1999", "2000", "2001", "2000", "2000", "2001"],
    'vacation':["France", "Morocco", "Morocco", "Germany", "Germany", "Germany", "Japan", "Australia", "Japan", "Canada", "Mexico", "China"],
    'new':[1, 1, 0, 1, 0, 0, 1, 1, 1, 1, 1, 1]} 
df = pd.DataFrame(dict)

A   2000    France
A   2001    Morocco
A   2002    Morocco
B   2000    Germany
B   2001    Germany
B   2003    Germany
C   1999    Japan
C   2000    Australia
C   2001    Japan
D   2000    Canada       
D   2000    Mexico       
D   2001    China        

For each person in each year, the holiday destination(s) is/are given; there can be multiple holiday destinations in a given year. I would like to flag the rows when a participant goes to holiday to a destination to which they had not gone the year before (i.e., the destination is new). In the case above, the output would be:

id  year    vacation    new
A   2000    France       1
A   2001    Morocco      1
A   2002    Morocco      0
B   2001    Germany      1
B   2002    Germany      0
B   2003    Germany      0
C   1999    Japan        1
C   1999    Australia    1
C   2000    Japan        1
D   2000    Canada       1
D   2000    Mexico       1
D   2001    China        1

For A, B, C, and D, the first holiday destination in our data frame is flagged as new. When A goes to Morocco two years in a row, the 2nd occurrence is not flagged, because A went there the year before. When B goes to Germany 3 times in a row, the 2nd and 3rd occurrences are not flagged. When person C goes to Japan twice, all of the occurrences are flagged, because they did not go to Japan two years in a row. D goes to 3 different destinations (albeit to 2 destinations in 2000) and all of them are flagged.

I have been trying to solve it myself, but have not been able to break away from iterations, which are too computationally intensive for such a massive dataset.

I'd appreciate any input; thanks.


Solution

  • What we are doing is grouping by id & vacation and ensuring that year is not equal to the year above, or we can selecting the first instance of that combination.

    df["new_2"] = (
        df.groupby(["id", "vacation"])["id", "year"]
        .apply(lambda x: x.ne(x.shift()))
        .all(axis=1)
        .add(0)
    )
    

    print(df)
      id  year   vacation  new_2
    0  A  2000     France    1
    1  A  2001        USA    1
    2  A  2002     France    0
    3  B  2001    Germany    1
    4  B  2002    Germany    0
    5  B  2003    Germany    0
    6  C  1999      Japan    1
    7  C  2000  Australia    1
    8  C  2001     France    1