pythonpandaspandas-groupbygaps-in-data

How to find out the gaps in python pandas dataframe column (date format)?


I have one pandas Dataframe like below:

name,year
AAA,2015-11-02 22:00:00
AAA,2015-11-02 23:00:00
AAA,2015-11-03 00:00:00
AAA,2015-11-03 01:00:00
AAA,2015-11-03 02:00:00
AAA,2015-11-03 05:00:00
ZZZ,2015-09-01 00:00:00
ZZZ,2015-11-01 01:00:00
ZZZ,2015-11-01 07:00:00
ZZZ,2015-11-01 08:00:00
ZZZ,2015-11-01 09:00:00
ZZZ,2015-11-01 12:00:00

I want to find out the gaps available in the year column of the dataframe with respect to particular name. For example,

  1. AAA name has gap ahead "2015-11-03 02:00:00" date with 2 hours.
  2. ZZZ name has gap ahead "2015-11-01 01:00:00" date with 5 hours.
  3. ZZZ name has gap ahead "2015-11-01 09:00:00" date with 2 hours.

I want to generate two csv files with contents:

CSV-1:

name,year
AAA,2015-11-02 22:00:00,0
AAA,2015-11-02 23:00:00,0
AAA,2015-11-03 00:00:00,0
AAA,2015-11-03 01:00:00,0
AAA,2015-11-03 02:00:00,2
AAA,2015-11-03 05:00:00,0
ZZZ,2015-09-01 00:00:00,0
ZZZ,2015-11-01 01:00:00,5
ZZZ,2015-11-01 07:00:00,0
ZZZ,2015-11-01 08:00:00,0
ZZZ,2015-11-01 09:00:00,2
ZZZ,2015-11-01 12:00:00,0

CSV-2:

name,prev_year,next_year,gaps
AAA,2015-11-03 02:00:00,2015-11-03 05:00:00,2015-11-03 03:00:00
AAA,2015-11-03 02:00:00,2015-11-03 05:00:00,2015-11-03 04:00:00
ZZZ,2015-11-01 01:00:00,2015-11-01 07:00:00,2015-11-01 02:00:00
ZZZ,2015-11-01 01:00:00,2015-11-01 07:00:00,2015-11-01 03:00:00
ZZZ,2015-11-01 01:00:00,2015-11-01 07:00:00,2015-11-01 04:00:00
ZZZ,2015-11-01 01:00:00,2015-11-01 07:00:00,2015-11-01 05:00:00
ZZZ,2015-11-01 01:00:00,2015-11-01 07:00:00,2015-11-01 06:00:00
ZZZ,2015-11-01 09:00:00,2015-11-01 12:00:00,2015-11-01 10:00:00
ZZZ,2015-11-01 09:00:00,2015-11-01 12:00:00,2015-11-01 11:00:00

I tried like below:

df['year'] = pd.to_datetime(df['year'], format='%Y-%m-%d %H:%M:%S')
mask = df.groupby("name").year.diff() > pd.Timedelta('0 days 01:00:00')

Solution

  • To get your gap into your dataframe you need to reassign the mask that you generate. To get this in terms of total hours, you can simply divide by 1 hour:

    df['year'] = pd.to_datetime(df['year'], format='%Y-%m-%d %H:%M:%S')
    df['Gap'] = (df.groupby("name").year.diff() / pd.to_timedelta('1 hour')).fillna(0)
    

    This gives us the following dataframe:

       name                year     Gap
    0   AAA 2015-11-02 22:00:00     0.0
    1   AAA 2015-11-02 23:00:00     1.0
    2   AAA 2015-11-03 00:00:00     1.0
    3   AAA 2015-11-03 01:00:00     1.0
    4   AAA 2015-11-03 02:00:00     1.0
    5   AAA 2015-11-03 05:00:00     3.0
    6   ZZZ 2015-09-01 00:00:00     0.0
    7   ZZZ 2015-11-01 07:00:00     6.0
    8   ZZZ 2015-11-01 08:00:00     1.0
    9   ZZZ 2015-11-01 09:00:00     1.0
    10  ZZZ 2015-11-01 12:00:00     3.0
    

    To get the gap next to its start time and consistent with the way you want it for "csv-1", we simply shift this up one row and subtract one before we fill the na values:

    df['Gap'] = ((df.groupby("name").year.diff() / pd.to_timedelta('1 hour')).shift(-1) - 1).fillna(0)
    

    This gets:

       name                year  Gap
    0   AAA 2015-11-02 22:00:00  0.0
    1   AAA 2015-11-02 23:00:00  0.0
    2   AAA 2015-11-03 00:00:00  0.0
    3   AAA 2015-11-03 01:00:00  0.0
    4   AAA 2015-11-03 02:00:00  2.0
    5   AAA 2015-11-03 05:00:00  0.0
    6   ZZZ 2015-11-01 01:00:00  5.0
    7   ZZZ 2015-11-01 07:00:00  0.0
    8   ZZZ 2015-11-01 08:00:00  0.0
    9   ZZZ 2015-11-01 09:00:00  2.0
    10  ZZZ 2015-11-01 12:00:00  0.0
    

    In order to get your second csv, we can do the following:

    df['prev_year'] = df['year']
    df['next_year'] = df.groupby('name')['year'].shift(-1)
    
    df.set_index('year', inplace=True)
    df = df.groupby('name', as_index=False)\
           .resample(rule='1H')\
           .ffill()\
           .reset_index()
    
    gaps = df[df['year'] != df['prev_year']][['name', 'prev_year', 'next_year', 'year']]
    
    gaps.rename({'year': 'gaps'}, index='columns', inplace=True)
    

    First we set up the "before" and "after" columns. Then by changing the index to 'year', we can use the .resample() method to fill in all of our missing hours. by using ffill() when we resample, we copy down the last available record into all the new rows we add. We know that when 'prev_year' != 'year', we're on a row that didn't previously exist in the frame, and thus is one of the gaps, so we filter to just those rows, select the columns we need and rename them. This gives:

       name           prev_year           next_year                year
    5   AAA 2015-11-03 02:00:00 2015-11-03 05:00:00 2015-11-03 03:00:00
    6   AAA 2015-11-03 02:00:00 2015-11-03 05:00:00 2015-11-03 04:00:00
    9   ZZZ 2015-11-01 01:00:00 2015-11-01 07:00:00 2015-11-01 02:00:00
    10  ZZZ 2015-11-01 01:00:00 2015-11-01 07:00:00 2015-11-01 03:00:00
    11  ZZZ 2015-11-01 01:00:00 2015-11-01 07:00:00 2015-11-01 04:00:00
    12  ZZZ 2015-11-01 01:00:00 2015-11-01 07:00:00 2015-11-01 05:00:00
    13  ZZZ 2015-11-01 01:00:00 2015-11-01 07:00:00 2015-11-01 06:00:00
    17  ZZZ 2015-11-01 09:00:00 2015-11-01 12:00:00 2015-11-01 10:00:00
    18  ZZZ 2015-11-01 09:00:00 2015-11-01 12:00:00 2015-11-01 11:00:00
    

    To summarise, your script could look as follows:

    df['year'] = pd.to_datetime(df['year'], format='%Y-%m-%d %H:%M:%S')
    df['Gap'] = ((df.groupby("name").year.diff() / pd.to_timedelta('1 hour')).shift(-1) - 1).fillna(0)
    
    df.to_csv('csv-1.csv', index=False)
    
    df['prev_year'] = df['year']
    df['next_year'] = df.groupby('name')['year'].shift(-1)
    
    df.set_index('year', inplace=True)
    df = df.groupby('name', as_index=False)\
           .resample(rule='1H')\
           .ffill()\
           .reset_index()
    
    gaps = df[df['year'] != df['prev_year']][['name', 'prev_year', 'next_year', 'year']]
    
    gaps.rename({'year': 'gaps'}, index='columns', inplace=True)
    
    gaps.to_csv('csv-2.csv', index=False)