pythonpandasdataframedata-sciencedata-munging

Pandas dataframe how to add a columns based on rank in a dates vector


I have a vector of dates

L = [Timestamp('2018-07-15 00:00:00'),
  Timestamp('2019-07-15 00:00:00')]

and a dataframe with a date column:

df = c1 c2 Date
     1. 2. 2018-07-13 16:00:00
     1. 7. 2018-07-15 16:00:00
     3. 7. 2018-07-15 16:50:00
     4. 7. 2018-07-15 19:50:00
     2. 2. 2018-07-16 16:00:00
     5. 1. 2020-10-10 16:00:00
     8. 4. 2018-06-13 16:00:00   
     5. 4. 2021-12-13 16:00:00
     2. 9. 2019-01-01 16:00:00
     2. 6. 2019-01-01 17:00:00

I want to add a column that is the rank in the dates vector, and delete rows from the same date (regardless of hour/minutes). So I will have:

df = c1 c2 d.                  new
     1. 2. 2018-07-13 16:00:00  0
     1. 4. 2018-06-13 16:00:00  0
     2. 2. 2018-07-16 16:00:00  1
     5. 1. 2020-10-10 16:00:00  2
     8. 4. 2018-06-13 16:00:00  0
     5. 4. 2021-12-13 16:00:00  2
     2. 9. 2019-01-01 16:00:00  1
     2. 6. 2019-01-01 17:00:00. 1

What is the best way to do it?


Solution

  • You can compare 2 numpy arrays with broadcasting, for count Trues use sum, last filter out matched dates without times with Series.dt.normalize and Series.isin with invert mask by ~:

    L = [pd.Timestamp('2018-07-15 00:00:00'),pd.Timestamp('2019-07-15 00:00:00')]
    
    df['Date'] = pd.to_datetime(df['Date'])
    
    df['new'] = (df['Date'].to_numpy()[:, None] > pd.to_datetime(L).to_numpy()).sum(axis=1)
    
    df1 = df[~df['Date'].dt.normalize().isin(L)]
    print (df1)
        c1   c2                Date  new
    0  1.0  2.0 2018-07-13 16:00:00    0
    4  2.0  2.0 2018-07-16 16:00:00    1
    5  5.0  1.0 2020-10-10 16:00:00    2
    6  8.0  4.0 2018-06-13 16:00:00    0
    7  5.0  4.0 2021-12-13 16:00:00    2
    8  2.0  9.0 2019-01-01 16:00:00    1
    9  2.0  6.0 2019-01-01 17:00:00    1