pythonpandasdataframedatetime64

Extracting the first day of month of a datetime type column in pandas


I have the following dataframe:

user_id    purchase_date 
  1        2015-01-23 14:05:21
  2        2015-02-05 05:07:30
  3        2015-02-18 17:08:51
  4        2015-03-21 17:07:30
  5        2015-03-11 18:32:56
  6        2015-03-03 11:02:30

and purchase_date is a datetime64[ns] column. I need to add a new column df[month] that contains first day of the month of the purchase date:

df['month']
2015-01-01
2015-02-01
2015-02-01
2015-03-01
2015-03-01
2015-03-01

I'm looking for something like DATE_FORMAT(purchase_date, "%Y-%m-01") m in SQL. I have tried the following code:

     df['month']=df['purchase_date'].apply(lambda x : x.replace(day=1))

It works somehow but returns: 2015-01-01 14:05:21.


Solution

  • Simpliest and fastest is convert to numpy array by to_numpy and then cast:

    df['month'] = df['purchase_date'].to_numpy().astype('datetime64[M]')
    print (df)
       user_id       purchase_date      month
    0        1 2015-01-23 14:05:21 2015-01-01
    1        2 2015-02-05 05:07:30 2015-02-01
    2        3 2015-02-18 17:08:51 2015-02-01
    3        4 2015-03-21 17:07:30 2015-03-01
    4        5 2015-03-11 18:32:56 2015-03-01
    5        6 2015-03-03 11:02:30 2015-03-01
    

    Another solution with floor and pd.offsets.MonthBegin(1) and add pd.offsets.MonthEnd(0) for correct ouput if first day of month:

    df['month'] = (df['purchase_date'].dt.floor('d') + 
                               pd.offsets.MonthEnd(0) - pd.offsets.MonthBegin(1))
    print (df)
       user_id       purchase_date      month
    0        1 2015-01-23 14:05:21 2015-01-01
    1        2 2015-02-05 05:07:30 2015-02-01
    2        3 2015-02-18 17:08:51 2015-02-01
    3        4 2015-03-21 17:07:30 2015-03-01
    4        5 2015-03-11 18:32:56 2015-03-01
    5        6 2015-03-03 11:02:30 2015-03-01
    

    df['month'] = ((df['purchase_date'] + pd.offsets.MonthEnd(0) - pd.offsets.MonthBegin(1))
                             .dt.floor('d'))
    print (df)
       user_id       purchase_date      month
    0        1 2015-01-23 14:05:21 2015-01-01
    1        2 2015-02-05 05:07:30 2015-02-01
    2        3 2015-02-18 17:08:51 2015-02-01
    3        4 2015-03-21 17:07:30 2015-03-01
    4        5 2015-03-11 18:32:56 2015-03-01
    5        6 2015-03-03 11:02:30 2015-03-01
    

    Last solution is create month period by to_period:

    df['month'] = df['purchase_date'].dt.to_period('M')
    print (df)
       user_id       purchase_date   month
    0        1 2015-01-23 14:05:21 2015-01
    1        2 2015-02-05 05:07:30 2015-02
    2        3 2015-02-18 17:08:51 2015-02
    3        4 2015-03-21 17:07:30 2015-03
    4        5 2015-03-11 18:32:56 2015-03
    5        6 2015-03-03 11:02:30 2015-03
    

    ... and then to datetimes by to_timestamp, but it is a bit slowier:

    df['month'] = df['purchase_date'].dt.to_period('M').dt.to_timestamp()
    print (df)
       user_id       purchase_date      month
    0        1 2015-01-23 14:05:21 2015-01-01
    1        2 2015-02-05 05:07:30 2015-02-01
    2        3 2015-02-18 17:08:51 2015-02-01
    3        4 2015-03-21 17:07:30 2015-03-01
    4        5 2015-03-11 18:32:56 2015-03-01
    5        6 2015-03-03 11:02:30 2015-03-01
    

    There are many solutions, so:

    Timings (in pandas 1.2.3):

    rng = pd.date_range('1980-04-01 15:41:12', periods=100000, freq='20H')
    df = pd.DataFrame({'purchase_date': rng})  
    print (df.head())
    
    
    
    In [70]: %timeit df['purchase_date'].to_numpy().astype('datetime64[M]')
    8.6 ms ± 27.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
    
    In [71]: %timeit df['purchase_date'].dt.floor('d') + pd.offsets.MonthEnd(n=0) - pd.offsets.MonthBegin(n=1)
    23 ms ± 130 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
    
    In [72]: %timeit (df['purchase_date'] + pd.offsets.MonthEnd(0) - pd.offsets.MonthBegin(1)).dt.floor('d')
    23.6 ms ± 97.9 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
    
    In [73]: %timeit df['purchase_date'].dt.to_period('M')
    9.25 ms ± 215 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
    
    In [74]: %timeit df['purchase_date'].dt.to_period('M').dt.to_timestamp()
    17.6 ms ± 485 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
    
    
    In [76]: %timeit df['purchase_date'] + pd.offsets.MonthEnd(0) - pd.offsets.MonthBegin(normalize=True)
    23.1 ms ± 116 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
    
    In [77]: %timeit df['purchase_date'].dt.normalize().map(MonthBegin().rollback)
    1.66 s ± 7.16 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)