pythonpandas

WIth pandas, how do I use np.where with nullable datetime colums?


np.where is great for pandas since it's a vectorized way to change the column based on a condition.

But while it seems to work great with np native types, it doesn't play nice with dates.

This works great:

>>> df1 = pd.DataFrame([["a", 1], ["b", np.nan]], columns=["name", "num"])
>>> df1
  name  num
0    a  1.0
1    b  NaN
>>> np.where(df1["num"] < 2, df1["num"], np.nan)
array([ 1., nan])

But this doesn't:

>>> df2 = pd.DataFrame([["a", datetime.datetime(2024,1,2)], ["b", np.nan]], columns=["name", "date"])
>>> df2
  name       date
0    a 2024-01-02
1    b        NaT
>>> np.where(df2["date"] < datetime.datetime(2024,3,1), df2["date"], np.nan)
Traceback (most recent call last):
  File "<python-input-10>", line 1, in <module>
    np.where(df2["date"] < datetime.datetime(2024,3,1), df2["date"], np.nan)
    ~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
numpy.exceptions.DTypePromotionError: The DType <class 'numpy.dtypes.DateTime64DType'> could not be promoted by <class 'numpy.dtypes._PyFloatDType'>. This means that no common DType exists for the given inputs. For example they cannot be stored in a single array unless the dtype is `object`. The full list of DTypes is: (<class 'numpy.dtypes.DateTime64DType'>, <class 'numpy.dtypes._PyFloatDType'>)
>>> 

What is the proper vectorized way to do the latter operation?


Solution

  • The best answer is to use Series.where:

    df2['out'] = df2['date'].where(df2["date"] < datetime.datetime(2024,3,1))
    

    As a second best answer, you can use NaT.

    numpy.where returns an array with a single dtype, you should not use NaN as an empty value but NaT:

    np.where(df2["date"] < datetime.datetime(2024,3,1), df2["date"], pd.NaT)
    

    Output: array([1704153600000000000, NaT], dtype=object)

    Note the the output values are integers (datetimes are integers).

    If you want to assign a column:

    df2['out'] = pd.to_datetime(np.where(df2["date"] < datetime.datetime(2024,3,1),
                                         df2["date"], pd.NaT))
    

    Output:

      name       date        out
    0    a 2024-01-02 2024-01-02
    1    b        NaT        NaT