np.where is great with pandas since it's a vectorized way to change a column based on a condition. But while it works well with native NumPy types, it doesn't play nice with dates.
This works great:
>>> df1 = pd.DataFrame([["a", 1], ["b", np.nan]], columns=["name", "num"])
>>> df1
  name  num
0    a  1.0
1    b  NaN
>>> np.where(df1["num"] < 2, df1["num"], np.nan)
array([ 1., nan])
But this doesn't:
>>> df2 = pd.DataFrame([["a", datetime.datetime(2024,1,2)], ["b", np.nan]], columns=["name", "date"])
>>> df2
  name       date
0    a 2024-01-02
1    b        NaT
>>> np.where(df2["date"] < datetime.datetime(2024,3,1), df2["date"], np.nan)
Traceback (most recent call last):
File "<python-input-10>", line 1, in <module>
np.where(df2["date"] < datetime.datetime(2024,3,1), df2["date"], np.nan)
~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
numpy.exceptions.DTypePromotionError: The DType <class 'numpy.dtypes.DateTime64DType'> could not be promoted by <class 'numpy.dtypes._PyFloatDType'>. This means that no common DType exists for the given inputs. For example they cannot be stored in a single array unless the dtype is `object`. The full list of DTypes is: (<class 'numpy.dtypes.DateTime64DType'>, <class 'numpy.dtypes._PyFloatDType'>)
What is the proper vectorized way to do the latter operation?
The best answer is to use Series.where:
df2['out'] = df2['date'].where(df2["date"] < datetime.datetime(2024,3,1))
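For reference, a minimal self-contained sketch of this approach (the sample frame and the out column name simply mirror the question). Series.where keeps the datetime64 dtype throughout, so there is no object-dtype round-trip:
import datetime
import numpy as np
import pandas as pd

df2 = pd.DataFrame([["a", datetime.datetime(2024,1,2)], ["b", np.nan]], columns=["name", "date"])
# Keep values where the condition is True; everything else becomes NaT,
# and the column stays datetime64[ns].
df2['out'] = df2['date'].where(df2["date"] < datetime.datetime(2024,3,1))
print(df2)
Output:
  name       date        out
0    a 2024-01-02 2024-01-02
1    b        NaT        NaT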
As a second-best answer, you can use NaT. Since numpy.where returns an array with a single dtype, you should not use NaN as the empty value but NaT:
np.where(df2["date"] < datetime.datetime(2024,3,1), df2["date"], pd.NaT)
Output: array([1704153600000000000, NaT], dtype=object)
Note that the output values are plain integers: because the result array has object dtype, the datetime64 values show up as their underlying nanosecond timestamps.
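As a quick sanity check, you can feed that raw integer back to pd.Timestamp, which interprets a bare integer as nanoseconds since the epoch:
pd.Timestamp(1704153600000000000)
Output: Timestamp('2024-01-02 00:00:00')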
If you want to assign the result back to a column, convert it with pd.to_datetime:
df2['out'] = pd.to_datetime(np.where(df2["date"] < datetime.datetime(2024,3,1),
df2["date"], pd.NaT))
Output:
  name       date        out
0    a 2024-01-02 2024-01-02
1    b        NaT        NaT
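To confirm that the pd.to_datetime round-trip restored a real datetime column (shown here with pandas' default nanosecond resolution):
df2['out'].dtype
Output: dtype('<M8[ns]')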