python pandas dataframe

DataFrame with all NaT should be timedelta and not datetime


I have a DataFrame with a column min_latency, which represents the minimum latency achieved by a predictor. If the predictor failed, there's no value, and therefore it returns min_latency=pd.NaT.

The dataframe is created from a dict, and if and only if all the rows have a pd.NaT value, the resulting column will have a datetime64[ns] dtype. It seems impossible to convert it to timedelta.

import pandas as pd

df = pd.DataFrame([{'id': i, 'min_latency': pd.NaT} for i in range(10)])
print(df['min_latency'].dtype)  # datetime64[ns]
df['min_latency'].astype('timedelta64[ns]')  # TypeError: Cannot cast DatetimeArray to dtype timedelta64[ns]

This problem doesn't happen if there's some timedelta in there:

import datetime as dt

df = pd.DataFrame([{'id': i, 'min_latency': pd.NaT} for i in range(10)] + [{'id': -1, 'min_latency': dt.timedelta(seconds=3)}])
print(df['min_latency'].dtype)  # timedelta64[ns]

Solution

  • Ideally, adjust the return value itself: use a timedelta-typed missing value, np.timedelta64('NaT', 'ns'), instead of pd.NaT, so that pandas infers timedelta64[ns] even when every row is missing.

    import numpy as np
    import pandas as pd
    
    # A timedelta-typed NaT leaves no ambiguity for dtype inference
    df = pd.DataFrame([{'id': i, 'min_latency': np.timedelta64('NaT', 'ns')}
                       for i in range(3)])
    

    Output:

    df['min_latency']
    
    0   NaT
    1   NaT
    2   NaT
    Name: min_latency, dtype: timedelta64[ns]
    

  • If that is not an option, check the column with is_datetime64_dtype. If it returns True, use Series.values to get the column as a NumPy ndarray and then cast it with np.ndarray.astype:

    import pandas as pd
    from pandas.api.types import is_datetime64_dtype
    
    df = pd.DataFrame([{'id': i, 'min_latency': pd.NaT}
                       for i in range(3)])
    
    # Cast on the underlying ndarray; Series.astype refuses datetime -> timedelta
    if is_datetime64_dtype(df['min_latency']):
        df['min_latency'] = df['min_latency'].values.astype('timedelta64[ns]')
    

    Output:

    df['min_latency']
    
    0   NaT
    1   NaT
    2   NaT
    Name: min_latency, dtype: timedelta64[ns]
    

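  • If you want to rely solely on pandas, you first need to turn the values of df['min_latency'] into something that can be interpreted as a duration. For example, Series.dt.nanosecond returns NaN for every NaT, and pd.to_timedelta then converts those NaN values to timedelta NaT. Note that this only makes sense when the column holds nothing but NaT: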

    if is_datetime64_dtype(df['min_latency']):
        df['min_latency'] = pd.to_timedelta(df['min_latency'].dt.nanosecond, 
                                            unit='ns')
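
If you need this in more than one place, the second approach could be wrapped in a small helper. This is only a sketch: the function name coerce_all_nat_to_timedelta is made up for illustration, and the extra isna().all() check is an assumption on my part, added so that a column holding real timestamps is never reinterpreted as durations.

```python
import pandas as pd
from pandas.api.types import is_datetime64_dtype


def coerce_all_nat_to_timedelta(df, column):
    """Return a copy of df in which `column` is timedelta64[ns]
    if it was inferred as datetime and contains only NaT."""
    out = df.copy()
    col = out[column]
    # Only touch columns that are datetime-typed AND hold no real timestamps
    if is_datetime64_dtype(col) and col.isna().all():
        # Cast on the underlying ndarray, as in the second approach above
        out[column] = col.values.astype('timedelta64[ns]')
    return out


df = pd.DataFrame([{'id': i, 'min_latency': pd.NaT} for i in range(3)])
fixed = coerce_all_nat_to_timedelta(df, 'min_latency')
print(fixed['min_latency'].dtype)
```

The helper returns a copy instead of modifying df in place, so the original frame stays as it was. Columns that already have a timedelta dtype, or that contain at least one real datetime, pass through unchanged.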