Does the unit passed to the datetime64 data type in pandas do anything?


Consider this code:

import pandas as pd 
v1 = pd.DataFrame({'Date':['2020-01-01']*1000}).astype({'Date':'datetime64'})
v2 = pd.DataFrame({'Date':['2020-01-01']*1000}).astype({'Date':'datetime64[ns]'})
v3 = pd.DataFrame({'Date':['2020-01-01']*1000}).astype({'Date':'datetime64[ms]'})
v4 = pd.DataFrame({'Date':['2020-01-01']*1000}).astype({'Date':'datetime64[s]'})
v5 = pd.DataFrame({'Date':['2020-01-01']*1000}).astype({'Date':'datetime64[h]'})
v6 = pd.DataFrame({'Date':['2020-01-01']*1000}).astype({'Date':'datetime64[D]'})
v7 = pd.DataFrame({'Date':['2020-01-01']*1000}).astype({'Date':'datetime64[M]'})
v8 = pd.DataFrame({'Date':['2020-01-01']*1000}).astype({'Date':'datetime64[Y]'})


for v in [v1,v2,v3,v4,v5,v6,v7,v8]:
    x = v.iloc[0,0]
    print(x, type(x), x.to_datetime64(), v.memory_usage()['Date'])

It returns:

2020-01-01 00:00:00 <class 'pandas._libs.tslibs.timestamps.Timestamp'> 2020-01-01T00:00:00.000000000 8000
2020-01-01 00:00:00 <class 'pandas._libs.tslibs.timestamps.Timestamp'> 2020-01-01T00:00:00.000000000 8000
2020-01-01 00:00:00 <class 'pandas._libs.tslibs.timestamps.Timestamp'> 2020-01-01T00:00:00.000000000 8000
2020-01-01 00:00:00 <class 'pandas._libs.tslibs.timestamps.Timestamp'> 2020-01-01T00:00:00.000000000 8000
2020-01-01 00:00:00 <class 'pandas._libs.tslibs.timestamps.Timestamp'> 2020-01-01T00:00:00.000000000 8000
2020-01-01 00:00:00 <class 'pandas._libs.tslibs.timestamps.Timestamp'> 2020-01-01T00:00:00.000000000 8000
2020-01-01 00:00:00 <class 'pandas._libs.tslibs.timestamps.Timestamp'> 2020-01-01T00:00:00.000000000 8000
2020-01-01 00:00:00 <class 'pandas._libs.tslibs.timestamps.Timestamp'> 2020-01-01T00:00:00.000000000 8000

Solution

  • First of all: The Pandas version of the datetime64 type only adds timezone support. Specifically, when you try to use a datetime64 variant in a Pandas series, it'll only support as (attosecond), fs (femtosecond), ps (picosecond) and ns (nanosecond) resolutions; anything less precise is replaced by datetime64[ns]. The datetime64[<res>, <tz>] variant only accepts s (seconds), ms (milliseconds), us (microseconds) and ns resolutions. Don't confuse these with the numpy datetime64 type.

    For both Pandas and Numpy, the unit abbreviation in square brackets determines the resolution used to record the timestamps, and because the type is always stored as 64 bits, it also determines the range of values you can store in it. It does not alter how much memory the type takes!

    From the numpy datetime64 Datetime Units documentation:

    Datetimes are always stored with an epoch of 1970-01-01T00:00. This means the supported dates are always a symmetric interval around the epoch, called “time span” in the table below.

    The length of the span is the range of a 64-bit integer times the length of the date or unit. For example, the time span for ‘W’ (week) is exactly 7 times longer than the time span for ‘D’ (day), and the time span for ‘D’ (day) is exactly 24 times longer than the time span for ‘h’ (hour).

    Your experiment won't show any difference in memory use, because the amount of memory doesn't change, only the resolution.

    Pandas wraps the numpy datetime64 type, and you can't actually create a series with anything other than datetime64[ns]; e.g. the DatetimeIndex dtype parameter is documented as accepting either a numpy.dtype or DatetimeTZDtype or str, default None, but for numpy.dtype there is an additional restriction:

    Note that the only NumPy dtype allowed is ‘datetime64[ns]’.

    So to demonstrate the effect of different units, you'd have to use the numpy type directly:

    >>> import numpy as np
    >>> for unit in ('Y', 'M', 'W', 'D', 'h', 'm', 's', 'ms', 'us', 'ns'):   # ps, fs and as have too small a span
    ...     print(unit, np.array(["2021-02-27T12:24:17.524627869"], dtype=f"datetime64[{unit}]"))
    ...
    Y ['2021']
    M ['2021-02']
    W ['2021-02-25']
    D ['2021-02-27']
    h ['2021-02-27T12']
    m ['2021-02-27T12:24']
    s ['2021-02-27T12:24:17']
    ms ['2021-02-27T12:24:17.524']
    us ['2021-02-27T12:24:17.524627']
    ns ['2021-02-27T12:24:17.524627869']
    

    Note: The documentation for Pandas only ever talks about ns resolutions for the datetime64 types, and it appears from various issues on GitHub that while some of the codebase supports the other (finer) resolutions, that support is not reliable and not honoured consistently across the library. Your mileage may vary.
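
The claim that the unit changes the range but not the storage size can be checked against numpy directly. The following is a minimal sketch rather than anything taken from the numpy or Pandas documentation: unit_summary, UNIT_SECONDS and SECONDS_PER_YEAR are names made up for this illustration, and the year figures are rough approximations that assume an average Gregorian year.

```python
import numpy as np

INT64_MAX = np.iinfo(np.int64).max
SECONDS_PER_YEAR = 365.2425 * 24 * 60 * 60  # assumption: average Gregorian year

# Seconds per tick for a few fixed-length units (illustrative subset).
UNIT_SECONDS = {"s": 1.0, "ms": 1e-3, "us": 1e-6, "ns": 1e-9}


def unit_summary(unit):
    """Return (bytes per value, approximate years representable after 1970)."""
    itemsize = np.dtype(f"datetime64[{unit}]").itemsize
    years = INT64_MAX * UNIT_SECONDS[unit] / SECONDS_PER_YEAR
    return itemsize, years


for unit in UNIT_SECONDS:
    itemsize, years = unit_summary(unit)
    print(f"{unit:>2}: {itemsize} bytes, about {years:.3g} years either side of 1970")

# Casting to a coarser unit keeps the 8-byte storage but drops the finer digits.
stamp = np.datetime64("2021-02-27T12:24:17.524627869", "ns")
print(stamp.astype("datetime64[s]"))

# The latest timestamp a nanosecond-resolution value can hold.
print(np.datetime64(INT64_MAX, "ns"))
```

Every unit reports 8 bytes per value, which is why memory_usage() in the question is identical for all eight frames. Only the span differs: roughly 292 years either side of 1970 for ns, growing by a factor of 1000 with each coarser unit in the loop.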