python pandas dataframe datetime julian-date

Python Pandas Dataframe convert julian day column to date column

I have a column full of julian days. I want to convert them to date format.

df1.shape()
(638765, 94)

df1['DAY'] = 
0         2022216 # Format = year (2022),day of the year (216)
1         2022216
2         2022216
3         2022216
4         2022216

from datetime import datetime

Solution-1:

%timeit df1['Date'] = df1['DAY'].apply(lambda x: datetime.strptime('%d'%x,'%Y%j').strftime('%Y-%m-%d'))
11.9 s ± 50.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
0        2022-08-04
1        2022-08-04
2        2022-08-04
3        2022-08-04
4        2022-08-04

Solution-2:

%timeit df1['Date'] = pd.to_datetime(df1['DAY'], format='%Y%j')

20.3 ms ± 243 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
0        2022-08-04
1        2022-08-04
2        2022-08-04
3        2022-08-04
4        2022-08-04

I liked the above solution-2 very much, since it takes less than a second and other one takes around 11 s. My question: is this default behavior for the to_datetime to convert a given julian day to the date format ('%Y-%m-%d') even though I did not specify it?

Solution

That is just a coincidence.

In your 1st solution, you strftime to the "%Y-%m-%d" format and it happened that the function to_datetime (that you use in your 2nd solution) returns a DatetimeIndex with this same format. And that's because "%Y-%m-%d" is the default format of the latter.

Under the hood, the conversion of your julian days is handled (at the very end of the processing) by _box_as_indexlike and this one converts a numpy array that holds the parsed datetimes to a DatetimeIndex :

#1 the input :
array([2022216, 2022216, 2022216, 2022216, 2022216], dtype=object)

#2 `array_stptime` gives :
array(['2022-08-04T00:00:00.000000000', '2022-08-04T00:00:00.000000000',
       '2022-08-04T00:00:00.000000000', '2022-08-04T00:00:00.000000000',
       '2022-08-04T00:00:00.000000000'], dtype='datetime64[ns]')

#3 `_box_as_indexlike` gives :
DatetimeIndex(['2022-08-04', '2022-08-04', '2022-08-04', '2022-08-04',
               '2022-08-04'], dtype='datetime64[ns]', freq=None)