I have a column full of julian days. I want to convert them to date format.
df1.shape()
(638765, 94)
df1['DAY'] =
0 2022216 # Format = year (2022),day of the year (216)
1 2022216
2 2022216
3 2022216
4 2022216
from datetime import datetime
Solution-1:
%timeit df1['Date'] = df1['DAY'].apply(lambda x: datetime.strptime('%d'%x,'%Y%j').strftime('%Y-%m-%d'))
11.9 s ± 50.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
0 2022-08-04
1 2022-08-04
2 2022-08-04
3 2022-08-04
4 2022-08-04
Solution-2:
%timeit df1['Date'] = pd.to_datetime(df1['DAY'], format='%Y%j')
20.3 ms ± 243 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
0 2022-08-04
1 2022-08-04
2 2022-08-04
3 2022-08-04
4 2022-08-04
I liked the above solution-2 very much, since it takes less than a second and other one takes around 11 s. My question: is this default behavior for the to_datetime
to convert a given julian day to the date format ('%Y-%m-%d') even though I did not specify it?
That is just a coincidence.
In your 1st solution, you strftime
to the "%Y-%m-%d"
format and it happened that the function to_datetime
(that you use in your 2nd solution) returns a DatetimeIndex
with this same format. And that's because "%Y-%m-%d"
is the default format of the latter.
Under the hood, the conversion of your julian days is handled (at the very end of the processing) by _box_as_indexlike
and this one converts a numpy array that holds the parsed datetimes to a DatetimeIndex
:
#1 the input :
array([2022216, 2022216, 2022216, 2022216, 2022216], dtype=object)
#2 `array_stptime` gives :
array(['2022-08-04T00:00:00.000000000', '2022-08-04T00:00:00.000000000',
'2022-08-04T00:00:00.000000000', '2022-08-04T00:00:00.000000000',
'2022-08-04T00:00:00.000000000'], dtype='datetime64[ns]')
#3 `_box_as_indexlike` gives :
DatetimeIndex(['2022-08-04', '2022-08-04', '2022-08-04', '2022-08-04',
'2022-08-04'], dtype='datetime64[ns]', freq=None)