I have a DataFrame of people, and I am trying to express each person's age as a decimal number relative to a specific date that I derive from the name of the CSV file. My code works until the dataset contains a date of birth that falls on a leap day.
For example, 2016-people.csv contains the following:
name,dob
John,1989-12-14
Luc,1985-01-10
Picard,1987-10-13
Patrick,1986-02-19
Swayze,1987-06-10
My code is as follows:
import re
import pandas as pd

# csv is the file name, e.g. '2016-people.csv'
df = pd.read_csv(csv, index_col=0, parse_dates=["dob"])
# build the reference date: the year from the file name, then September 15
date_string = re.split(r'\s|-', csv)[0] + '0915'
d_date = pd.to_datetime(int(date_string), format='%Y%m%d', errors='coerce')
# whole years completed as of d_date
df['age'] = df['dob'].apply(
    lambda x: (d_date.year - x.year -
               ((d_date.month, d_date.day) < (x.month, x.day)))
)
# fraction of a year passed since the last birthday
df['fractional_age'] = df['dob'].apply(
    lambda x: (d_date - pd.to_datetime(x.replace(year=d_date.year))).days / 365.25
    if (x.month, x.day) <= (d_date.month, d_date.day)
    else (d_date - pd.to_datetime(x.replace(year=d_date.year - 1))).days / 365.25
)
df['age'] = df['age'] + df['fractional_age']
df = df.drop(columns=['fractional_age'])
This works, and I get the following:
name,dob,age
John,1989-12-14,26.7556468172484
Luc,1985-01-10,31.6817248459958
Picard,1987-10-13,28.9253935660506
Patrick,1986-02-19,30.5722108145106
Swayze,1987-06-10,29.2655715263518
However, if I pass in a CSV that contains a dob of February 29 (which only exists in a leap year), the function raises the following ValueError:
File "...\addage.py", line 22, in add_age
df['fractional_age'] = df['dob'].apply(
^^^^^^^^^^^^^^^^
File "...\AppData\Roaming\Python\Python311\site-packages\pandas\core\series.py", line 4630, in apply
return SeriesApply(self, func, convert_dtype, args, kwargs).apply()
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "...\AppData\Roaming\Python\Python311\site-packages\pandas\core\apply.py", line 1025, in apply
return self.apply_standard()
^^^^^^^^^^^^^^^^^^^^^
File "...\AppData\Roaming\Python\Python311\site-packages\pandas\core\apply.py", line 1076, in apply_standard
mapped = lib.map_infer(
^^^^^^^^^^^^^^
File "pandas\_libs\lib.pyx", line 2834, in pandas._libs.lib.map_infer
File "...\addage.py", line 23, in <lambda>
lambda x: (d_date - pd.to_datetime(x.replace(year=d_date.year))).days / 365.25
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "pandas\_libs\tslibs\timestamps.pyx", line 2279, in pandas._libs.tslibs.timestamps.Timestamp.replace
File "pandas\_libs\tslibs\timestamps.pyx", line 155, in pandas._libs.tslibs.timestamps.create_timestamp_from_ts
ValueError: day is out of range for month
How do I resolve this?
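The failure can be reproduced in isolation. The traceback points at `x.replace(year=...)`, which keeps the month and day and only swaps the year, so a February 29 birthday has no valid counterpart in a non-leap year. The sketch below uses the standard library's `datetime.date` so that it runs without pandas; `Timestamp` is a `datetime` subclass and, as the traceback shows, its `replace` fails the same way. The birthday and the years are made-up values for illustration.

```python
from datetime import date

dob = date(1988, 2, 29)  # hypothetical leap-day birthday

# 2016 is a leap year, so February 29 exists and replace() succeeds.
in_leap_year = dob.replace(year=2016)

# 2017 is not a leap year, so the same call raises ValueError.
try:
    dob.replace(year=2017)
    error_message = None
except ValueError as exc:
    error_message = str(exc)

print(in_leap_year, error_message)
```

This also explains why the error depends on the file: it only appears when the year taken from the file name (or the year before it, for birthdays after September 15) is not a leap year.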
I found a solution that works well enough for me. Instead of replacing the year in each dob, I take the difference between d_date and dob, convert it to seconds, and divide by the number of seconds in an average 365.25-day year:
df['age'] = (d_date - df['dob']).dt.total_seconds() / (60*60*24*365.25)
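Because the one-liner divides by an average year length, the result can be off by a small fraction around the birthday itself. If the age needs to tick over exactly on the birthday, one possible alternative is sketched below. It is not the approach from the post: `birthday_in` and `age_on` are hypothetical helper names, the choice to treat February 28 as the birthday in non-leap years is an assumption (some conventions use March 1), and the fraction is divided by the actual length of the current birthday year rather than 365.25.

```python
from datetime import date

def birthday_in(year, month, day):
    """Return the birthday in `year`; Feb 29 falls back to Feb 28 (assumed convention)."""
    try:
        return date(year, month, day)
    except ValueError:
        # For a valid dob, only Feb 29 in a non-leap year can end up here.
        return date(year, 2, 28)

def age_on(dob, ref):
    """Whole years completed at `ref` plus the elapsed share of the current birthday year."""
    ref = date(ref.year, ref.month, ref.day)
    last = birthday_in(ref.year, dob.month, dob.day)
    if last > ref:
        # This year's birthday has not happened yet, so use last year's.
        last = birthday_in(ref.year - 1, dob.month, dob.day)
    nxt = birthday_in(last.year + 1, dob.month, dob.day)
    whole_years = last.year - dob.year
    return whole_years + (ref - last).days / (nxt - last).days

leap_age = age_on(date(1988, 2, 29), date(2017, 9, 15))  # made-up leap-day dob
john_age = age_on(date(1989, 12, 14), date(2016, 9, 15))  # John from the sample CSV
print(leap_age, john_age)
```

With the DataFrame from the question, this could be applied as `df['dob'].apply(lambda x: age_on(x, d_date))`, since a pandas Timestamp exposes the same `year`, `month`, and `day` attributes. The values will differ slightly from the 365.25-based output shown earlier because the denominator is 365 or 366 depending on the year.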